As you work with large amounts of data, it is imperative to ensure its accuracy and consistency. But the dataset that you are working with may be rife with inaccuracies, errors, and inconsistencies. The data that is semantically related or similar may be represented differently in your dataset. It may have inconsistencies in form of different spellings, slightly varying names, capitalizations, formatting and spacing etc.
To facilitate further analysis on this data, you would need to collate the clusters of similar data and standardize them to specific consistent values.
Data Standardization ensures that your data is internally consistent in its content and format. It is important to have standardized data for comparison and collective reporting.
This is where you can leverage the ‘Standardize’ functionality in Trifacta Wrangler.
What It does
Standardize helps you group the various references of the items in your column into its interpretation of meaningful clusters. This clustering is based on pattern-matching between values in the column.
You may then select the values you would like to standardize, enter the new normalized value, and replace that in your sample data set.
You can add this Standardization step in your Recipe to ensure your future data is standardized in similar fashion.
Standardize values to correct spelling differences, capitalization, whitespace, and other errors.
Values must be consistent across rows of the column.
Primarily used for string-based data types.
Through the Standardize page, you can review similar column values and standardize them to values that you specify.
Let us say you are an eCommerce platform trying to analyze orders placed within a time period for a certain category like ‘Men Tshirts’. But since the data entered in the portal for products is directly done by your vendors, there always are slight differences in the item metadata. You will be surprised how many ways the product title for the same product can be written – like Men’s TShirt / Mens TShirt / Mens T-Shirt / Men’s T-shirt
See the example data below. Notice the Product column-
Steps to Standardize your Product data
Select the Product column
Open the Column dropdown and Select ‘Standardize…’ option
Select the Standardize icon from the Transformer toolbar on top
2. This opens the Standardize Tool and shows you the default clusters of data values.
3. Select a Clustering Option. You can find the ‘Clustering Options’ icon on the top to select / change the Clustering type.
Clustering Options are different ways the platform will look for clusters and present them.
No Clustering will be applied
Clustering will be based on similarities between the text of each value
Clustering will be based on the phonetic pronunciations of the values
The default clustering mechanism is based on ‘Similar Strings’
As seen in the image above, Clustering based on ‘Similar Strings’ can take care of slight differences in spellings and capitalization mismatches between values and cluster them together for standardization.
Change the Clustering option to ‘Pronunciation’ to see the difference-
It captures all values related to Men T-Shirts that vary in their hyphenation and capitalization etc. This is because these values are picked on their phonetic pronunciation which is the same for all of these.
4. You can now select the values from the clusters that you wish to Standardize and provide the New value in the right panel. You can apply these changes by clicking on Apply
5. Do the same for all the clusters.
You will be shown a summary of your standardization changes on the bottom-right
At any time, you can revert the changes to the cluster values. Click Revert to source.
6. Add this step to your recipe
You can configure all your standardization values in the tool before adding the step to your recipe. But for debugging purposes, you can separate some or all standardization into separate steps in your Recipe.
Remember that standardization may not help if the data is bad or irrelevant. You do not want to spend time and effort in trying to normalize invalid values and later trying to make sense of them in your analysis and reporting!
Standardize feature assumes that your data is valid or empty and prepares the clusters based on this assumption. Before getting into the Standardization process, remember to remove the bad data and check for irrelevant data values.
To learn more about Standardization in Trifacta, read these guides-