This example flow prepares event log data for publishing to a data warehouse. It relies mainly on cleansing and structuring transformations, and it is best suited for people building data pipelines with Trifacta.
How to use example flows:
To get the most out of example flows, we recommend reading through the descriptions and comments in every node and recipe to understand how to solve your use case with our tool. Once you're done, you can either use the flow as a starting point for your own development or share it with other people in your workspace.
Getting to know the dataset
The dataset in this flow (user_events.csv) contains an event stream of different event types, with their attributes all nested in a JSON object under an events property. This is the kind of data you would get from analytics tools like Amplitude or Mixpanel.
This dataset contains several empty columns. To see the column profiles more easily, use CMD + G. This shortcut toggles between the list and grid views.
Flattening JSON objects
Since the data contains JSON objects, the first step is to flatten the objects and separate their properties into new columns.
To flatten JSON objects, all you have to do is select all the different values in the histogram, and Trifacta will immediately suggest flattening the data. Once you select that option, it adds a recipe step that flattens the object and creates a set of new columns for you.
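For readers who want to see what flattening does to a single row, here is a minimal Python sketch of the idea. It is not how Trifacta implements the step. The `user_id` and `device` fields, the helper name `flatten_events`, and the `events.<key>` naming of the new columns are assumptions made for illustration; only the `events` property and `event_category` come from this flow.

```python
import json


def flatten_events(row, column="events"):
    """Return a copy of `row` with the JSON object in `column` spread into new keys.

    Hypothetical helper: the "<column>.<key>" naming is an assumption,
    not necessarily the column names Trifacta generates.
    """
    # Keep every column except the nested one.
    flat = {key: value for key, value in row.items() if key != column}
    raw = row.get(column)
    # An empty or absent cell contributes no new columns.
    nested = json.loads(raw) if raw else {}
    for key, value in nested.items():
        flat[f"{column}.{key}"] = value
    return flat


# Illustrative row; the field values are made up for this sketch.
row = {"user_id": "u1", "events": '{"event_category": "signup", "device": "ios"}'}
flat_row = flatten_events(row)
```

Each property of the nested object ends up as its own column, which is the shape the later recipes in this flow work with.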
Basic data cleansing - renaming and dropping unnecessary columns
In the second recipe we rename a couple of the
Filtering missing data
The third node, called "Filtering rows", uses a common transformation called filter, which lets you define rules about which rows to keep or delete. In this case, we're also using the ISMISSING() function, which evaluates to a boolean value and will perform the transformation when it evaluates to
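The same kind of rule can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names are hypothetical, treating `None` and empty strings as missing is a simplification of ISMISSING(), and the choice to delete rows whose `event_category` is missing is assumed for the example rather than taken from the recipe.

```python
def is_missing(value):
    """Rough stand-in for a missing-value check: None or an empty string (assumption)."""
    return value is None or value == ""


def drop_rows_missing(rows, column):
    """Keep only the rows where `column` holds a non-missing value."""
    return [row for row in rows if not is_missing(row.get(column))]


# Illustrative rows; the values are made up for this sketch.
rows = [
    {"user_id": "u1", "event_category": "signup"},
    {"user_id": "u2", "event_category": ""},
    {"user_id": "u3", "event_category": None},
    {"user_id": "u4", "event_category": "purchase"},
]
kept = drop_rows_missing(rows, "event_category")
```

The boolean result of the check decides, row by row, whether the row survives the filter.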
Preparing to publish
In the recipe called "Grouping and counting" we use a transformation called group by. Similar to a pivot in Excel, this transformation lets you define which columns you would like to group by (in this case we went with event_category) and also lets you add aggregate functions.
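As a rough equivalent outside Trifacta, grouping by `event_category` and counting rows can be sketched in Python with `collections.Counter`. The helper name `count_by` and the sample rows are assumptions for illustration, and a row count is just one example of an aggregate function.

```python
from collections import Counter


def count_by(rows, column):
    """Group rows by the value in `column` and count the rows in each group."""
    return Counter(row[column] for row in rows)


# Illustrative rows; the categories are made up for this sketch.
rows = [
    {"event_category": "signup"},
    {"event_category": "purchase"},
    {"event_category": "signup"},
]
counts = count_by(rows, "event_category")
```

The result has one entry per distinct `event_category` value, which is the one-row-per-group shape a group by produces.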