Here are some best practices for organizing a flow for better scalability and maintainability.
For each dataset, define a ‘cleaning’ recipe
When creating a flow, you might import 5 or 10 datasets and create even more recipes. No matter how clean you may think the data is, it will more often than not require some additional cleaning and formatting before you use that dataset in more complex operations such as joining or grouping. A good practice is to always add a cleaning recipe as close as possible to the source, even if you think at first that you will not need it.
Adding a Cleaning recipe right out of the source
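The same idea can be sketched outside the tool. Below is a minimal pandas analogy (the column names and data are hypothetical, not from the original): the cleaning step lives in its own function, applied immediately after loading the source and before any join or aggregation logic.

```python
import pandas as pd
from io import StringIO

# Hypothetical raw source: messy headers, stray whitespace, string-typed ids.
raw = StringIO("Customer ID, Amount \n 001, 12.5\n002, 7.0\n")

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Cleaning 'recipe' kept right next to the source load."""
    # Normalize column names: trim spaces, lowercase, snake_case.
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Trim whitespace from identifiers and coerce amounts to numbers.
    df["customer_id"] = df["customer_id"].str.strip()
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    return df

source = pd.read_csv(raw, dtype={"Customer ID": str})
cleaned = clean(source)  # clean first, before any join or grouping
```

Keeping this step adjacent to the load means every downstream consumer starts from the same tidy shape.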
Split your recipes into multiple recipes ‘wisely’
A recipe can have as many steps as you want. A tempting approach is to create one big recipe with all the steps, but that may not be the wisest decision. If you want to future-proof your data preparation flows, group all the steps that share a common purpose inside the same recipe. This applies to your cleaning steps, but also to your enriching steps (like joining with an external dataset), your calculation steps, your aggregation steps, and others.
This facilitates unit testing, makes your flow more readable, and simplifies maintenance over time.
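To illustrate this grouping outside the tool, here is a small hypothetical sketch in plain Python: each "recipe" becomes one function that groups steps with a common purpose, rather than one monolithic script, so each intermediate output can be inspected and tested on its own.

```python
def clean(rows):
    # Cleaning recipe: trim, lowercase, drop empty records.
    return [r.strip().lower() for r in rows if r.strip()]

def enrich(rows, lookup):
    # Enriching recipe: e.g. join against an external reference table.
    return [(r, lookup.get(r, "unknown")) for r in rows]

def aggregate(pairs):
    # Aggregation recipe: count rows per category.
    counts = {}
    for _, category in pairs:
        counts[category] = counts.get(category, 0) + 1
    return counts

# Each stage's output can be checked independently, which is what
# makes unit testing and debugging the flow easier.
raw = [" Apple ", "banana", "", "APPLE"]
cleaned = clean(raw)
enriched = enrich(cleaned, {"apple": "fruit", "banana": "fruit"})
totals = aggregate(enriched)  # {'fruit': 3}
```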
Remember that after each recipe, you can output the result with a job. This is particularly useful for testing a specific output or refreshing a sample to verify some data. You can also create multiple recipes, which branch out your flow as follows:
Grouping recipes into logical tasks
Create recipes dedicated to joins
A join step can change the number of columns and rows in the recipe. For this reason, it is also a step that can be time-consuming to open. Keeping your join steps in separate recipes will save you time, as you will only load the join step when you really need it.
Create a recipe just for joining
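As a rough analogy in pandas (the table and column names are hypothetical), isolating the join in its own function mirrors a dedicated join recipe: the place where the row and column shape changes is confined to one step, which only runs when a downstream consumer actually needs it.

```python
import pandas as pd

# Hypothetical inputs: a cleaned fact table and a reference table.
orders = pd.DataFrame({"customer_id": ["001", "002", "002"],
                       "amount": [12.5, 7.0, 3.0]})
customers = pd.DataFrame({"customer_id": ["001", "002"],
                          "region": ["EU", "US"]})

def join_orders_to_customers(orders, customers):
    """Join isolated in its own 'recipe': the shape of the data
    changes here and nowhere else, so it is easy to audit."""
    return orders.merge(customers, on="customer_id", how="left")

# Downstream recipes invoke this step only when they need the joined view.
joined = join_orders_to_customers(orders, customers)
```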