A brief overview of sampling

When a dataset is first created, a background job generates an initial sample from the first rows of the dataset. This initial sample is usually quick to generate, so you can start working on your transformations right away. By default, each sample is 10 MB in size, or the entire dataset if it is smaller than that.
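
As a rough mental model only (this is not the product's implementation; the function name, file name, and byte accounting are illustrative assumptions), a first-rows sample can be pictured as reading from the top of the file until a byte budget is spent:

```python
SAMPLE_BYTE_LIMIT = 10 * 1024 * 1024  # 10 MB default sample size

def first_rows_sample(path, byte_limit=SAMPLE_BYTE_LIMIT):
    """Collect rows from the top of the file until the byte budget is
    spent, or the whole file if it is smaller than the budget."""
    rows, used = [], 0
    with open(path, "rb") as f:
        for line in f:
            used += len(line)
            if rows and used > byte_limit:
                break  # budget exhausted; keep what was collected
            rows.append(line.decode("utf-8", errors="replace").rstrip("\n"))
    return rows

sample = first_rows_sample("orders.csv")  # "orders.csv" is a hypothetical file
```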

Additional samples can be generated from the context panel on the right side of the Transformer page. As you develop your recipe, you might need to take new samples of the data. Through the Transformer page, you can specify the type of sample you wish to create and initiate the job that creates it. Each sampling job is an independent execution and runs in the background.

Recipe logic and sampling

When a sample is generated from the Samples panel, the sampling job runs against all recipe steps leading up to the current location in the recipe. For example, if your recipe joins in other datasets, those join steps are executed, and the resulting sample depends on those other datasets. As a result, changing a recipe step that occurs before the point where the sample was generated can invalidate the sample.
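
One way to picture this dependency (a sketch only; the hashing scheme and the step strings below are invented for illustration, not how the product tracks sample validity) is to fingerprint the steps that precede the sample point and treat the sample as stale whenever that fingerprint changes:

```python
import hashlib

def recipe_fingerprint(steps, sample_position):
    """Hash the steps that precede the sample point; if any of them
    change, the fingerprint changes and the sample is stale."""
    upstream = steps[:sample_position]
    return hashlib.sha256("\n".join(upstream).encode()).hexdigest()

steps = [
    "split col: raw on: ','",
    "join with: customers on: id",    # upstream of the sample point
    "filter rows where: total > 0",   # downstream of the sample point
]
taken_at = recipe_fingerprint(steps, sample_position=2)

steps[1] = "join with: customers_v2 on: id"   # edit an upstream step
stale = recipe_fingerprint(steps, sample_position=2) != taken_at  # True
```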

Sample methodologies

There are six sampling methodologies:

  • First rows (initial sample)

  • Random

  • Filter-based

  • Anomaly-based

  • Stratified

  • Cluster-based

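To make two of these concrete (a minimal sketch assuming in-memory rows; the column name, counts, and function names are hypothetical, and the product's own algorithms are not described in this article), random sampling draws uniformly, while stratified sampling guarantees that rare values appear:

```python
import random
from collections import defaultdict

def random_sample(rows, k):
    """Draw k rows uniformly at random."""
    return random.sample(rows, min(k, len(rows)))

def stratified_sample(rows, key, per_stratum):
    """Draw up to per_stratum rows from each distinct value of `key`,
    so rare values show up alongside common ones."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    picked = []
    for group in strata.values():
        picked.extend(random.sample(group, min(per_stratum, len(group))))
    return picked

rows = [{"country": c, "amount": i}
        for i, c in enumerate(["US"] * 90 + ["NZ"] * 8 + ["FJ"] * 2)]
print(len(random_sample(rows, 10)))                # 10 rows, likely mostly "US"
print(len(stratified_sample(rows, "country", 3)))  # at most 3 rows per country -> 8
```
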
The availability of certain sample methodologies depends on the sample type, described in the next section.

Sample types

There are two types of sampling scans: quick and full. A quick scan reads only the first 2 GB of the data and samples from that limited set. A full scan samples from the entire dataset.
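
As an illustration of the difference (a sketch under stated assumptions: a line-delimited file, a uniform random draw via reservoir sampling, and an invented function name; the product's scan internals are not described here), a streaming sampler could honor the two scan types like this:

```python
import random

QUICK_SCAN_LIMIT = 2 * 1024**3  # a quick scan considers only the first 2 GB

def scan_and_sample(path, k, full_scan=False):
    """Reservoir-sample k rows. A quick scan stops reading at the 2 GB
    boundary; a full scan streams the entire file."""
    reservoir, seen, used = [], 0, 0
    with open(path, "rb") as f:
        for line in f:
            used += len(line)
            if not full_scan and used > QUICK_SCAN_LIMIT:
                break  # quick scan: ignore everything past the limit
            seen += 1
            if len(reservoir) < k:
                reservoir.append(line)
            else:
                j = random.randrange(seen)  # algorithm R: replace with prob k/seen
                if j < k:
                    reservoir[j] = line
    return reservoir
```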
