A schema defines the sequence and data type of the columns in a dataset. When working with data in relational tables or conventional file formats, you often deal with varying schemas and changing datasets. Over time, schema sources can change without warning, ranging from minor differences to major overhauls.
From within the Trifacta application, schema changes may appear as broken recipe steps and can cause data corruption downstream. To help address these issues, the Trifacta application can be configured to monitor schema changes on your datasets. This feature allows users to identify data sources whose schema has changed recently and, optionally, to fail the job in those cases.
On initial load, the schema information from the dataset is captured and stored separately in the Trifacta database. This information identifies the column names, data types, and column order of the dataset.
When the dataset is read during job execution, the schema information is read again and compared to the stored version.
Schema validation detects if columns are added, removed, or moved.
You can configure the Trifacta application to halt job execution when schema validation issues have been encountered.
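The comparison described above can be sketched in a few lines. This is a minimal illustration of the idea, not Trifacta code: the function name, the `(name, type)` tuple representation, and the type-change finding (plausible because data types are stored, though the text above only names added, removed, and moved columns) are all assumptions.

```python
def validate_schema(stored, current):
    """Compare a stored schema to the one read at job execution.

    Each schema is an ordered list of (column_name, data_type) tuples.
    Returns a list of (column, finding) pairs describing columns that
    were added, removed, moved, or (assumed here) retyped.
    """
    # Index each schema by column name, remembering position and type.
    stored_cols = {name: (i, dtype) for i, (name, dtype) in enumerate(stored)}
    current_cols = {name: (i, dtype) for i, (name, dtype) in enumerate(current)}

    findings = []
    for name in stored_cols:
        if name not in current_cols:
            findings.append((name, "column removed"))
    for name, (pos, dtype) in current_cols.items():
        if name not in stored_cols:
            findings.append((name, "column added"))
            continue
        old_pos, old_dtype = stored_cols[name]
        if dtype != old_dtype:
            findings.append((name, f"type changed from {old_dtype} to {dtype}"))
        if pos != old_pos:
            findings.append((name, f"moved from position {old_pos} to {pos}"))
    return findings
```

For example, comparing a stored schema of `[("id", "Integer"), ("name", "String")]` against a current schema where the columns are swapped and an `email` column is appended would yield an "added" finding for `email` and "moved" findings for the other two columns.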
Supported Sources
Relational sources (JDBC and BigQuery)
Schematized files (Avro and Parquet)
Delimited files (CSV, TSV, etc.)
Settings on Run Job Page
Validate Schema: When checked, the schemas of the data sources for this job are checked for any changes since the last time that the datasets were loaded. Differences are reported in the Job Details page as a Schema validation stage.
When schema validation is enabled and a job is launched, the schema validation check is performed in parallel with the data ingestion step. The results of the schema validation check are reported on the Job Details page in the Schema validation stage.
If no errors are detected, the job completes as normal. If schema validation detects differences, you can explore those findings in detail in the Job Details page. Click View changes to see a detailed table including:
Column: Name of the column in the dataset.
Finding: Description of the change between the column in the stored schema and the column in the schema read during this job execution.
Use the tabs at the top of the screen to filter the list of findings.
For more information, see Schema Changes Dialog.
Fail job if dataset schemas change: When Validate Schema is enabled, check this flag to automatically fail the job if there are differences between the stored schemas for your datasets and the schemas that are detected when the job is launched.
If this setting is not checked, jobs will complete with warnings and publish output data to targets, even if schema validation detects changes.
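The interaction between the two settings can be summarized with a short sketch. This is purely hypothetical pseudologic for the behavior described above; the function, exception, and status strings are not Trifacta internals.

```python
class SchemaChangedError(Exception):
    """Hypothetical error raised when the fail-on-change flag is set."""


def finish_job(findings, fail_on_schema_change):
    """Decide the job outcome from schema-validation findings.

    findings: list of (column, finding) pairs from schema validation.
    Returns a status string, or raises if the fail flag is set and
    differences were detected.
    """
    if findings and fail_on_schema_change:
        # "Fail job if dataset schemas change" is checked: abort the job.
        raise SchemaChangedError(f"{len(findings)} schema change(s) detected")
    if findings:
        # Flag unchecked: the job still completes and publishes output,
        # but the changes are reported as warnings.
        return "completed with warnings"
    return "completed"
```

In other words, with the flag unchecked, schema changes are informational only; with it checked, any detected change halts the job before output is published.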
The default for schema validation is set at the workspace level. The settings in the Run Job page override that default for individual jobs.
For more information, see Run Job Page.
To learn more about this feature, check out the following: