BigQuery is a cloud-based data warehouse platform that is fully integrated into the Google Cloud Platform. BigQuery supports querying through standard SQL.
For flows that have both the source and output in BigQuery, the Trifacta application can execute the job and necessary transformation steps directly in BigQuery. By doing so, no data needs to be transferred out of the data warehouse, and the performance should be significantly better.
In addition to BigQuery sources, you can execute jobs in BigQuery on source files from Base Storage.
By default, the Full execution for GCS file option is enabled for new flows. For more information, see Flow Optimization Settings Dialog.
For more information, see Google Cloud Storage Access.
BigQuery is not a running environment that you explicitly select or specify as part of a job. If all of the requirements are met, then the job is executed in BigQuery when you select Dataflow. For more information, see Overview of Job Execution.
Transferring the execution steps from the batch execution to BigQuery provides the following benefits:
Data transfer between systems (Dataprep, Dataflow, GCS, etc.) is minimized. Only recipe and associated metadata is transferred, but everything else remains in BigQuery. When running a job fully in BigQuery, your data never leaves BigQuery
Recipe steps are converted into SQL that is understandable and native to BigQuery. Execution times can be much faster.
Depending on your environment, the total cost of executing the job may be lower in BigQuery.
For jobs that are executed in BigQuery, you can optionally enable the execution of the visual profile in BigQuery, too. This option is enabled for individual flows. For more information, see Flow Optimization Settings Dialog.
A project owner must enable the following features in the project:
Logical and physical optimization of jobs
For more information, see Dataprep Project Settings Page.
For individual flows, all general and BigQuery optimizations must be enabled. For more information, see Flow Optimization Settings Dialog.
When executing with BigQuery, recipe steps in Trifacta are converted to SQL, executed on the source data and written into temporary tables, before being output to the defined output table.
You will see a notification on your Run Job page indicating if BigQuery execution is enabled and is being used.
If all recipe steps and Wrangle are translatable to SQL, then Full Execution on BigQuery mode is used.
If some steps are not translatable to SQL, then only the translatable steps are executed on BigQuery, while the rest are executed on Dataflow in a Hybrid Execution mode.
For jobs executed in BigQuery, profiling jobs may also be executed in BigQuery.
An Optimization summary (enabled/disabled) for the job can be viewed on the Job Details page in the right panel.
You can also view the output and executed statements directly on BigQuery by clicking on the button 'View on BigQuery' under Job Details.
Beginning v8.11, data sources from BigQuery support sampling via BigQuery execution. BigQuery sampling is limited to full scan samples, as quick scan uses the Trifacta Photon execution.
The following sampling types are supported for execution in BigQuery:
Note: Flows with GCS sources and BigQuery targets can use BigQuery execution for job execution, but cannot use BigQuery execution for sampling.
Note that some transformations and functions are not currently supported for execution in BigQuery. Upserts are also not supported for full execution in BigQuery. For more information on limitations and the unsupported transformations, read this documentation guide.
Beginning v8.9, imported datasets created with customer SQL are supported for execution in the BigQuery running environment. For more information, see BigQuery Running Environment.
Full execution for GCS files
When enabled, jobs that are sourced from files on Google Cloud Storage can be executed in BigQuery when all of the requirements for BigQuery execution are met. By default, this setting is enabled.