
Adding publishing options for a job run

When you run a job, you must add at least one publishing destination to your output. These options include:

  • Choosing an output file or folder in GCS, S3, HDFS, ADLS, etc.

  • Choosing an output table in BigQuery, Redshift, Hive, or other external relational databases

For each destination, you can choose from a variety of options:

  • File output format - creating, appending to, or replacing a CSV, JSON, Avro, or Parquet file

  • Table output - adding a new table, appending to or replacing an existing table

  • Single file or multiple file output

  • Compression options - including BZIP, GZIP, and Snappy

  • Other options such as headers, delimiters, etc.
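
As a rough sketch, the choices above could be captured in settings objects like the following. The field names here are hypothetical, chosen only to illustrate the options; they are not the product's actual configuration schema:

```python
# Hypothetical publishing settings for one file destination.
# All field names are illustrative only; the product UI captures
# these choices, and its internal representation may differ.
file_destination = {
    "type": "file",
    "path": "gs://my-bucket/output/results",  # GCS, S3, HDFS, ADLS, ...
    "format": "csv",                          # csv | json | avro | parquet
    "action": "replace",                      # create | append | replace
    "single_file": True,                      # single vs. multiple part files
    "compression": "gzip",                    # none | bzip | gzip | snappy
    "include_headers": True,
    "delimiter": ",",
}

# Hypothetical settings for one table destination.
table_destination = {
    "type": "table",
    "connection": "bigquery",                 # BigQuery, Redshift, Hive, ...
    "table": "analytics.job_results",
    "action": "append",                       # create | append | truncate | drop
}
```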

Relational Table Settings

Select table options:

  1. Table name:

    1. New table: Enter a name for it. You may enter the name of a pre-existing table, in which case schema checks are performed against it.

    2. Existing table: You cannot modify the name.

  2. Output database: To change the database to which you are publishing, click the database icon in the sidebar. Select a different database.

  3. Publish actions: Select one of the following.

    1. Create new table every run: Each run generates a new table with a timestamp appended to the name.

    2. Append to this table every run: Each run adds any new results to the end of the table.

    3. Truncate the table every run: With each run, all data in the table is truncated and replaced with any new results.

    4. Drop the table every run: With each run, the table is dropped (deleted), and all data is deleted. A new table with the same name is created, and any new results are added to it.
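
The four actions differ only in how they treat any existing table. A minimal in-memory sketch of their semantics (not the product's implementation; the `publish` helper and its signature are invented for illustration):

```python
from datetime import datetime, timezone

def publish(tables, name, rows, action):
    """Apply one publish action to a dict of {table_name: list_of_rows}."""
    if action == "create_new":
        # Each run writes to a fresh table with a timestamp appended.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
        tables[f"{name}_{stamp}"] = list(rows)
    elif action == "append":
        # New results are added to the end of the existing table.
        tables.setdefault(name, []).extend(rows)
    elif action == "truncate":
        # Existing rows are removed; the table itself is kept.
        tables[name] = list(rows)
    elif action == "drop":
        # The table is dropped and recreated under the same name.
        tables.pop(name, None)
        tables[name] = list(rows)
    else:
        raise ValueError(f"unknown action: {action}")
    return tables
```

In this simplified sketch, truncate and drop produce the same result; in a real database they differ, because dropping also discards the table's schema, constraints, and permissions, which are then recreated.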

[Screenshot: an example of publishing to Redshift]

Publishing to a File System

Output destination settings when publishing to a file system:

  1. Create new file every run: Each run generates a new file with a timestamp appended to the name.

  2. Append to this file every run: Each run adds any new results to the end of the file.

  3. Replace the file every run: With each run, the file is deleted and replaced with new data.
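
The three file-system actions can be sketched with local files (a minimal illustration of the semantics, not the product's implementation; the `write_output` helper is invented here):

```python
import os
import tempfile
from datetime import datetime, timezone

def write_output(directory, basename, text, action):
    """Sketch of the three file-system publish actions."""
    path = os.path.join(directory, basename)
    if action == "create_new":
        # A fresh file per run, distinguished by a timestamp suffix.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
        root, ext = os.path.splitext(basename)
        path = os.path.join(directory, f"{root}_{stamp}{ext}")
        mode = "x"   # fail rather than overwrite an existing file
    elif action == "append":
        mode = "a"   # add to the end of the existing file
    elif action == "replace":
        mode = "w"   # discard previous contents and rewrite
    else:
        raise ValueError(f"unknown action: {action}")
    with open(path, mode) as f:
        f.write(text)
    return path

with tempfile.TemporaryDirectory() as d:
    p = write_output(d, "results.csv", "a,b\n", "replace")
    write_output(d, "results.csv", "1,2\n", "append")
    print(open(p).read())   # prints the two lines: a,b / 1,2
```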

[Screenshot: an example of publishing options for GCS]


Output Options

  • Include Headers as the first row on creation: For CSV outputs, you can choose to include the column headers as the first row in the output. For other formats, these headers are included automatically.

  • Include quotes: For CSV outputs, you can choose to include double quote marks around all values, including headers.

  • Include mismatched values: For CSV outputs, you can choose to include any value that is mismatched for its column data type. When disabled, mismatched values are written as null values.
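
These CSV options can be approximated with Python's csv module. This is an illustration of the semantics only; the `write_csv` function and the `Mismatched` marker class are invented here, not part of the product:

```python
import csv
import io

class Mismatched:
    """Marker for a value that does not match its column's data type."""
    def __init__(self, raw):
        self.raw = raw
    def __str__(self):
        return str(self.raw)

def write_csv(rows, header, include_headers=True, include_quotes=False,
              include_mismatched=True):
    """Sketch of the CSV output options described above."""
    buf = io.StringIO()
    # "Include quotes" puts double quotes around every value, headers included.
    quoting = csv.QUOTE_ALL if include_quotes else csv.QUOTE_MINIMAL
    writer = csv.writer(buf, quoting=quoting)
    if include_headers:
        writer.writerow(header)
    for row in rows:
        if not include_mismatched:
            # Mismatched values are written as nulls (empty fields here).
            row = ["" if isinstance(v, Mismatched) else v for v in row]
        writer.writerow(row)
    return buf.getvalue()
```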

For more details, see this detailed documentation guide.

Publishing to BigQuery

When publishing to a BigQuery table, you have all of the standard options for writing to a table. In addition, there is an option to merge the table, which allows you to update existing rows of data in the target table with corresponding values from your results and, optionally, to insert unmatched rows from your results into the table. When this option is selected, you specify a set of primary key columns and then decide how data is merged into the table.

Publishing Actions

Select one of the following:

  1. Create new table every run: Each run generates a new table with a timestamp appended to the name.

  2. Append to this table every run: Each run adds any new results to the end of the table.

  3. Truncate the table every run: With each run, all data in the table is truncated and replaced with any new results.

  4. Drop the table every run: With each run, the table is dropped (deleted), and all data is deleted. A new table with the same name is created, and any new results are added to it.

  5. Merge the table every run: This publishing option merges the rows in your results with any existing rows in the target BigQuery table. For more information, see Merge Table Operations below.

  6. More options: With each run, you can enable the following rules for strict type matching:

    1. Datetime: Select to publish Datetime values using strict type matching.

    2. Array: Select to publish Array types as BigQuery arrays.

      • You can publish to BigQuery arrays if the data is homogeneous.

      • BigQuery supports nested data up to 15 levels deep. Data that is nested deeper than 15 levels results in an error on publication.

      • Pushdown execution on BigQuery for nested data is not supported. Please disable any flow optimizations before running jobs on nested data.

    3. Object: Select to publish Objects and Arrays of Objects as BigQuery structs.
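
The 15-level nesting limit can be guarded before a run with a small depth check. This is a hypothetical helper for pre-flight validation; BigQuery itself enforces the limit at publication time:

```python
def nesting_depth(value):
    """Depth of nesting for dicts (structs) and lists (arrays)."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0  # scalar values add no nesting

BIGQUERY_MAX_NESTING = 15

def check_publishable(record):
    """Raise if a record is nested too deeply for BigQuery."""
    depth = nesting_depth(record)
    if depth > BIGQUERY_MAX_NESTING:
        raise ValueError(
            f"record nested {depth} levels deep; BigQuery supports "
            f"at most {BIGQUERY_MAX_NESTING}")
    return depth
```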

For further details, refer to this detailed documentation guide.

Merge the table every run

This publishing option merges the rows in your results with any existing rows in the target BigQuery table and, optionally, inserts unmatched rows from your results or deletes unmatched rows from the target table.

  1. In the Table Settings panel, select Merge the table every run.

  2. Primary keys for matching: Select one or more columns whose values determine whether a row in your source results matches a row in the target. When these key values match, the action that you select below is applied to the target row.

    • If the matching columns have duplicate rows in the target table, all rows in the target are updated.

    • If the matching columns have duplicate rows in the source, the job fails.

  3. Action on target table for matched rows: Select the action to apply to the target record when a match is found between the key columns:

    1. Update: The values from your results are written into the columns specified below.

    2. Delete: The row in the target table is deleted.

  4. Insert source rows if no match in target:

    1. When selected, rows in your source that do not have a matching set of values in key columns are inserted into the table as new rows.

    2. When deselected, these unmatched rows are not written to the target table.

  5. Delete target rows if no match in source:

    1. When selected, all rows in the target that do not have a matching set of key values in your source results are deleted.

    2. When deselected, unmatched rows in the target are not deleted.
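
The matching, update/delete, and insert/delete rules above can be sketched as a pure-Python simulation over lists of dicts. This is illustrative only; the real merge executes inside BigQuery, and the `merge` function and its parameter names are invented here:

```python
def merge(target, source, keys, matched_action="update",
          insert_unmatched_source=True, delete_unmatched_target=False):
    """Simulate merging source rows into target rows (both lists of dicts).

    keys: the primary-key columns used for matching.
    matched_action: "update" or "delete" for target rows with a match.
    """
    def key_of(row):
        return tuple(row[k] for k in keys)

    # Duplicate key values in the source fail the job.
    source_keys = [key_of(r) for r in source]
    if len(source_keys) != len(set(source_keys)):
        raise ValueError("duplicate key values in source: job fails")
    by_key = {key_of(r): r for r in source}

    result, matched = [], set()
    for row in target:
        k = key_of(row)
        if k in by_key:
            matched.add(k)
            if matched_action == "update":
                # Every duplicate target row with this key is updated.
                result.append({**row, **by_key[k]})
            # matched_action == "delete": drop the matched row.
        elif not delete_unmatched_target:
            result.append(row)
    if insert_unmatched_source:
        result.extend(r for r in source if key_of(r) not in matched)
    return result
```

For example, merging on `id` updates the matched row, keeps unmatched target rows, and inserts unmatched source rows as new rows.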

Publishing to Snowflake

Publish actions: Select one of the following.

  1. Append to this table every run: Each run adds any new results to the end of the table.

  2. Truncate the table every run: With each run, all data in the table is truncated and replaced with any new results.

  3. Drop the table every run: With each run, the table is dropped (deleted), and all data is deleted. A new table with the same name is created, and any new results are added to it.

  4. Merge the table every run: This publishing option merges the rows in your results with any existing rows in the target Snowflake table. For more information, see Merge Table Operations below.

For more information, see Snowflake Table Settings.

Handling Duplicates

  1. If the matching columns have duplicate rows in the target table, all rows in the target are updated.

  2. If the matching columns have duplicate rows in the source, the job fails.
