
Adding publishing options for a job run

When you run a job, you must add at least one publishing destination for your output. Destination options include:

  • Choosing an output file or folder in GCS, S3, HDFS, ADLS, etc.

  • Choosing an output table in BigQuery, Redshift, Hive, or other external relational databases

For each destination, you can choose from a variety of options:

  • File output format - creating, appending to, or replacing a CSV, JSON, Avro, or Parquet file

  • Table output - adding a new table, appending to or replacing an existing table

  • Single file or multiple file output

  • Compression options - including BZIP, GZIP, and Snappy

  • Other options such as headers, delimiters, etc.
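As a rough illustration of how the format, compression, header, and delimiter options above combine, here is a minimal Python sketch that writes a GZIP-compressed CSV with a header row. The column names and rows are invented for illustration and do not come from the product:

```python
import csv
import gzip
import io

# Hypothetical result rows; the column names are illustrative only.
rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

# Write a GZIP-compressed CSV with a header row and a comma delimiter,
# mirroring the header/delimiter/compression options described above.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    text = io.TextIOWrapper(gz, encoding="utf-8", newline="")
    writer = csv.DictWriter(text, fieldnames=["id", "name"], delimiter=",")
    writer.writeheader()
    writer.writerows(rows)
    text.flush()

compressed = buf.getvalue()
# Decompress to confirm the round trip.
print(gzip.decompress(compressed).decode("utf-8"))
```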

Relational Table settings

Select table options:

  1. Table name:

    1. New table: Enter a name for it. You may use a pre-existing table name, and schema checks are performed against it.

    2. Existing table: You cannot modify the name.

  2. Output database: To change the database to which you are publishing, click the database icon in the sidebar. Select a different database.

  3. Publish actions: Select one of the following.

    1. Create new table every run: Each run generates a new table with a timestamp appended to the name.

    2. Append to this table every run: Each run adds any new results to the end of the table.

    3. Truncate the table every run: With each run, all data in the table is truncated and replaced with any new results.

    4. Drop the table every run: With each run, the table is dropped (deleted along with all of its data), a new table with the same name is created, and any new results are written to it.
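The publish actions above differ mainly in what happens to the target table between runs. The following sqlite3 sketch is an illustration of the append, truncate, and drop behaviors only, not the product's implementation; the table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER, value TEXT)")

def publish(rows, action):
    """Simulate one job run publishing `rows` with the given action."""
    if action == "truncate":
        # Truncate: keep the table, delete all existing rows first.
        conn.execute("DELETE FROM results")
    elif action == "drop":
        # Drop: delete the table itself, then recreate it empty.
        conn.execute("DROP TABLE results")
        conn.execute("CREATE TABLE results (id INTEGER, value TEXT)")
    # Append (and the tail end of truncate/drop): add the new results.
    conn.executemany("INSERT INTO results VALUES (?, ?)", rows)

publish([(1, "a"), (2, "b")], "append")
publish([(3, "c")], "append")    # table now holds 3 rows
publish([(4, "d")], "truncate")  # table now holds only the 1 new row
count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(count)
```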

An example of publishing to Redshift:

Publishing to a File System

Output destination settings when publishing to a file system:

  1. Create a new file every run: Each run generates a new file with a timestamp appended to the name.

  2. Append to this file every run: Each run adds any new results to the end of the file.

  3. Replace the file every run: With each run, the file is deleted and replaced with the new results.
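The three file-system publish actions can be sketched in a few lines of Python. This is an illustration of the semantics only, under the assumption that "create" appends a timestamp to the file name; the file names are invented:

```python
import os
import tempfile
import time

outdir = tempfile.mkdtemp()

def publish(data, action, basename="output.csv"):
    """Simulate one run publishing `data` to a file destination."""
    if action == "create":
        # Create: each run writes a new file with a timestamp in the name.
        name = f"{basename}.{int(time.time() * 1000)}"
        mode = "w"
    elif action == "append":
        # Append: new results are added to the end of the existing file.
        name, mode = basename, "a"
    else:  # "replace"
        # Replace: the old file is deleted/overwritten with the new results.
        name, mode = basename, "w"
    path = os.path.join(outdir, name)
    with open(path, mode) as f:
        f.write(data)
    return path

publish("run0\n", "create")   # writes a timestamped file
publish("run1\n", "append")
publish("run2\n", "append")   # output.csv now holds run1 and run2
path = publish("run3\n", "replace")
with open(path) as f:
    print(f.read())           # only the replaced contents survive
```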

An example of publishing options for GCS:

Publishing to BigQuery

When publishing to a BigQuery table, you have all of the standard options for writing to a table. In addition, you can merge the table: existing rows in the target table are updated with corresponding values from your results, and, optionally, unmatched rows in your results are inserted into the table. When you select this option, you specify a set of key columns and then decide how data is merged into the table.

  1. Merge the table every run: This publishing option merges the rows in your results with any existing rows in the target BigQuery table.

    1. Choose columns for matching rows: Select the column(s) whose values determine whether a row in your source results matches a row in the target (for example, a UserId column). When these key values match, the columns selected in the next step are updated.

    2. Choose columns to update: Select the column(s) whose values are updated from your source results when the key values match. These are the columns that are merged into the table. You can also select All columns.

    3. Insert records if keys don't match (checkbox)

      a) When selected, rows in your source that do not have a matching set of values in key columns are inserted into the table as new rows.

      b) When deselected, these unmatched rows are not written to the target table.
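The merge behavior described above amounts to an upsert keyed on the matching columns. The sketch below illustrates that logic only, not BigQuery's actual implementation; the UserId and score fields are hypothetical:

```python
# Target table and source results as lists of dicts. "UserId" is the
# matching (key) column and "score" is the column chosen for update.
target = [{"UserId": 1, "score": 10}, {"UserId": 2, "score": 20}]
source = [{"UserId": 2, "score": 99}, {"UserId": 3, "score": 30}]

def merge(target, source, keys, update_cols, insert_unmatched=True):
    """Upsert `source` rows into `target`, matching on `keys`."""
    by_key = {tuple(row[k] for k in keys): row for row in target}
    for row in source:
        key = tuple(row[k] for k in keys)
        if key in by_key:
            # Key values match: update the chosen columns in place.
            for col in update_cols:
                by_key[key][col] = row[col]
        elif insert_unmatched:
            # "Insert records if keys don't match" is checked:
            # unmatched source rows become new rows in the target.
            target.append(dict(row))
    return target

merged = merge(target, source, keys=["UserId"], update_cols=["score"])
print(merged)
```

With `insert_unmatched=False`, the row with UserId 3 would simply be dropped, matching the deselected-checkbox behavior.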

An example of publishing options for BigQuery:

Example of merge tables:

Handling Duplicates

  1. If the matching columns have duplicate rows in the target table, all rows in the target are updated.

  2. If the matching columns have duplicate rows in the source, the job fails.
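The source-side rule can be checked before merging. The job-failure behavior is the documented one; the check itself is an illustrative sketch with a hypothetical UserId key:

```python
from collections import Counter

def check_source_keys(source, keys):
    """Fail fast if the source results contain duplicate key values."""
    counts = Counter(tuple(row[k] for k in keys) for row in source)
    dupes = [key for key, n in counts.items() if n > 1]
    if dupes:
        # Mirrors the documented behavior: duplicate keys in the
        # source cause the merge job to fail.
        raise ValueError(f"duplicate key values in source: {dupes}")

source = [{"UserId": 2, "score": 99}, {"UserId": 2, "score": 50}]
try:
    check_source_keys(source, keys=["UserId"])
except ValueError as e:
    print("job failed:", e)
```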

