Creating Outputs

This section covers the configuration of an Output.

Logical Data Flow

Outputs process and map transformed data from the DataForge Data Hub to a Data Warehouse or third-party system.  An Output typically consists of very limited transformation logic, focusing instead on simple mappings of data fields from the hub tables to the final location.  In this example, we map the output back to Databricks Delta Tables.

Step 1: Create Output Connection

An output connection is necessary to complete this phase.  If one was not created during the Setting up Connections section earlier, complete that step first.

Step 2: Create Output

Before configuring any mappings, DataForge requires general information about the desired Output, including a name and where within the connection the output should push data. 

Begin by navigating to the Outputs page from the main menu in DataForge.  From this page, use the New + button to create a new output.

Configure the following parameters on the new output:

  • Name: Databricks Delta Lake - TPCH Customer
  • Description: Publishing Customer data to Databricks Delta table
  • Active: Leave as true/on. This setting appears throughout DataForge; in this case, it dictates that the output is active and should run when new data is ingested on the sources mapped to it.

Output Type

DataForge offers three output types.  For this example, use the Table option.

  • File: DataForge can output to File(s), which can be CSV, Parquet, Avro, JSON, or Text.
  • Table: DataForge can output to, manage, and ensure accurate refresh of a database Table.
  • Virtual: DataForge can create a database view to expose data for polling by outside systems, otherwise known as a Virtual Output.

Update the following settings with the respective values:

  • Connection: Databricks Delta Lake Output
  • Table Name: tpch_customer_enriched
  • Table Schema: default
    • This can be changed, but the schema must already exist in the catalog defined in the output connection for the output process to succeed (see the example after this list).
  • Post Output Commands: None
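
If a schema other than default is used, it must be created in the catalog defined in the output connection before the output runs.  A minimal sketch for a Databricks notebook, assuming a hypothetical catalog named main and schema named tpch_outputs (substitute the actual names from the connection):

# Run once in a Databricks notebook before the first output run.
# "main" and "tpch_outputs" are placeholder names; use the catalog from the
# output connection and the schema entered in Table Schema above.
# The spark session is provided automatically in Databricks notebooks.
spark.sql("CREATE SCHEMA IF NOT EXISTS main.tpch_outputs")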

Post Output Commands is a more advanced feature that allows attaching a Databricks notebook, built with the DataForge SDK, containing additional logic that is triggered every time this Output runs.

Click Save to save the new output. Refer to the screenshot above for a complete version of the output settings if needed.

Step 3: Add an Output Channel (Source Mapping)

With the Output details configured and a location within the connection set for the data, the next step is to map the Customer Source to the Output and map the individual columns from the Source to the Output.

Add a Source Mapping

To enable mappings between the Customer source (Databricks Unity Samples - tpch.customer) and the new output (Databricks Delta Lake - TPCH Customer), click on the + Add Source Mapping button.

This brings up a Source Mapping page where we are able to select the source to map and add additional filtering criteria to the mapping. Below is an image of the final result, along with instructions for updating these parameters.

Select the Customer Source (Databricks Unity Samples - tpch.customer) to start.

Filter Data in Filter Expression

DataForge provides the ability to enter a filter expression directly on the source output mapping which filters the data from the source hub table and limits records on output. 

Write the filter expression as follows:

[This].total_orders_price > 0

This expression will limit the rows that are output to only records where the total_orders_price rule result is greater than 0.  
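
Conceptually, this behaves like a WHERE clause applied to the source hub table before rows are written to the output.  A rough PySpark illustration of the equivalent logic (the data and column names below are placeholders, not the actual hub table; DataForge applies the filter expression itself during output processing):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the source hub table (illustration only).
hub_df = spark.createDataFrame(
    [(1, 150.0), (2, 0.0)],
    ["c_custkey", "total_orders_price"],
)

# Equivalent of the filter expression [This].total_orders_price > 0
filtered_df = hub_df.filter(F.col("total_orders_price") > 0)
filtered_df.show()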

Include Rows to Filter on Validation Rules

The Include Rows options are Passed, Warned, and Failed.  This setting evaluates the validation rules written on the source and includes or excludes records based on whether each record passed, warned, or failed those rules.

For this example, choose the Passed option.  Since the validation rule set up previously adds a warning to any record that is not a customer in the Automobile market segment, this output will only include records that pass that validation.

Between the filter expression and the Include Rows setting, this output will only send records that are customers in the Automobile market segment AND have a total order price greater than zero.

Operation Type

For this example, leave as N/A.  DataForge offers the ability to aggregate data and output at a higher grain than the actual data.  For more details, refer to the User Manual's Output Mapping documentation.
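
As a conceptual illustration of outputting at a higher grain (not DataForge's implementation), the PySpark sketch below aggregates customer-level rows to one row per market segment; the data and column names are placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder customer-level rows.
customers = spark.createDataFrame(
    [("AUTOMOBILE", 120.0), ("AUTOMOBILE", 80.0), ("BUILDING", 50.0)],
    ["c_mktsegment", "total_orders_price"],
)

# One row per market segment instead of one row per customer (a higher grain).
by_segment = customers.groupBy("c_mktsegment").agg(
    F.sum("total_orders_price").alias("segment_total_orders_price")
)
by_segment.show()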

Name and Description

The source mapping, also referred to as an Output Channel, allows users to define the channel name that is represented on the output mapping page.  For these two settings, use the following naming values:

  • Name: Databricks Unity Samples - tpch.customer
  • Description: only output Automobile market segment customers with total order price greater than 0

Click Save at the bottom to finish.

Step 4: Map Columns to the Output Channel/Source Mapping

After adding the output channel, which ties back to the source processing the ingested data, add the columns that should be included in the output.

Columns can be added to the output in two ways: Add All Source Columns and manually.  Add All Source Columns creates a column for every raw attribute and rule in the attached source hub table and maps these fields to the new columns.  Manually adding columns requires the user to click the +Add Column button, give the column a name, save it, and then click into the column/channel and map a specific field to the output.

For demonstration purposes, this integration example will create one column manually first and then add all source columns for the remaining columns. However, in normal cases, it is possible to start by adding all source columns.

Add a New Column Manually

Use the +Add Column button to create a new column and name it total_orders_price.

Click Save to add the column.  Entering the additional fields in the column is optional.  When fields are mapped to this column, DataForge will automatically recognize the data type and update it.

Next, click into the new column cell on the same row as the channel/source mapping similar to the image below.  This brings up a new pop-up box to allow typing the raw attribute, rule, or system field into the mapping.

Start by either typing a period (.) or [This]. which triggers DataForge to provide a list of fields available for the mapping from this source.

Map the total_orders_price field to this column mapping and click Save.

At this point, one column has been added and a source field mapped to the column like the image below.

Add All Source Columns

To map the remainder of the columns, use the Add All Source Columns option which is available after clicking the vertical triple-dot menu on the channel.

This results in DataForge adding all the raw attribute and rule columns to the output mapping and automapping these fields to those new columns.  The output should look similar to below.  Use the arrows in the bottom corner to scroll through columns.

Step 5: Run Output on All Source Data

Now that the Source and Output have been fully configured, the final step is to run the data all the way through to the destination output: Delta Tables in Databricks.

Start by navigating back to the Customer Source (Databricks Unity Samples - tpch.customer) either through the Main Menu or by clicking the source hyperlink in the output mapping, as shown below, which opens it in a new tab.

On the source Inputs tab, click the triple-dot menu on the header row to reprocess all input data loaded to date.  Select the option for Reset Output and then select either the All Channels option or the Databricks Unity Samples - TPCH Customer option.  Since this source is only mapped to one output at this point, both options do the same thing.  If the source were mapped to multiple outputs, output could be reset for one channel/output mapping at a time if needed.

DataForge will start processing output for all inputs.  When complete, a green checkmark will appear in the Status column for all inputs.

Note: For all future inputs, re-running output is not required.  Now that the rules and output are configured, every new input data pull will run through these configurations as the data is processed.  As new rules are created or output configurations are changed, older inputs can be reprocessed as shown in this example.

Step 6: Checking the Output Table Data

Since the output is mapped to send data to Databricks Delta Tables, start by opening Databricks from the DataForge main menu again.

Open the Data Explorer in Databricks and expand the catalog defined in the output connection as well as the default schema.  A new table now appears with the name "tpch_customer_enriched".

Either use the Create -> Notebook option or use an existing notebook to query this table and check the output results.  The table should contain only records where the c_mktsegment field is "Automobile" and where total_orders_price is greater than zero.
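
A minimal query sketch for a Databricks notebook, assuming the output connection points at a catalog named main (substitute the actual catalog name from the connection):

# Read the table written by the DataForge output and spot-check the filters.
# The spark session is provided automatically in Databricks notebooks.
df = spark.table("main.default.tpch_customer_enriched")

# Expect zero rows outside the configured filter and validation rule
# (match the c_mktsegment literal to the casing stored in the source data).
print(df.filter("c_mktsegment <> 'Automobile' OR total_orders_price <= 0").count())

df.show(5)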

Continue on to the Data Integration Example Recap to complete this example.

 

Updated

Was this article helpful?

0 out of 0 found this helpful