This section covers Relation and Rule configuration.
Logical Data Flow
Relations define connections between sources and enable users to configure lookups, cross-source rule logic, and output mappings. Relations can be thought of as similar to Joins in SQL. When creating a relation, there are two sources involved and a join condition using raw attributes or rules from each source.
Rules consist of Enrichments and Validations and append columns of data to the existing dataset. Validations are always a boolean data type with a true/false result to apply data quality checks. Enrichments execute business logic and can be of any data type supported by DataForge. User-specified configuration rules drive transformations and processing in several automated steps. The Rules represent the majority of the logic and structure of data processing within DataForge, and provide a flexible, yet guided framework for data management.
Note: The supported syntax in rule expressions is specific to Spark SQL. Refer to Spark SQL documentation for more information on specific functions: https://spark.apache.org/docs/latest/api/sql/index.html
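As a rough mental model only (this is not DataForge rule syntax, and the table and column names below are hypothetical), rules behave like derived columns added in a Spark SQL SELECT: a Validation adds a Boolean column, and an Enrichment adds a column of any supported type.
-- Hypothetical illustration: rules conceptually append columns to the dataset
SELECT
  *,
  amount > 0             AS amount_is_positive,  -- validation-style Boolean check
  round(amount * 1.1, 2) AS amount_with_tax      -- enrichment-style business logic
FROM some_source_table;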
In the Data Integration Example, a Relation will be created between the two sources previously set up, Customers and Orders. This relation will be used to write rules for data transformation and for output filtering based on validations.
Step 1: Configure Relation between Sources
Start by opening the previously configured Customer Source (Databricks Unity Samples - tpch.customer) from the Sources page. Click on the Raw Schema tab within the source to see all the raw attributes that have been ingested on the first data pull.
In the previous section, Getting Familiar with the Data, it was observed that both the Customer and the Order datasets have a customer key that can be used to tie the two datasets together. In the customer table, the field was c_custkey, and in the order table, the field was o_custkey. We will use these two fields to build a relation between Sources in DataForge.
Open the Relations tab and click the New + button to create a new relation.
Creating a new relation requires populating a few fields in the pop-up box that appears. As with everything in DataForge, the fields that are marked with an asterisk (*) are mandatory before saving is allowed.
First, give the relation a name. To keep it simple, stick with a format of "Source X - Source Y - Join attribute". In this example, we'll call it "Customer - Orders - custkey". Optionally, enter a description for the relation.
Next, select the [Related] Source from the drop-down. During configuration, [This] Source always represents the source the user currently has open. In this case, the relation is being created from the Relations tab of the Customer source, so [This] Source is already populated as the customer source. Select "Databricks Unity Samples - tpch.orders" as the [Related] Source.
Note that the cardinality fields of the relation are empty. These will be populated based on the fields (raw attributes or rules) and the logic used in the relation expression next. Cardinality indicates whether the data is unique on each side of the relation: if unique, cardinality is 1; otherwise, cardinality is Many. Both the cardinality shown and which source appears as [This] vs. [Related] depend on which of the two sources the relation is currently being viewed from.
Lastly, write the relation expression in the box below the [Related] Source field. As you type, DataForge suggests options for quick entry rather than requiring everything to be typed out manually. The relation expression used will be:
[This].c_custkey = [Related].o_custkey
Whether writing relation or rule expressions, [This] tells DataForge to start by looking at the source the expression is being written on. [Related] is used in relation expressions and tells DataForge to refer to the source selected in the [Related] Source drop-down.
Once the related source selection and relation expression logic are validated, the Save option becomes available. Save the relation.
After saving the new relation, the cardinality is populated as 1 to Many. Since c_custkey is a key column in the customer source, every row on this side of the relation is unique, giving it a cardinality of 1. On the orders source, o_custkey is not a key column and has not been made unique, so it represents the Many side of the relation.
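Conceptually, this 1-to-Many relation behaves like the SQL join sketched below. This is for illustration only and is not DataForge syntax; the table names assume the Databricks sample catalog (samples.tpch), and DataForge generates the actual join logic from the relation definition.
-- Illustrative only: each customer (the 1 side) can match many orders (the Many side)
SELECT c.c_custkey, o.o_orderkey, o.o_totalprice
FROM samples.tpch.customer AS c
JOIN samples.tpch.orders AS o
  ON c.c_custkey = o.o_custkey;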
Step 2: Create New Validation
A Validation Rule consists of validation logic, in the form of a Boolean statement. After checking each statement individually, DataForge tags each record with a pass, failure, or warning flag.
This section creates a simple validation rule that checks whether each customer record's market segment is Automobile.
This example portrays a business requirement to filter customer records based on whether the customer has a Market Segment of Automobile. The BI reporting tool requires a flag that indicates this property in a simple way (as opposed to the c_mktsegment column, which contains many possible values). Creating this Validation Rule simplifies downstream reporting needs and allows greater flexibility when filtering data later on.
Navigate to the Rules tab of the customer source and use the New + button to begin creating a new rule.
In the screen that opens, set the Type to Validation.
Configure the rule parameters as follows. There are no enforced naming conventions for Validation rules, but use these values for this example.
- Name: Market Segment is Automobile
- Description: Flag when Market Segment is Automobile
- When expression is false, set to: Warn
Notice that the Attribute Name is automatically normalized from the Rule Name, converting it to lowercase with underscores. DataForge normalizes these rule names to avoid data complications downstream. Users can edit the Attribute Names as needed for their rules.
For Expression, input the following. Note that [This] is the syntax to tell DataForge to begin by looking within the currently selected data source. After typing [This] and entering the first period (.), DataForge will pull up a list of available attributes, rules, and system attributes that can be used for quicker expression population.
[This].c_mktsegment = 'AUTOMOBILE'
Click Save to finish. The new Validation rule should look like the image below when opened up again by clicking the rule (anywhere but the expression).
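Conceptually, the validation evaluates the Boolean expression for each record and uses the result to tag the record (Warn on false, per the setting above). A minimal Spark SQL sketch of the idea, assuming the Databricks sample table samples.tpch.customer for illustration only:
-- Illustrative only: the Boolean result drives the pass/warn flag per record
SELECT
  c_custkey,
  c_mktsegment,
  c_mktsegment = 'AUTOMOBILE' AS market_segment_is_automobile
FROM samples.tpch.customer;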
Helpful Tips:
When entering an expression, type ` (backtick) to reveal all of the available Spark functions.
When field names are prepended with a [This] such as [This].c_custkey, it indicates that the field exists in the current source.
DataForge supports Rule templates, whereby commonly used Enrichment and Validation rules can be saved as preset templates to apply to various sources. This is a more advanced feature discussed in the User Manual's Rule Template section.
Step 3: Create New Enrichment
An Enrichment Rule creates a new column based on either a formula or a lookup match and returns the resulting value. Enrichment expressions/logic can be written to reference other raw attributes and rules within the same source, or they can use Relations to build a chain to reference raw attributes and rules from other sources.
Next, we'll create a new enrichment rule to traverse the relation we created previously over to the Orders Source, sum the total price of orders for each customer, and write the result back to our Customer Source.
Use the New + button on the same Rules tab to create a new Enrichment rule.
Configure the rule parameters to match the following:
- Name: Total Orders Price
- Description: Sum total price of orders per customer from Orders Source
- Recalculation Mode: Snapshot
Recalculation Mode indicates whether this field should be refreshed after the initial calculation. For the purposes of this example, keep the choice as Snapshot, which means the rule is only calculated as data is ingested, not retroactively. For more information on rule parameters such as Snapshot vs. Keep Current, refer to the User Manual's Rule documentation.
For Expression, enter the following.
SUM([Databricks Unity Samples - tpch.orders].o_totalprice)
Note that the expression uses the relation by referencing [Databricks Unity Samples - tpch.orders] rather than [This]. After an open bracket [ is typed in the expression editor, a list appears showing all of the sources that can be referenced based on the relations set up.
When referencing attributes or rules from other sources, a parameter displays in the Expression Parameters section for each reference. Relation paths can be modified using the relation path drop-down if there is more than one way to reach the related source.
For more information, visit the User Manual's Relation documentation.
Click Save to finish. The Enrichment Rule should look like the image below when complete.
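Conceptually, this enrichment traverses the 1-to-Many relation and aggregates the related order rows per customer. A minimal Spark SQL sketch of the equivalent logic, again assuming the Databricks sample tables (samples.tpch) for illustration only:
-- Illustrative only: sum order totals per customer across the relation
SELECT
  c.c_custkey,
  SUM(o.o_totalprice) AS total_orders_price
FROM samples.tpch.customer AS c
LEFT JOIN samples.tpch.orders AS o
  ON c.c_custkey = o.o_custkey
GROUP BY c.c_custkey;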
Step 4: Execute Validation and Enrichment Rules
Recall that the recalculation mode for both rules created above was Snapshot. Since data had already been ingested into the Customer Source before these rules were created, that data lacks the logic just configured in this section.
To remedy this, the rules need to be recalculated on the data previously ingested and processed.
Navigate back to the Inputs tab in the Customer Source.
On the Inputs page, users will see multiple vertical triple-dot (ellipsis) menus. The triple-dot menu on the header row allows users to reprocess ALL inputs, while the triple-dot menu on a specific input row resets processing for that specific input of data.
To keep it simple, click the triple-dot menu on the header row and select the option to Recalculate All. This process type reruns all rule logic for all inputs, regardless of whether they were Snapshot or Keep Current rule types.
After selecting Recalculate All, DataForge will reprocess rule logic for all inputs (in this case only one).
Once DataForge has finished this process, a green checkmark will appear for the input again.
Step 5: Ensure Validation and Enrichment Correctness
Click the Data View tab (the rightmost tab available). This opens a new browser tab directly on the hub table stored in the background in Databricks, similar to the view seen in the Data Explorer during the Getting Familiar with the Data section.
If the Databricks SQL warehouses are not actively running, users will land on a page like the one below, showing the correct hub table but with no metadata displayed.
If this happens, click the option to start the Starter Warehouse or another SQL Warehouse/Cluster. Again, this may take a minute or two to start running in Databricks. Once one of these compute resources is running, the page will reload with the correct table metadata, as shown below.
Similar to the Getting Familiar with the Data section, click the Create option and choose Query to launch the Query Editor with a starter query for this hub table.
Every Source that is created and managed within DataForge will have a separate hub table living in Databricks that users can query to see and validate the data. These hub tables can be found in the Catalog Explorer under the hive_metastore catalog and the "dataforge" schema.
The hub table nomenclature is always hub_<source_id>. The source ID is unique to each source and is listed in DataForge; it also appears in the URL when users are actively looking at a source.
All Source IDs are listed on the sources page.
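If helpful, the hub tables can also be listed directly with a Spark SQL command such as the one below (assuming the hive_metastore catalog and dataforge schema noted above):
-- Illustrative only: list DataForge hub tables
SHOW TABLES IN hive_metastore.dataforge LIKE 'hub_*';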
Coming back to the new query created within Databricks, use the "Connect" drop-down to attach the Default Warehouse or another running compute resource if one is not already attached.
Run the SQL query to view and validate the data.
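For example, a query along these lines can confirm the new columns (replace <source_id> with the Customer source ID; the attribute names below follow the normalized rule names from earlier in this section and may differ in your environment):
-- Illustrative only: check the validation and enrichment results on the hub table
SELECT
  c_custkey,
  c_mktsegment,
  market_segment_is_automobile,  -- validation result
  total_orders_price             -- enrichment result
FROM hive_metastore.dataforge.hub_<source_id>
LIMIT 10;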
After validating the new enrichment and validation rule are correct, continue on to the Creating Outputs section of this Data Integration Example.