Rules

Rules allow DataForge to modify and transform data.


Rules Tab

The Rules tab allows users to select, edit, remove, or add a Source's Rules. By default, only Active Rules are listed. The Active Only toggle changes this setting.
 
Active rule toggle switch
 
To create a new Rule, click the New Rule button. This opens the create/edit Rule modal.
 
New Rule Button
 
To edit an existing rule, click any column of that rule's row in the table other than the expression column (circled in the picture below). If a rule references another rule in its expression, a link appears in the expression column (boxed in the picture below); clicking the link opens the edit rule modal for the referenced rule.
 
All columns besides the expression column in the Rule Table can be clicked to open the edit Rule modal.
 

Enrichments vs Validations vs Sub Sources

In DataForge, there are three distinct types of Rules: Enrichments, Validations, and Sub Sources.
 
An Enrichment is a rule type that defines an individual SQL expression resulting in a new column of any supported data type.
 
A Validation is a specific type of Enrichment focused on data quality rules that only return Boolean data type columns. A validation verifies that a given column value (or group of column values) for each row adheres to a rule defined in the validation expression. All validations of a source are aggregated into the s_validation_status_code system attribute with Pass/Fail/Warn values for each column.
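
For instance, a validation expression on a hypothetical UnitPrice attribute could be:

[This].UnitPrice >= 0

Records where the expression evaluates to false are flagged according to the rule's Warn/Fail/Ignore setting.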
 
A Sub-Source is a specific type of rule that creates a child source underneath the parent source to help transform complex nested array data (array of struct). When a sub-source rule is created pointing to an array of structs, a new source object is created and tied to the rule. A sub-source acts as its own source for relations and rules, letting users build logic within the array of structs, with an implicit relation to the parent source provided. All processing is done at the parent Source level and applied to the sub-source.
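
As an illustration, a sub-source rule's expression points at the array-of-struct column to expand (the order_lines attribute here is hypothetical):

[This].order_lines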
 

Rule Parameters

 
Create Rule Modal
 
Fields Available:
Type
Default: Enrichment
The type of the Rule. Validations mark records as Pass/Fail based on a Boolean expression in the Expression field. Sub-sources create new sources tied to the rule for array-of-struct types.

Rule Name*
Default: None
The user-defined name of the Rule.

Attribute Name*
Default: None
The name of the new column created by the Rule. This is what appears in attribute dropdowns of expressions and within expressions. When creating a new rule, this field is automatically filled in with valid formatting based on what the user enters into the Rule Name field.

Description
Default: None
The user-defined description of the Rule.

Expression Data Type
Default: Interpreted from expression
The data type of the result of the Expression, automatically inferred by DataForge. This field is not controlled by the user.

Metadata Icon (struct type only)
Default: Interpreted from expression result
Struct data rules display a metadata icon next to the expression data type. Opening the metadata icon shows the struct schema fields.

Attribute Data Type
Default: Expression Data Type
The data type of the Enriched Attribute. DataForge will attempt to convert the Expression Data Type to the Attribute Data Type. Leave as the default for no conversion.

Recalculation Mode
Default: Snapshot
Snapshot means the field will never be recalculated for data inputs, even after attributes the rule depends on are updated, unless a user resets enrichment or uses the Recalculate options for prior inputs. Keep current means that if an attribute value a record depends on is updated, the dependent attribute is also updated during processing.

When expression is false, set to
Default: Warn
The flag set on records that fail to be converted to another data type. Warn, Fail, or Ignore are the possible options. For Validations only.

Expression
Default: None
Use DataForge QL/Spark SQL syntax to set the Rule transformation logic.

Active
Default: TRUE
Allows the user to set this Rule as Active or not. If active, the Rule is applied on new inputs and source resets.
 
Click Save to save the Rule. Clicking Save and Create Validation will also create an extra Validation column that marks whether values were successfully converted from the Expression Data Type to the specified Attribute Data Type.
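
Conceptually, the generated validation amounts to a cast check. A minimal sketch, assuming a hypothetical string attribute order_date_string being converted to a date (the actual generated expression is managed by DataForge):

[This].order_date_string IS NULL OR CAST([This].order_date_string AS date) IS NOT NULL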
 
 

Rule Expressions and Syntax

Rule expressions are made up of a mix of Spark SQL and DataForge QL. Specifically, DataForge QL is used to access source data attributes, which are used within a Spark SQL expression the way column names would be used in a normal Spark SQL context.
 
For a list of Spark SQL functions that can be used in Rules, please see the latest Spark SQL Built-In Functions: https://spark.apache.org/docs/latest/api/sql/index.html
 
To begin entering a rule expression, either:
1) Enter an open bracket "[" to reveal a drop-down of sources with a relation chain to the current source and [This] Source.  This allows you to select attributes and rules from the current source or related sources.
OR
2) Enter a backtick " ` " to reveal a drop-down of Spark SQL functions. The user can then continue filling out the expression however they would like, as long as attributes are accessed using DataForge QL and the expression as a whole follows Spark SQL syntax, as in the sketch below.
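
A minimal sketch combining the two syntaxes (the FirstName and LastName attributes are hypothetical):

concat(upper([This].LastName), ', ', [This].FirstName)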
 
See Example Expressions for visual examples of different types of rules to get started.
 

Complex Data Types

Struct and array data types can be used in rules and templates. Structs can be navigated within a rule using dot notation, such as struct.key. Arrays can be navigated by using the array index (zero based), such as array[0]. DataForge recommends using sub-source rules where columns contain an array of structs.

The expression editor intelli-sense will help navigate nested complex types by listing child keys for each nested level. For working with complex nested arrays, DataForge recommends using Databricks higher-order functions.
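
For instance, a Spark higher-order function such as transform can apply logic to each element of an array column (the order_lines array and its price and quantity keys are hypothetical):

transform([This].order_lines, line -> line.price * line.quantity)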

Navigating complex data types in the expression editor

 

Using Relations in Rule Expressions

Relations can be traversed in a rule expression to write rules that reference attributes and rules from other sources. When a relation is used in the rule expression, Expression Parameters will populate and provide users visibility into the relation being used with options to change the relation chain direction.

As an example, a rule needs to be written from Source "Customer Data" to pull an attribute from Source "Sales Data" to calculate total sales per customer by traversing a relation between the two sources. 

Before the rule can be written, the relation must exist between the two sources. In this case, the relation is named "FK_SalesOrderHeader_Customer_CustomerID" and the cardinality is 1:M from Customer to Sales.

 To start, the rule is written to reference the "totaldue" attribute from the Sales Data Source.

Since a relation is used, the Expression Parameters panel populates with the [Sales Data].totaldue attribute. However, DataForge recognizes that this relation is 1:M, so the rule expression needs to include an aggregate function. To make the rule valid, a SUM() aggregate is wrapped around the expression.
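
Following this example, the resulting expression would be:

sum([Sales Data].totaldue)
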
The relation path can also be modified when there is more than one relation path available between the two sources. Modifying the relation path provides a search option to quickly find specific sources or relations to use. Once the user selects the destination source, DataForge will pick the best relation path available and display it in the Expression Parameters. Users can change any part of the path using the presented drop-downs.  Where applicable, additional hops in the relation chain can be added with an "Add Next" button to expand the traversal. Relation paths displayed in Expression Parameters have intuitive labels formatted as [From Source Name]->relation name->[To Source Name].
After changing the relation chain using the drop-down, DataForge automatically adds any additional relations needed to complete the relation chain.  If there is no drop-down available, the relation is the only available option to complete the path.
Hovering over relation chains in the Expression Parameters will show a tooltip of the relation expression for easy understanding while writing the rule.
Clicking the parameter name in the Expression Parameters will highlight where the parameter is used in the Rule Expression. This is especially helpful for longer or more complex rules where multiple parameters are referenced, like the example below.
 

Example Expressions

Multiplying an attribute from a related source to an attribute of the current source through a primary relation

[This].ListPrice * [Related Source Name].Quantity

Aggregating a column from a primary related source with many cardinality

sum([Related Source Name].SalesTotal)

Aggregating a column from a non-primary related source with many cardinality

sum([Related Source Name].SalesTotal)

Rounding a column down to two decimal places

round([This].AverageCost, 2)

Window function aggregating one attribute of the current source while partitioning on another

sum([This].TotalCost) OVER (PARTITION BY [This].CustomerID)

Validation rule to determine if a numeric column of the source is above a specified threshold

[This].TotalCost > 100

Returning key from struct field using dot notation

[This].struct.key

Returning first index field from an array

[This].array[0]
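
Conditional enrichment using a CASE expression (the attribute and category labels here are hypothetical)

CASE WHEN [This].TotalCost > 100 THEN 'High' ELSE 'Low' END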

 


When will my rule get processed?

Enrichment

  • All traversal rules, regardless of keep current
  • Rules that do not contain window functions and do not depend on a window function rule, regardless of keep current

Refresh

  • All window function rules (window function rules must be keep current)
  • All rules that depend on window functions (rules that depend on window functions must be keep current)

Attribute Recalculation

  • All keep current traversal rules WHEN the related source gets new data
  • All rules that depend on the above keep current traversal rules (rules dependent on keep current rules must be keep current)

Data Profiles

Clicking the Data Profile icon brings up the data profile of that raw attribute. Different data types provide different data profile statistics.
 
Data Profile options
 
A modal showing the data profile appears when the data type label is clicked. Older data profiles for the source can be accessed using Select profiling timestamp.
 
The Data Profile
 
Data profiles provide the following statistics:

Common

  • Attribute Type
  • Data Type
  • Number of Rows
  • Min
  • Max
  • Unique %
  • Null %
  • Top 5 Values
  • Bottom 5 Values
  • Distribution Percentiles (10%, 25%, 50%, 75%, 90%)

Text

  • Min Length
  • Max Length
  • Avg Length
  • Numeric %
  • Blank %
  • Special Char %

Numeric

  • Average
  • Median
  • Standard Deviation
  • Zero %

Timestamp

  • Average
  • Median
  • Standard Deviation

Sub-Source Rules

Rules can be added to a sub-source similarly to a regular source, with the following limitations:

  • No validation rules allowed within the sub-source
  • No unique flag available for sub-source rule(s)

For full documentation, visit Sub-Sources
