Rules allow DataForge to modify and transform data.
Rules Tab
Enrichments vs Validations vs Sub Sources
Rule Parameters
| Parameter | Default Value | Description |
| --- | --- | --- |
| Type | Enrichment | The type of the Rule. Validations mark records as pass/fail based on a boolean expression in the Expression field. Sub-source rules create new sources tied to the rule for array-of-struct types. |
| Rule Name* | No Default | The user-defined name of the Rule. |
| Attribute Name* | No Default | The name of the new column created by the Rule. This is what appears in attribute dropdowns and within expressions. When creating a new rule, this field is automatically filled in with valid formatting based on what the user enters in the Rule Name field. |
| Description | No Default | The user-defined description of the Rule. |
| Expression Data Type | Interpreted from expression | The data type of the result of the Expression, automatically inferred by DataForge. This field is not controlled by the user. |
| Metadata Icon (struct type only) | Interpreted from expression result | Struct data rules display a metadata icon next to the expression data type. Opening the metadata icon shows the struct schema fields. |
| Attribute Data Type | Expression Data Type | The data type of the Enriched Attribute. DataForge will attempt to convert the Expression Data Type to the Attribute Data Type. Leave as the default for no conversion. |
| Recalculation Mode | Snapshot | Snapshot: the field is never recalculated for prior data inputs, even after attributes the rule depends on are updated, unless a user resets enrichment or uses the Recalculate options for prior inputs. Keep Current: if an attribute value a record depends on is updated, the dependent attribute is also updated during processing. |
| When expression is false, set to | Warn | The flag set on records for which the validation expression evaluates to false. Warn, Fail, or Ignore are the possible options. For Validations only. |
| Expression | No Default | Use DataForge QL/Spark SQL syntax to define the Rule's transformation logic. |
| Active | TRUE | Sets whether this Rule is active. If active, it is applied to new inputs and source resets. |
Rule Expressions and Syntax
Complex Data Types
Struct and array data types can be used in rules and templates. Structs can be navigated within a rule using dot notation, such as `struct.key`. Arrays can be navigated using a zero-based index, such as `array[0]`. DataForge recommends using sub-source rules where columns contain an array of structs.
The expression editor's IntelliSense helps navigate nested complex types by listing the child keys at each nesting level. For working with complex nested arrays, DataForge recommends using Databricks higher-order functions.
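As a sketch, Spark SQL's built-in higher-order functions such as `transform` and `filter` take a lambda and apply it to each array element; the column and field names below are hypothetical, not from a real source:

```sql
-- Hypothetical array-of-structs column "line_items" with fields "quantity" and "amount".
-- Produce a new array with each quantity doubled:
transform([This].line_items, item -> item.quantity * 2)

-- Keep only the struct elements whose amount exceeds a threshold:
filter([This].line_items, item -> item.amount > 100)
```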
Using Relations in Rule Expressions
Relations can be traversed in a rule expression to reference attributes and rules from other sources. When a relation is used in a rule expression, Expression Parameters populate, giving users visibility into the relation being used along with options to change the relation chain direction.
As an example, suppose a rule on the Source "Customer Data" needs to pull an attribute from the Source "Sales Data" to calculate total sales per customer by traversing a relation between the two sources.
Before the rule can be written, the relation must exist between the two sources. In this case, the relation is named "FK_SalesOrderHeader_Customer_CustomerID" and the cardinality is 1:M from Customer to Sales.
To start, the rule is written to reference the "totaldue" attribute from the Sales Data Source.
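Because the cardinality from Customer to Sales is 1:M, the Sales attribute must be aggregated. Assuming the relation described above, a sketch of such a rule expression might look like:

```sql
-- Total sales per customer, aggregated across the 1:M relation to Sales Data
sum([Sales Data].totaldue)
```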
Example Expressions
Multiplying an attribute of the current source by an attribute from a related source through a primary relation:
```sql
[This].ListPrice * [Related Source Name].Quantity
```
Aggregating a column from a primary related source with many cardinality:
```sql
sum([Related Source Name].SalesTotal)
```
Aggregating a column from a non-primary related source with many cardinality:
```sql
sum([Related Source Name].SalesTotal)
```
Rounding a column to two decimal places:
```sql
round([This].AverageCost, 2)
```
Window function aggregating one attribute of the current source while partitioning by another:
```sql
sum([This].TotalCost) OVER (PARTITION BY [This].CustomerID)
```
Validation rule to determine if a numeric column of the source is above a specified threshold:
```sql
[This].TotalCost > 100
```
Returning a key from a struct field using dot notation:
```sql
[This].struct.key
```
Returning the first element of an array:
```sql
[This].array[0]
```
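Dot notation and array indexing can also be combined in a single expression; the column and key names here are hypothetical:

```sql
-- Hypothetical: first element of an array-of-structs column, then one of its keys
[This].order_lines[0].product_id
```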
When will my rule get processed?
Enrichment
- All traversal rules, regardless of keep current
- Rules that do not contain window functions and do not depend on a window function rule, regardless of keep current
- All window function rules (window function rules must be keep current)
- All rules that depend on window functions (rules that depend on window functions must be keep current)
- All keep current traversal rules WHEN the related source gets new data
- All rules that depend on the above keep current traversal rules (rules dependent on keep current rules must be keep current)
Data Profiles
Common
- Attribute Type
- Data Type
- Number of Rows
- Min
- Max
- Unique %
- Null %
- Top 5 Values
- Bottom 5 Values
- Distribution Percentiles (10%, 25%, 50%, 75%, 90%)
Text
- Min Length
- Max Length
- Avg Length
- Numeric %
- Blank %
- Special Char %
Numeric
- Average
- Median
- Standard Deviation
- Zero %
Timestamp
- Average
- Median
- Standard Deviation
Sub-Source Rules
Rules can be added to a sub-source just as they can to a regular source, with the following limitations:
- No validation rules allowed within the sub-source
- No unique flag available for sub-source rule(s)
For full documentation, visit Sub-Sources.