Data Processing Engine Overview
The Logical Data Flow and Data Processing Steps
DataForge processes data from left to right through the logical data flow. It is helpful to think of DataForge as modifying a SQL statement: ultimately, each configuration within the UI modifies a different section of a SQL query executed on the ingested data.
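As a rough illustration of that analogy, the annotated query below shows which clauses the later processing steps loosely correspond to. This is a hypothetical sketch; the table and column names are invented and do not reflect DataForge's internal implementation.

```sql
-- Hypothetical sketch only; table and column names are invented.
SELECT
    o.order_id,                              -- raw attribute from the parsed source
    o.amount * 1.1 AS amount_with_tax,       -- Enrichment rule: a derived SELECT column
    SUM(o.amount) OVER (PARTITION BY o.customer_id)
        AS customer_total                    -- Keep Current rule: recalculated in Attribute Recalculation
FROM hub_orders AS o                         -- Refresh maintains one Hub table per source
WHERE o.amount IS NOT NULL;                  -- Validation rule: a row-level WHERE condition
```

Output would then map the resulting columns to a destination schema, optionally filtering or aggregating rows.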
DataForge does not create data; it only modifies, transforms, and processes it. Across the processing steps, data is ingested, transformed, synthesized, and ultimately output, with the resources required for each step gradually increasing.
The Data Processing Steps section below is a deep dive into each step. An understanding of the Logical Data Flow and the UI is necessary to implement DataForge.
Data Processing Steps Overview
Figure: The logical data flow with data locations underneath.
Data processing in DataForge follows a set of standard steps. During each step, it can be helpful to think of columns being appended to a table: the data is augmented with additional columns representing IDs, keys, updates, and metadata, especially as business logic is applied from Enrichment through Refresh.
Ingestion
This is the first step to get data into DataForge. File sources are brought over as-is, and non-file types (table, API, etc.) are extracted as parquet files. During this phase, data is strictly copied into the appropriate cloud storage location; it is not transformed in any way.
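For illustration only (this is not DataForge's internal implementation), extracting a table source to parquet in cloud storage could look something like the Spark SQL below, with the catalog, table, and storage path all hypothetical:

```sql
-- Hypothetical sketch of extracting a non-file (table) source as parquet;
-- names and paths are invented for this example.
CREATE TABLE ingest_orders_20240101
USING PARQUET
LOCATION 'abfss://datalake/ingest/orders/2024-01-01/'
AS SELECT * FROM source_db.orders;   -- copied as-is, no transformation applied
```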
Parse
Ingested files are converted to a common file format. This provides consistency for downstream steps, so they do not need to be aware of the original ingestion format. In the Parse step, the file is opened, inspected, and transformed into the standard format. A copy of the data is stored in the parse section of the data lake source folders.
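As a hedged sketch of what this conversion amounts to (not the actual Parse implementation), a CSV file ingested in the previous step might be standardized to parquet like this, with all paths and names invented:

```sql
-- Hypothetical sketch of parsing an ingested CSV file into the standard format.
CREATE TEMPORARY VIEW raw_orders_csv
USING CSV
OPTIONS (path 'abfss://datalake/ingest/orders/2024-01-01/', header 'true', inferSchema 'true');

CREATE TABLE parsed_orders_20240101
USING PARQUET
LOCATION 'abfss://datalake/parse/orders/2024-01-01/'
AS SELECT * FROM raw_orders_csv;     -- same data, standardized format
```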
Change Data Capture (CDC)
With all of the data in the same format, Change Data Capture is the first step that applies business logic to the data. This step tags data changes and applies specific columns and metadata based on the configured source Refresh Type, summarized in the table below (a hedged SQL sketch of change detection follows the table).
| Refresh Type | Impact |
| --- | --- |
| Full | Deletes data previously in DataForge and loads the data as if starting from scratch; a full reload occurs on each ingestion. |
| Key | Compares the key column(s) and an optional date/timestamp column to keep the most up-to-date version of each record. |
| Sequence | Looks at the sequence overlap based on the sequence column(s) identified on the source. |
| Timestamp | Looks at the user-defined date column(s) for a time-range overlap to update the data in the Hub table. Since time-series data is usually large, ranges are often more efficient for determining up-to-date data. |
| None | All data is assumed to be new, and no refresh check occurs. Data is appended to the existing Hub table. |
| Custom | Provides the most flexibility. Refresh runs a user-defined custom delete query on the Hub table before merging the ingested data. An optional parameter exists for specific partition columns. |
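For example, key-based change detection could be pictured roughly as the query below. This is an illustrative sketch with invented table and column names, not DataForge's internal logic:

```sql
-- Hypothetical sketch of key-based change detection against the Hub table.
SELECT
    p.*,
    CASE
        WHEN h.order_id IS NULL          THEN 'I'   -- key not in Hub table: insert
        WHEN p.updated_at > h.updated_at THEN 'U'   -- newer timestamp: update
        ELSE 'N'                                    -- no change
    END AS cdc_flag
FROM parsed_orders_20240101 AS p
LEFT JOIN hub_orders AS h
    ON p.order_id = h.order_id;
```

A Custom refresh, by contrast, runs the user-defined delete query against the Hub table before the ingested data is merged in the Refresh step.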
Enrichment
Enrichment executes business rules, calculations, and transformations against the data. It is the primary step for these transformations and executes two types of rules: Enrichments and Validations. The scope of the calculations in this step is the row level, and a custom column is created per rule. No windowing, aggregation, or keep-current logic occurs at this step, and there is no change in the grain of the data.
When thinking of the data processing steps in terms of the components and subcomponents of a SQL statement, Enrichment accounts for actions such as adding a single SELECT column (an Enrichment rule) or a single WHERE clause (a Validation rule).
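Concretely, one Enrichment rule and one Validation rule might correspond to something like the sketch below; the expressions and names are hypothetical and chosen only to show the row-level scope of this step:

```sql
-- Hypothetical sketch: one Enrichment rule and one Validation rule.
SELECT
    order_id,
    amount,
    amount * 0.0825 AS sales_tax   -- Enrichment rule: one derived column per rule
FROM parsed_orders_20240101
WHERE amount >= 0;                 -- Validation rule: a single row-level condition
```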
Refresh
Refresh represents the merge against the source Hub table and the creation of the "one source of truth" dataset. Only one Hub table exists per source. Depending on the Refresh Type, information about history and change tracking may also be captured.
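A hedged sketch of what this merge amounts to, using Databricks/Delta Lake MERGE syntax, invented names, and the cdc_flag column from the earlier CDC sketch:

```sql
-- Hypothetical sketch of the Refresh merge into a source's Hub table.
MERGE INTO hub_orders AS h
USING enriched_orders_20240101 AS e
    ON h.order_id = e.order_id
WHEN MATCHED AND e.cdc_flag = 'U' THEN UPDATE SET *
WHEN NOT MATCHED AND e.cdc_flag = 'I' THEN INSERT *;
```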
Attribute Recalculation
Attribute Recalculation modifies the source Hub table and applies business logic that requires the entire table: cross-row calculations, windowing, and ranking. In the Source's rules, settings define each rule as Snapshot or Keep Current. Where Snapshot rules are calculated during Enrichment, Keep Current rules are calculated during Attribute Recalculation. When "Keep Current" is selected for a rule, every time new data is ingested and processed through the logical data processing flow, Attribute Recalculation recalculates the keep-current rule expression for all data in the Hub table. Recalculations modify the Hub table in place.
As you move along the data processing steps, the resources required to manipulate the data become more intensive. As much as possible, implement logic as Snapshot rules in the Enrichment stage earlier in the data processing flow rather than as Keep Current rules recalculated after Refresh.
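For example, a Keep Current rule might correspond to a window expression like the sketch below (names invented); because it is evaluated over the full Hub table on every run, it is more expensive than a row-level Snapshot rule:

```sql
-- Hypothetical sketch of Keep Current rule expressions recalculated over
-- the entire Hub table.
SELECT
    order_id,
    customer_id,
    amount,
    RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank,
    SUM(amount)  OVER (PARTITION BY customer_id)                AS customer_total
FROM hub_orders;
```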
Output
Output is the mapping of a source to a destination. By default, DataForge persists logic all the way to the output destination. Output maps source Hub table columns to an output schema and then sends the output data to the appropriate destination. Data processing up until this point does not adjust the grain of the data, so at this point aggregations and relational database logic can occur. Outputs also allow you to filter which rows of data are sent.
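As a hedged illustration of an output mapping that filters rows and aggregates to a coarser grain (the destination, table, and column names are all hypothetical):

```sql
-- Hypothetical sketch of an Output mapping with a row filter and an aggregation.
INSERT OVERWRITE TABLE warehouse.daily_customer_sales
SELECT
    order_date,
    customer_id,
    SUM(amount) AS total_amount      -- aggregation changes the grain at Output
FROM hub_orders
WHERE order_date >= '2024-01-01'     -- output row filter
GROUP BY order_date, customer_id;
```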
Data Profile
DataForge includes an optional Data Profile process to provide insight into Source data. By default, Data Profile is turned on and is controlled in the Source settings. Detailed column-level data profile information can be found on the raw schema or the rules created on a Source.