Developer Workflow

Overview

DataForge developers use a repeatable workflow that is easy to follow and makes developing new configurations quick. This article explains the developer workflow and how everything ties together in the platform.

Because DataForge supports many use cases, developers may not need to go through every step of the developer workflow. For example, some customers use DataForge only to ingest tables a single time, while others set up recurring ingestions, use the core framework to develop transformation logic, and publish data out to external systems.


Connect a Source and Ingest Data

Create a connection

The first step to developing in DataForge is creating a connection. Connections are set up for either sources or outputs: source connections allow users to connect to source systems and ingest data, while output connections allow developers to push data out to target systems. Once a connection is created, it can be reused to set up as many sources or outputs as needed.

Create sources from connection data (tables, files, events, APIs, or custom)

After a connection is established, developers create sources that use that connection. Sources can be created directly from connection metadata (for table connections) or manually, in which case the user selects the connection while creating the new source. At a minimum, the user defines a source name and description, a connection type and connection, a source query or file mask, and a refresh type, which determines how data in the hub table is refreshed as new data is ingested. Optionally, the user can modify additional parameters and options that control how data is handled in each process: Ingestion, Parse, CDC, Enrichment, and Refresh.
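For illustration, a source query for a table connection is typically a simple SELECT against the source system, and a file mask is a pattern that matches incoming file names. The table, column, and file names below are hypothetical, not part of any specific environment:

    -- Hypothetical source query for a table connection
    SELECT order_id, customer_id, order_total, updated_at
    FROM sales.orders;

    -- Hypothetical file mask for a file connection, matching daily order extracts
    orders_*.csv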

Ingest first set of data to work with

With the source settings established, the developer ingests a first set of data into the source from the source system. For some developers, this is the end goal of the workflow: they can now use the data in the source hub table in Databricks. For others, it is only the beginning. Now that data exists within the source, developers can define transformation logic by reviewing the source schema and creating relations and rules. Optionally, developers can set up outputs to push the newly ingested data from the DataForge source out to another system.
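As a rough sanity check after the first ingestion, the data can be queried directly in Databricks. The hub table name below is hypothetical; use whatever name your environment assigns to the source's hub table:

    -- Hypothetical sanity check on the newly ingested data
    SELECT COUNT(*) AS row_count,
           MAX(updated_at) AS latest_record
    FROM hub.orders;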


Relate and Enrich Sources

Create relations between sources

Developers create relations between sources, which act similarly to JOINs and allow further transformation logic to be built on a source. Developers use relations in rule expressions to reference or aggregate data across sources.
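Conceptually, a relation captures the join condition between two sources so that rule expressions can reference or aggregate data across them. The Spark SQL below is only a sketch of the equivalent JOIN for two hypothetical sources, orders and customers; it is not DataForge configuration syntax:

    -- Join logic that a relation between hypothetical orders and customers sources would represent
    SELECT o.order_id,
           c.customer_name
    FROM orders o
    JOIN customers c
      ON o.customer_id = c.customer_id;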

Create rules on source

Developers create rules on the source either to validate existing data or to enrich it by adding additional columns to the source hub table. Rule expressions are written in Spark SQL and can be as simple or as complex as the developer needs; they can use relations and window functions where needed. Each rule is set up as either snapshot or keep current, which lets the developer decide whether the rule should always recalculate.
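For illustration, the Spark SQL expressions below sketch what a simple enrichment rule and a simple validation rule might look like. The column names are hypothetical, and the expressions are shown on their own rather than as they would appear in the rule editor:

    -- Hypothetical enrichment rule expression: derive a margin column
    (order_total - order_cost) / order_total

    -- Hypothetical validation rule expression: flag records missing a customer
    customer_id IS NOT NULL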

Recalculate the source

After transformation logic is built into the source using relations and rules, the developer uses the Recalculate options on the source to recalculate any data that was brought into the platform before the transformation logic was set up. Developers can choose to recalculate all rules or only those that have changed since rule results were last calculated.

Validate hub table results and iterate on rule creation as needed

With the source hub table recalculated to include all rule logic, developers can now validate that the source hub table data is what it needs to be. Developers use standard configuration nomenclature when querying the data, as every source has both a hub table and a view that are visible in Databricks.
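For example, a quick spot check in Databricks might confirm that an enriched rule column was populated as expected. The view and column names below are hypothetical; use whatever hub table and view names your environment exposes:

    -- Hypothetical spot check of an enriched column in the source view
    SELECT order_id,
           order_total,
           margin
    FROM hub.orders_view
    WHERE margin IS NULL OR margin < 0
    LIMIT 100;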


Publish Data to Output Systems

Create an output and map the source + attributes

Developers create an output that allows them to publish source hub tables out to external systems such as SQL Server, Snowflake, and Databricks Delta Lake. Similar to setting up a source, the developer starts by defining the output settings, with a minimum of an output name and description, an output type and connection, and a table schema/table name, file type/file name, or event topic. After finishing the output settings, developers map the required source to the output. Additional parameters, such as attribute filters and validation, can further define the dataset that gets pushed through the output. Developers add any columns needed in the external system and map source attributes to these output columns, optionally using the automap features to ease the process.
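Conceptually, a mapped output behaves like a filtered projection of the source data: selected source attributes are renamed to output columns and any attribute filters are applied. The Spark SQL below is a hypothetical sketch of that shape, not the actual output configuration, and all names are made up:

    -- Hypothetical shape of a mapped output with an attribute filter applied
    SELECT order_id      AS OrderID,
           customer_name AS CustomerName,
           order_total   AS OrderTotal
    FROM hub.orders_view
    WHERE order_status = 'COMPLETE';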

Reset output on source or output channel

At this point, developers have both a source and an output, along with data in the source that has not yet been published. From here, developers use the reset output options in the source process options to push the data out to the external system.

Validate output results

Once the data is received in the external system, developers validate that the data from the source hub table has made it correctly to the output destination.
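A simple validation is to compare row counts between the source view in Databricks and the published table in the target system. The names below are hypothetical, and the second query runs on the external system rather than in Databricks:

    -- In Databricks: count rows in the source view
    SELECT COUNT(*) AS source_rows FROM hub.orders_view;

    -- On the target system (for example, SQL Server): count rows in the published table
    SELECT COUNT(*) AS output_rows FROM dbo.Orders;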


Recap

The DataForge developer workflow is easy to follow and can be adapted to meet the developer's needs. In this article, we covered each step of the workflow, from connecting to a source and ingesting data all the way to publishing the data out to external systems. Below is a visual representation of the end-to-end workflow that was covered.

