Developer Workflow

Overview

DataForge developers use a repeatable workflow that is easy to follow and makes it quick to develop new configurations. This section describes that workflow and how everything ties together in the platform before you take a deep dive into how to set up each configuration in the rest of this training.

The main difference between users stepping through this workflow is whether they prefer to use Talos AI for assistance or the traditional developer experience. In the following workflow, each step is shown with the traditional developer experience, accompanied by a similar Talos prompt that accomplishes the same task.

Because DataForge offers many capabilities, you may not always need to go through every step of the developer workflow. For example, sometimes you may only use DataForge to ingest tables one time; other times you may set up recurring ingestions, use the main code framework to develop transformation logic, and publish out to external systems.

Connect a Source and Ingest Data

Create a connection

The first step to developing in DataForge is creating a connection. Connections are set up for either sources or outputs. Source connections allow users to connect to source systems to ingest data. Output connections allow developers to push data out to target systems. Once the developer creates a connection, it can be re-used to set up as many sources or outputs as needed.  

Talos AI: Talos does not currently support creating connections, as connection parameters are typically sensitive and secret.

Update data dictionaries in table Connections

Once connections are created, export, update, and import the data dictionaries for tables and columns in the connection settings to assist with bulk source and relation creation later in the workflow. This also makes Talos smarter at helping you discover data.

Talos AI: Talos currently does not support updating data dictionaries directly. However, since Talos uses generative AI, you can prompt it to help write descriptions for tables and columns by attaching a file and using a prompt like the following:

"Can you generate descriptions for what you think each of the columns does in the uploaded data dictionary?"

Create sources from connection data (tables, files, events, API, or custom)

After a connection is established, developers create sources that utilize that connection. Sources can be created directly from connection metadata (for table connections) or by hand, where the user chooses the connection to use as they create a new source. At a minimum, the user defines a source name and description, connection type and connection, source query or file mask, and refresh type, which determines how data in the hub table is refreshed as new data is ingested. Optionally, the user modifies additional parameters and options related to how data is handled in each process: Ingestion, Parse, CDC, Enrichment, and Refresh.
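
For a table connection, the source query is ordinary SQL run against the source system. Below is a minimal sketch, assuming an AdventureWorks-style SQL Server source; the table and column names are illustrative, not required by DataForge.

-- Hypothetical source query for a SalesOrderDetail source; trim the column list to what you need in the hub table.
SELECT SalesOrderID, SalesOrderDetailID, ProductID, OrderQty, UnitPrice, ModifiedDate
FROM Sales.SalesOrderDetail;
-- For a file connection, you would supply a file mask instead, for example sales_order_detail_*.csv (also hypothetical).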

Talos AI: Talos has a "Find Data" tool to help you find the data you're looking for. Once you've found it, you can use the "Create Sources" tool with Talos to set up all the sources you need.

"Find data for sales order detail revenue by customer name, territory name, and product from Adventureworks"

"Create sources from the matched tables using incremental refresh"

Ingest first set of data to work with

With the source settings established, the developer ingests the first set of data into the source from the source system. For some developers, this is the end result they are looking for in the developer workflow as they can now utilize the data in the source hub table in Databricks.  Other times, this is only the beginning.  Now that data exists within the source, developers can define any transformation logic by reviewing the source schema and creating relations and rules.  Optionally, developers can set up outputs to push this newly ingested data from the DataForge source out to another system.
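
As a quick sanity check after the first ingestion, the hub table can be queried directly in Databricks. The catalog, schema, and table names below are placeholders for however your environment exposes the source hub table.

-- Peek at the newly ingested SalesOrderDetail data (names are hypothetical).
SELECT *
FROM main.dataforge.sales_order_detail_hub
LIMIT 100;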

Talos AI: Talos can help you initiate new data pulls and start specific processes as requested. You will be prompted to initially pull data if you've used Talos to set up sources, but you can request it at any time. Use the "Find source" tool first if you're unsure which sources you want to start new processes for.

"Pull data on the Product and SalesOrderHeader sources"

"Start data profiling for the Product source"

Relate and Enrich Sources

Create relations between sources

You'll create relations between sources, which act similarly to JOINs and allow you to build further transformation logic on a source. You can use relations in rule expressions to reference or aggregate data across sources. If table keys are identified in the connection data dictionary, DataForge will automatically create relations for you when you bulk create sources from connection metadata.
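
As a rough mental model only (this is plain Spark SQL, not DataForge relation syntax), a relation between SalesOrderDetail and Product on ProductID lets rules behave as if the two sources were joined like this:

-- Illustrative join that a SalesOrderDetail-to-Product relation conceptually represents; table and column names are placeholders.
SELECT d.SalesOrderID, d.OrderQty, p.Name AS product_name
FROM sales_order_detail d
JOIN product p
  ON d.ProductID = p.ProductID;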

Talos AI: Talos can recommend relations between your sources using the "Recommend relations" tool. You will be prompted to run Data Profiling on your sources if it's not already completed so Talos can identify best matches between columns.

"Recommend relations between all the sources"

Create rules on source

You create rules in the source to either validate existing data or to enrich the data by adding additional columns to the source hub table. You use Spark SQL to write rule expressions that define the logic in the source. These rule expressions can be as simple or complex as you need, and they can utilize relations and window functions. Rules are set up as either snapshot or keep current, allowing you to decide whether the rule should always recalculate.

Talos AI: Talos can assist you in creating rules as simple or complex as you need, using any data types. If a rule should use columns from other sources, specify which source you want the data to come from in the prompt.

"In Sales Order Detail, calculate revenue from orderqty times unitprice"

"Create a rule named store_specialty on the store source to extract the xml field of Specialty from the demographics column"

Recalculate the source

After transformation logic is built into the source using relations and rules, the developer uses the Recalculate options on the source to recalculate any data that has already been brought into the platform before the transformation logic was set up.  Developers have options to recalculate all rules or only those that may have changed since the last time rule results were calculated.

Talos AI: Talos can recalculate your rules when asked.

"Recalculate the sales order header source"

Validate hub table results and iterate on rule creation as needed

With the source hub table recalculated to include all rule logic, you can now validate that the source hub table data is what it needs to be. You use standard configuration nomenclature when querying the data, as every source has both a hub table and a view that are visible in Databricks.
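
For example, a quick validation query in Databricks might re-check a rule result against its inputs. The catalog, schema, and view names below are placeholders rather than DataForge's actual naming convention.

-- Any rows returned suggest the revenue rule did not calculate as expected (hypothetical names).
SELECT SalesOrderID, OrderQty, UnitPrice, revenue
FROM main.dataforge.sales_order_detail_view
WHERE revenue <> OrderQty * UnitPrice;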

Talos AI: Talos does not currently support querying source data directly.

Publish Data to Output Systems

Create an output and map the source + attributes

You create an output that allows you to publish the source hub tables out to external systems, such as file storage, SQL Server, Snowflake, and Databricks Delta Lake. Similar to setting up a source, you start by defining the output settings with, at a minimum, an output name and description, output type and connection, and table schema/table name, file type/file name, or event topic. After finishing the output settings, you map the source needed to the output. Additional parameters, such as attribute filters and validation, can be used to further define the dataset that gets pushed through the output. You add any columns that are needed in the external system and map source attributes to these output columns, optionally using automap features to ease this process.

Talos AI: Talos can help you turn sources and attributes into an output mapping to publish data outside of DataForge. Use the "Find source" or "Find attribute" workflows to identify what information you want included in the output creation. Talos will help you identify the right connection and create your output.

"Create a new output named "Sales Orders" for line total, customer, territory, product, and tax amount from sales order sources"

Reset output on source or output channel

At this point, you have both a source and an output, along with data that lives in the source but hasn't been published yet. From here, you use the reset output option in the source's process options to push the data out to the external system.

Talos AI: Talos does not currently support resetting output processes. However, you can prompt Talos to recalculate certain sources mapped to your outputs, which also reruns the output processes.

"Recalculate sales order detail"

Validate output results

Once the data is received in the external system, developers validate that the data from the source hub table has made it correctly to the output destination.
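
A common way to do this is to run the same aggregate on both sides and compare the results. The names below are placeholders.

-- Run in Databricks against the source hub view (hypothetical names).
SELECT COUNT(*) AS row_count, SUM(line_total) AS total_line_amount
FROM main.dataforge.sales_order_detail_view;

-- Run in the output system against the published table and compare.
SELECT COUNT(*) AS row_count, SUM(line_total) AS total_line_amount
FROM analytics.sales_orders;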

Recap

The DataForge developer workflow is easy to follow and can be modified to meet your needs. This section covered each step of the workflow, starting from connecting to a source and ingesting data all the way through publishing the data out to external systems. Below is a visual representation of this end-to-end workflow.

[Image: Developer Workflow - vertical version]

 
