This section describes the process of setting up Sources and configuring source settings for data pulls and processing.
Logical Data Flow
Sources are used to manage the entire logical data flow. Through settings and parameters, or through the different sub-tabs (Raw Schema, Relations, Rules) within the Source UI, the process flow from ingestion through output is managed. Sources are mapped to Outputs in a later step of this integration example, but all processing can be managed from within the Source.
Step 1: Create Sources
There are two ways to create new sources: from scratch in the Sources page or from existing Connection Metadata on the Connections page. For this integration example, we will use the latter option and create new sources directly from the Databricks Unity Samples connection that was created in the previous section. Creating sources from Connection Metadata is quicker as DataForge prepopulates all the source settings and users can create multiple sources at once.
Navigate to the Connections page from the main menu, and open the Databricks Unity Samples connection. Now that the Connection is tested successfully, data will be populated in the Connection Metadata tab. Open the Connection Metadata to see a list of schemas and tables/views available in the source connection.
Click on the checkbox to the right of the rows listing "customer" and "orders". Optionally, you can check the box for all the TPCH tables, but we will only use customer and orders for this example. Notice that there are no referenced tables or primary keys because this information does not exist in the Databricks catalog.
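If you want to confirm this yourself, you can describe one of the tables from a Databricks notebook or SQL editor; the output lists the columns but no primary-key or foreign-key constraints. This assumes the standard samples catalog that ships with Databricks workspaces:

    -- Inspect the sample table definition in Databricks; no key constraints are defined on it
    DESCRIBE TABLE EXTENDED samples.tpch.customer;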
Select the triple-dot menu in the top right above the table and select the Create Source(s) option to have DataForge create new sources for each of the table rows that are checked.
A new dialogue box will appear, allowing you to choose a naming pattern for the newly created sources. Additional options let you automatically create relations between the sources (if primary keys exist in the metadata) and choose whether to initiate a new data pull for each source.
Toggle the Initiate Data Pull option on so that DataForge starts pulling the data for you to work with right away. After making these selections, click OK to finish this process and allow DataForge to create the new sources.
If the sources are not opened in new tabs, navigate to the Sources page from the Main Menu.
The two sources now exist and begin ingesting and processing data. The Status for each source changes from Launching to In Progress to Successful, indicating that the processes are running in a Databricks cluster behind the scenes. You will know processing is complete when a green check mark shows in the Status column.
Click on the source named "Databricks Unity Samples - tpch.customer".
Step 2: Update Source Settings for Customer
With the "Databricks Unity Samples - tpch.customer" source open, navigate to the Settings tab (first/left of the listed tabs). Here in the Settings is where users dictate how DataForge processes work, starting from Ingestion through to Data Profile.
As with Connections, any field marked with an asterisk (*) is mandatory.
Because the Source was created from Connection Metadata, most of the required fields are pre-populated. Upon creation, the Customer source should look like the image below.
Quick Overview of Source Settings:
The main configuration options that define how the system should ingest and process data are:
- Process Config: Defines which cluster configuration from Databricks to use for each type of process. Larger data sets or more complex logic may require different cluster and process configurations. Similarly, clusters can be set up with specific libraries attached to them, commonly done for Generic JDBC connections. For more information, visit Process Configuration.
- Cleanup Config: Defines which Cleanup configuration to use every time the system-led cleanup process runs. For more information, visit Cleanup Configuration.
- Connection Type: Identifies which type of data source this Source will ingest, with the below drop-down options. The Connection Type filters the list of Connections available to use for the source.
- Connection: Identifies which connection (along with its settings and credentials) to use to ingest data. The list of connections depends on the Connection Type selected and the connections that have been created.
- Source Query: The query to run against the connection database, written in the source system's syntax (a sample query is shown after this list).
- Data Refresh: Specifies how DataForge should handle processing, refreshing, and storing the data. The six types are Full, Key, None, Sequence, Timestamp, and Custom.
- Initiation Type: Only varies for File Connection Type sources, which can choose between Scheduled and Watcher.
- Schedule: Specifies which schedule frequency DataForge should use to identify when to initiate new inputs (i.e. ingestion and data processing).
For more information on all of these Settings and further Parameters, please visit Source Settings.
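As a point of reference for the Source Query setting, a query against the Databricks Unity Samples connection is written in Databricks SQL. A minimal example, assuming the standard samples.tpch schema used in this walkthrough, would simply select the customer table:

    -- Example Source Query for the Customer source (samples.tpch schema assumed)
    SELECT * FROM samples.tpch.customer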
Configuration Changes to Make:
For this integration example, a few changes are needed for the Source Settings for both sources before ingesting data.
Change the Data Refresh type to Key from the drop-down options and set the Key Column to "c_custkey". These sample data sets are a good fit for the Key refresh type because every row has a logical identifier that does not change over time, even as the other column values are updated.
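Conceptually, a key-based refresh behaves like a merge keyed on that column: incoming rows whose key already exists update the stored record, and rows with a new key are inserted. The Databricks SQL below is only an illustrative sketch of that idea, using hypothetical table names (existing_data, new_input), not DataForge's actual implementation:

    -- Conceptual sketch of a key-based refresh; table names are hypothetical
    MERGE INTO existing_data AS target
    USING new_input AS incoming
      ON target.c_custkey = incoming.c_custkey
    WHEN MATCHED THEN UPDATE SET *      -- same key: update the existing row
    WHEN NOT MATCHED THEN INSERT *;     -- new key: insert the row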
After these two changes, Save the source to apply the changes to DataForge. After hitting Save, a pop-up box will appear notifying you that, because the Refresh Type has changed, CDC must be reset on the source. Changing the refresh type means DataForge should process all data differently than it had been, so resetting CDC for all data already loaded is required.
Select the SAVE CHANGES & RESET CDC button to re-process the data already ingested.
After saving the changes to the source settings, users are directed to the Inputs tab. At any point, you can maximize screen space by closing the left-hand menu; click the left arrow at the top of the menu to close it.
Step 3: Run a New Data Pull into the Customer Source
On the Inputs tab of the Source, an input (a batch of data) already exists because the Initiate Data Pull option was selected when the source was created. Select the Pull Now option to initiate a new input and ingest data.
After using the Pull Now option, a new input representing that data pull appears in the table below. The input moves from Ingestion all the way through Attribute Recalculation. This may take a few minutes, as DataForge starts the new ingestion by requesting a cluster from Databricks to use for data processing; if a cluster is already running, the ingestion finishes a few minutes faster because it does not have to wait for a cluster to start. To see the processes running individually, click the status icon of the input to navigate directly to the Process tab, then click a process record to expand it and see the processes listed. When processing finishes, a green check mark appears in the Status column for the input.
Step 4: Update Source Settings for Orders
Navigate back to the Sources page through the main menu or select the source name drop-down at the top to open the list of sources.
Now open the source named "Databricks Unity Samples - tpch.orders". Navigate to the Settings tab within the source. Similar to the Customer source, there are a few changes needed for the Orders source before ingesting data.
Configuration Changes to Make:
Change the Data Refresh type to Key from the drop-down options and set the Key Column to "o_orderkey".
After these two changes, Save the source to apply the changes to DataForge. After hitting Save, a pop-up appears noting that CDC must also be reset to save these changes.
Select the "Save Changes & Reset CDC" button to re-process the data already ingested. After saving the changes to the source settings, users are directed to the Inputs tab. Once the Reset CDC process is complete, the input and source status will change to the green checkmark again.
Since data already exists in this source to work with, there is no need to use the Pull Now option again yet.
Continue on to the Creating Relations and Rules section of this Data Integration Example.