This section describes the process of setting up Sources and configuring source settings for data pulls and processing.
Logical Data Flow
Sources are used to manage the entire logical data flow. Through settings and parameters, or through the different sub-tabs (Raw Schema, Relations, Rules) within the Source UI, the process flow from ingestion through output is managed. Sources are mapped to Outputs in a later step of this integration example, but all processing can be managed from within the Source.
Step 1: Create Sources
There are two ways to create new sources: from scratch in the Sources page or from existing Connection Metadata on the Connections page. For this integration example, we will use the latter option and create new sources directly from the Databricks Unity Samples connection that was created in the previous section. Creating sources from Connection Metadata is quicker as DataForge prepopulates all the source settings and users can create multiple sources at once.
Navigate to the Connections page from the main menu, and open the Databricks Unity Samples connection. Now that the Connection is tested successfully, data will be populated in the Connection Metadata tab. Open the Connection Metadata to see a list of schemas and tables/views available in the source connection.
Click on the checkbox to the right of the rows listing "customer" and "orders". Optionally, you can check the box for all the TPCH tables, but we will only use customer and orders for this example. Notice that there are no referenced tables or primary keys because this information does not exist in the Databricks catalog.
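If you want to confirm this yourself, you can describe one of the tables from a Databricks notebook or SQL editor; the output lists the columns but no primary-key or foreign-key constraints. This assumes the standard samples catalog that ships with Databricks workspaces:

    -- Inspect the sample table definition in Databricks; no key constraints are defined on it
    DESCRIBE TABLE EXTENDED samples.tpch.customer;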
Select the triple-dot menu in the top right above the table and select the Create Source(s) option to have DataForge create new sources for each of the table rows that are checked.
A new dialogue box will appear, allowing you to choose a naming pattern for the newly created sources. Additional options let you automatically create relations between the sources (if primary keys exist in the metadata) and choose whether to initiate a new data pull for each source.
Toggle the Initiate Data Pull option on so that DataForge starts pulling the data for you to work with right away. After making these selections, click OK to finish this process and allow DataForge to create the new sources.
If the sources are not opened in new tabs, navigate to the Sources page from the Main Menu.
The two sources now exist and begin ingesting and processing data. The Status for each source changes from Launching to In Progress to Successful, indicating that the processes are running in a Databricks cluster behind the scenes. You will know processing is complete when a green check mark shows in the Status column.
Click on the source named "Databricks Unity Samples - tpch.customer".
Step 2: Update Source Settings for Customer
With the "Databricks Unity Samples - tpch.customer" source open, navigate to the Settings tab (first/left of the listed tabs). Here in the Settings is where users dictate how DataForge processes work, starting from Ingestion through to Data Profile.
As with Connections, any field marked with an asterisk (*) is mandatory.
Because the Source was created from Connection Metadata, most of the required fields are pre-populated. Upon creation, the Customer source should look like the image below.
Quick Overview of Source Settings:
The main configuration options that define how the system should ingest and process data are:
- Process Config: Defines which cluster configuration from Databricks to use for each type of process. Larger data sets or more complex logic may require different cluster and process configurations. Similarly, clusters can be set up with specific libraries attached to them, commonly done for Generic JDBC connections. For more information, visit Process Configuration.
- Cleanup Config: Defines which Cleanup configuration to use every time the system-led cleanup process runs. For more information, visit Cleanup Configuration.
- Connection Type: Identifies which type of data source this Source will ingest, with the below drop-down options. The Connection Type filters the list of Connections available to use for the source.
- Connection: Identifies which connection (along with its settings and credentials) to use to ingest data. The list of connections depends on the Connection Type selected and the connections that have been created.
- Source Query: The query to run against the connection database, written in the source system's syntax (a sample query is shown after this list).
- Data Refresh: Specifies how DataForge should handle processing, refreshing, and storing the data. The six types are Full, Key, None, Sequence, Timestamp, and Custom.
- Initiation Type: Only varies for File Connection Type sources, which can choose between Scheduled and Watcher.
- Schedule: Specifies which schedule frequency DataForge should use to identify when to initiate new inputs (i.e. ingestion and data processing).
For more information on all of these Settings and further Parameters, please visit Source Settings.
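As a point of reference for the Source Query setting, a query against the Databricks Unity Samples connection is written in Databricks SQL. A minimal example, assuming the standard samples.tpch schema used in this walkthrough, would simply select the customer table:

    -- Example Source Query for the Customer source (samples.tpch schema assumed)
    SELECT * FROM samples.tpch.customer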
Configuration Changes to Make:
For this integration example, a few changes are needed for the Source Settings for both sources before ingesting data.
Change the Data Refresh type to Key from the drop-down options and set the Key Column to "c_custkey". These sample data sets are a good fit for the Key refresh type because every row has a logical identifier that does not change over time, even as the other column values are updated.
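Conceptually, a key-based refresh behaves like a merge keyed on that column: incoming rows whose key already exists update the stored record, and rows with a new key are inserted. The Databricks SQL below is only an illustrative sketch of that idea, using hypothetical table names (existing_data, new_input), not DataForge's actual implementation:

    -- Conceptual sketch of a key-based refresh; table names are hypothetical
    MERGE INTO existing_data AS target
    USING new_input AS incoming
      ON target.c_custkey = incoming.c_custkey
    WHEN MATCHED THEN UPDATE SET *      -- same key: update the existing row
    WHEN NOT MATCHED THEN INSERT *;     -- new key: insert the row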
After these two changes, Save the source to apply the changes to DataForge. After hitting Save, a pop-up box will appear notifying you that, because the Refresh Type has changed, CDC must be reset on the source. Changing the refresh type means DataForge should process all data differently than it had been, so resetting CDC for all data already loaded is required.
Select the SAVE CHANGES & RESET CDC button to re-process the data already ingested.
After saving the changes to the source settings, users are directed to the Inputs tab. At any point, you can maximize screen space by closing the left-hand menu; click the left arrow at the top of the menu to close it.
Step 3: Run a New Data Pull into the Customer Source
On the Inputs tab of the Source, an input (a batch of data) already exists because the Initiate Data Pull option was selected when the source was created. Select the Pull Now option to initiate a new input and ingest data.
After using the Pull Now option, a new input representing that data pull appears in the table below. The input moves from Ingestion all the way through Attribute Recalculation. This may take a few minutes, as DataForge starts the new ingestion by requesting a cluster from Databricks to use for data processing; if a cluster is already running, the ingestion finishes a few minutes faster because it does not have to wait for a cluster to start. To see the processes running individually, click the status icon of the input to navigate directly to the Process tab, then click a process record to expand it and see the processes listed. When processing finishes, a green check mark appears in the Status column for the input.
Step 4: Update Source Settings for Orders
Navigate back to the Sources page through the main menu or select the source name drop-down at the top to open the list of sources.
Now open the source named "Databricks Unity Samples - tpch.orders". Navigate to the Settings tab within the source. Similar to the Customer source, there are a few changes needed for the Orders source before ingesting data.
Configuration Changes to Make:
Change the Data Refresh type to Key from the drop-down options and set the Key Column to "o_orderkey".
After these two changes, Save the source to apply the changes to DataForge. After hitting Save, a pop-up appears noting that CDC must also be reset to save these changes.
Select the "Save Changes & Reset CDC" button to re-process the data already ingested. After saving the changes to the source settings, users are directed to the Inputs tab. Once the Reset CDC process is complete, the input and source status will change to the green checkmark again.
Since data already exists in this source to work with, there is no need to use the Pull Now option again yet.
Continue on to the Creating Relations and Rules section of this Data Integration Example.