This section describes the process of setting up connections in DataForge.
Logical Data Flow
A Connection is used as part of the source input and output steps of the data flow. We set up separate connections for sources (ingesting data) and for outputs (sending data).
Step 1: Review Databricks Unity Samples connection
DataForge communicates with external data services through Connections. A Connection consists of configuration describing where the data lives and what type of data to pull into DataForge.
For ease of use in this Data Integration Example, a source connection named "Databricks Unity Samples" has already been created in DataForge. It allows data to be pulled directly from the Databricks TPCH sample dataset using credentials already established between DataForge and Databricks.
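To get a feel for the data this connection exposes, you can preview the TPCH sample tables directly in a Databricks notebook. The snippet below is a minimal sketch, assuming the standard samples catalog and tpch schema that ship with Databricks workspaces; it is not part of the DataForge setup, and table names may differ in your environment.

```python
# Run in a Databricks notebook (not in DataForge) to preview the sample data.
# Assumes the standard Databricks "samples" catalog with the "tpch" schema.
orders = spark.table("samples.tpch.orders")
orders.printSchema()
orders.limit(5).show()
```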
In a Chrome browser, navigate to the DataForge Sources screen by accessing the URL provided by your DataForge account team and logging in.
Navigate to the Connections screen through the Main Menu (triple horizontal lines top-left), as seen below.
Two connections have already been created: Databricks Hive Metastore and Databricks Unity Samples. The Databricks Hive Metastore connection will not be used in this Data Integration Example, but it is provided for later use when ingesting Delta tables from the Databricks Hive Metastore.
Open the Databricks Unity Samples connection by clicking on the name or row of the connection. Below is an example of what the connection looks like on first opening.
The name of this connection cannot be modified because it is deployed by the DataForge team during initial deployment and any subsequent version upgrades. User-created Connections can be fully edited, including the name, and can also be deleted.
Connections have varying settings and parameters depending on the type of connection and this page dynamically adjusts to provide those settings as selections are made. All fields marked with an asterisk (*) are mandatory.
Key Settings and Parameters:
- Name: Connections must be given a name before they can be saved. All connection names must be unique.
- Description: Free-form text; it is recommended to describe where and how the connection pulls data from sources or pushes data to output destinations.
- Connection Direction: Indicates if the connection will be attached to sources to pull data, or attached to outputs to send data.
- Connection Type: Indicates whether data is pulled from a table, file, custom notebook (written in Databricks), or using a built-in API connector such as Salesforce.
- Uses Agent: Yes or No; indicates whether the connection relies on an installed DataForge agent to send data through. An agent is not needed to pull or push data between DataForge and Databricks, so this is set to No.
- Driver: Used to select which built-in Driver the connection should use, or the Generic JDBC connection when no built-in Driver is available. Examples of other drivers available are SQL Server, Snowflake, MySQL, Postgres, etc.
- Catalog: Specific to the Unity driver option. Allows users to specify a Unity catalog to ingest data from or publish data to. The default is hive_metastore, which is where all hub tables are stored.
- Connection Parameters -> Metadata Refresh: Indicates which information DataForge should collect from this connection to show in Connection Metadata. Tables and Keys is the default option. Collecting this gives users the option to bulk-create Sources and automatic Relations as a starting point.
- Connection Parameters -> Metadata Schema Pattern: Limits the metadata collected to certain schemas within the connected system. Uses LIKE-style patterns, e.g. entering "tpch" collects metadata only for schemas LIKE %tpch% (see the sketch after this list).
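For clarity on how the LIKE-style schema pattern behaves, the following sketch approximates the matching semantics in Python. It is purely illustrative: the schema names are hypothetical, and DataForge performs this filtering internally when it refreshes metadata.

```python
# Conceptual illustration of SQL LIKE matching (% = any run of characters,
# _ = a single character). Not DataForge code.
import re

def like_match(pattern: str, value: str) -> bool:
    # Translate LIKE wildcards into a regular expression and compare.
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, re.IGNORECASE) is not None

schemas = ["tpch", "tpcds", "nyctaxi", "information_schema"]
print([s for s in schemas if like_match("%tpch%", s)])  # -> ['tpch']
```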
Click Save in the bottom right to save the connection and launch a test.
Step 2: Testing the Connection
Click the Connections header hyperlink to return to the main Connections page, or navigate to Connections via the Main Menu again.
Every time a connection is resaved, DataForge will run a Connection Test to check whether the connection is working correctly. After saving the Databricks Unity Samples connection, an In Progress icon will display in the Status column to the right of the connection.
This connection test may take a few minutes to complete as a new cluster will be launched through Databricks to check the validity of the connection. This connection test runs once a day or whenever a connection is resaved. Users can also force a connection test by selecting the triple-dot menu on the connection row and using the Test connection and refresh metadata option.
Once the Connection Test is complete, users see either a green checkmark meaning the test was successful, or a red exclamation point meaning the test failed.
Step 3: Creating an Output Connection
Now that a source connection called "Databricks Unity Samples" has been set up and tested successfully, the next step is to create a new Output Connection. This output connection will be attached to the output that is created later in this Data Integration Example and will be used to write data to Delta Lake tables in Databricks Hive Metastore.
Use the New + button to create a new connection.
Update the new connection with the following values:
- Name: Databricks Delta Lake Output
- Description: Output data into Databricks Delta Lake Hive Metastore
- Connection Direction: Output
- Connection Type: Table
- Driver: Delta Lake
- Credentials: Implicit
- Connection Parameters -> Catalog: hive_metastore
The Catalog parameter under Connection Parameters tells Databricks which catalog any output tables attached to this connection should be created in when the output process runs.
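DataForge's output process performs these writes automatically once the output is configured; for context, the sketch below shows roughly what an equivalent Delta table write into the hive_metastore catalog looks like from a Databricks notebook. The schema and table names here are hypothetical.

```python
# Illustrative only: DataForge handles this write when the output runs.
# Assumes the target schema already exists in hive_metastore; names are hypothetical.
df = spark.table("samples.tpch.orders")  # any DataFrame to publish
(df.write
   .format("delta")
   .mode("overwrite")
   .saveAsTable("hive_metastore.example_schema.orders_output"))
```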
In this connection, no credential or access token needs to be supplied because credentials are already configured between DataForge and Databricks.
After completing these steps, the output should look similar to the below image. Click Save to finish the Output Connection configuration. When both Source and Output connections are configured, move on to the next section, Creating Sources.
Continue on to the Creating Sources section of this Data Integration Example.