Setting up Connections

This section describes the process of setting up connections in DataForge.

Logical Data Flow

A Connection is used in both the source (input) and output steps of the data flow.  Separate connections are set up for sources (ingesting data) and outputs (sending data).

Step 1: Review Sample Datasets connection

DataForge communicates with external data services through Connections.  A Connection holds the configuration for where the data lives and what type of data to pull into DataForge.

For ease of use in the Data Integration Example, a source connection named "Sample Datasets" has already been created in DataForge.  It pulls data directly from the TPC-H sample dataset using the credentials established when the DataForge workspace was created.

In a Chrome browser, log in to your DataForge workspace by accessing the URL provided by your DataForge account team.

Navigate to the Connections screen through the Main Menu (triple horizontal lines top-left), as seen below.

A connection has already been created named "Sample Datasets".  

Open the Sample Datasets connection by clicking on the name or row of the connection. 

The name of this connection cannot be modified because it is deployed by the DataForge team during initial deployment and any subsequent version upgrades.  User-created connections can be fully edited (including the name) and deleted.

Connections have varying settings and parameters depending on the type of connection and this page dynamically adjusts to provide those settings as selections are made.  All fields marked with an asterisk (*) are mandatory. 

Key Settings and Parameters:

  • Name: Connections must be given a name before they can be saved.  All connection names must be unique.
  • Description: Free text.  It is recommended to describe where and how the connection pulls data from sources or pushes data to output destinations.
  • Connection Direction: Indicates if the connection will be attached to sources to pull data, or attached to outputs to send data.
  • Connection Type: Indicates whether data is pulled from a table, file, custom notebook (written in Databricks), or using a built-in API connector such as Salesforce.
  • Uses Agent: Yes or No.  Indicates whether the connection relies on an installed DataForge agent to send data through.  An agent is not needed to pull data from or push data to Databricks, so this is set to No.
  • Driver: Used to select which built-in Driver the connection should use, or the Generic JDBC connection when no built-in Driver is available. Examples of other drivers available are SQL Server, Snowflake, MySQL, Postgres, etc.
  • Connection Parameters -> Metadata Refresh: Indicates which information DataForge should collect from this connection to show in Connection Metadata.  'Tables, Columns, Keys' is the default option.  Collecting this metadata lets users bulk-create Sources and automatically generated Relations as a starting point.
  • Connection Parameters -> Metadata Schema Pattern: Limits the metadata collected to certain schemas within the connection system.  Uses LIKE-style patterns; for example, entering "tpch" collects metadata only for schemas matching LIKE %tpch%.
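
The Metadata Schema Pattern behavior can be illustrated with a minimal Python sketch.  It assumes, based on the "tpch" example, that the entered value is wrapped in % wildcards and compared using standard SQL LIKE semantics (% matches any run of characters, _ matches exactly one character).  The function names and the list of schemas are hypothetical and are not part of DataForge; case-sensitive matching is also an assumption.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern (% and _ wildcards) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters, including none
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matching_schemas(schema_pattern: str, schemas: list) -> list:
    """Return the schemas whose names satisfy LIKE %<schema_pattern>%.

    Assumption: the value entered in the field is wrapped in % wildcards,
    as the "tpch" example in this article suggests.
    """
    regex = like_to_regex(f"%{schema_pattern}%")
    return [name for name in schemas if regex.fullmatch(name)]


# Hypothetical schema names, used only for illustration.
schemas = ["tpch", "tpch_sf1", "samples_tpch", "nyctaxi", "information_schema"]
print(matching_schemas("tpch", schemas))
# prints ['tpch', 'tpch_sf1', 'samples_tpch']
```

A narrow pattern keeps the Connection Metadata list short, which makes it easier to bulk-create Sources from only the schemas of interest.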

Click Save in the bottom right to start a new connection test.

Step 2: Testing the Connection

Click the Connections header hyperlink to return to the main Connections page, or navigate to Connections via the Main Menu again.

Every time a connection is resaved, DataForge runs a Connection Test to check whether the connection is working correctly and to collect any optional metadata.  After saving the Sample Datasets connection, an In Progress icon will display in the Status column to the right of the connection.

This connection test may take a few minutes to complete as a new compute resource will be launched to check the validity of the connection.  This connection test runs once per day or whenever a connection is resaved.  Users can force a connection test by selecting the triple-dot menu on the connection row and using the Test connection and refresh metadata option. 

Once the Connection Test is complete, users see either a green checkmark meaning the test was successful, or a red exclamation point meaning the test failed.

Step 3: Creating an Output Connection

Now that a source connection called "Sample Datasets" has been set up and tested successfully, the next step is to create a new Output Connection.  This output connection will be attached to the output that is created later in this Data Integration Example and will be used to publish data to a separate schema.

Use the New + button to create a new connection. 

 

Update the new connection with the following values for your respective workspace platform:

Databricks

  • Name: Table Output
  • Description: Output data to tables
  • Connection Direction: Output
  • Connection Type: Table
  • Driver: Delta Lake
  • Credentials: Implicit
  • Connection Parameters -> Catalog: dataforge 
    • If you customized your catalog name during DataForge workspace creation, use your catalog name. To confirm, open the System page from the main menu and check the Value for the row with Name "datalake-db-name".

The Catalog parameter under Connection Parameters tells Databricks which catalog to create output tables in when the output process runs for any output attached to this connection.  This example uses the default catalog that is used for DataForge.  However, this could be any catalog that you've set up in Databricks.
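
To show where the Catalog value ends up, the sketch below builds a three-level Databricks table name (catalog.schema.table), which is how Unity Catalog addresses tables.  It is a minimal illustration, not how DataForge builds names internally: the function name is hypothetical, and the schema "reporting" and table "orders" are placeholder values that do not come from this example.

```python
def qualified_table_name(catalog: str, schema: str, table: str) -> str:
    """Build a backtick-quoted catalog.schema.table name for Databricks SQL.

    Hypothetical helper for illustration only.
    """
    parts = (catalog, schema, table)
    if not all(part.strip() for part in parts):
        raise ValueError("catalog, schema, and table must all be non-empty")
    # Backticks delimit identifiers in Databricks SQL; a literal backtick
    # inside a name is escaped by doubling it.
    return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


# "dataforge" is the default catalog from this example; the schema and
# table names are placeholders.
print(qualified_table_name("dataforge", "reporting", "orders"))
# prints `dataforge`.`reporting`.`orders`
```

Changing only the Catalog value on the connection would change the first part of this name for every output attached to the connection.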

In this connection, no credential or access token needs to be supplied as there are already credentials configured between DataForge and Databricks.

After completing these steps, the connection settings should look similar to the image below.

Click Save to finish the Output Connection configuration.  

Continue on to the Creating Sources section of this Data Integration Example.

Snowflake

  • Name: Table Output
  • Description: Publish data to Snowflake tables
  • Connection Direction: Output
  • Connection Type: Table
  • Driver: Snowflake
  • Credentials: Implicit
  • Connection Parameters -> Database Name: DATAFORGE
    • If you customized your database name during DataForge workspace creation, use your database name. To confirm, open the System page from the main menu and check the Value for the row with Name "datalake-db-name".

The Database Name parameter under Connection Parameters tells Snowflake which database to create output tables in when the output process runs for any output attached to this connection.  This example uses the default database that is used for DataForge.  However, this could be any database that you've set up in Snowflake.

The Schema will be defined later in the Output Settings.  

In this connection, no credential or access token needs to be supplied as there are already implicit credentials configured.

After completing these steps, the connection settings should look similar to the image below.

Click Save to finish the Output Connection configuration.

Continue on to the Creating Sources section of this Data Integration Example.
