Getting Familiar with the Data

This section covers exploring the Databricks or Snowflake sample data needed for the rest of the Data Integration Example.

Logical Data Flow

The first step in getting started with DataForge is to create or identify a raw data location to ingest data from.  To keep things simple, the Data Integration Example uses the TPC-H dataset.  The steps in this section demonstrate how to query the TPC-H dataset and become familiar with it.

If your workspace uses Databricks, continue with the following steps.

If your workspace uses Snowflake, skip to the Snowflake steps.

Databricks

Step 1: Navigate to Databricks Data Explorer from DataForge

Start by opening the Main Menu in DataForge by clicking the three horizontal lines in the top left of the application.  From the Main Menu, select the Databricks option to launch the Databricks Workspace that is attached to the DataForge workspace. 

Select the Catalog option in the left-hand menu to open the Catalog Explorer page.

Step 2: Query the Samples Catalog and TPCH table data

Expand the Samples catalog, then the TPCH schema. Click the Customer table to view its metadata, then select the Create option and choose Query.  This creates a new Databricks query with a starter SQL statement that queries data from the Customer table.

 

If a SQL Warehouse is not attached yet, select an existing warehouse or create a new one. Finally, run the query to view the Customer table data.

The customer table will be one of two primary tables used in the Data Integration Example.  Scan the data and columns to get familiar with it.  
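As a rough sketch, the exploration above can also be done by hand in the query editor. The exact starter statement that Databricks generates may differ from this one; the LIMIT value is an arbitrary choice to keep the result set small.

```sql
-- Preview a small sample of the Customer table.
-- (The starter query generated by the UI may look different.)
SELECT *
FROM samples.tpch.customer
LIMIT 100;

-- List the table's column names and data types.
DESCRIBE TABLE samples.tpch.customer;
```

Run each statement separately to see the sample rows and the column list side by side.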

Add a new query using the + button.  In the second query, paste in the following SQL code:

SELECT * FROM samples.tpch.orders

Run the second query to view the data held in the Orders table.  This will be the second of the two primary tables used in the Data Integration Example.  Scan the data and columns to get familiar with this data as well.

Notice that both tables have a customer key column: c_custkey in the Customer table and o_custkey in the Orders table.  These columns will be used to join the data in DataForge, which supports the new rule and validation logic later in this example workflow.
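To see how the two tables relate before building anything in DataForge, a join on the customer key can be run directly in the query editor. This is only an illustrative sketch: it assumes the standard TPC-H column names (c_custkey on Customer, o_custkey on Orders), and the selected columns and LIMIT are arbitrary. It is not the join definition DataForge itself will use.

```sql
-- Illustrative join of customers to their orders on the customer key.
SELECT
  c.c_custkey,
  c.c_name,
  o.o_orderkey,
  o.o_orderdate,
  o.o_totalprice
FROM samples.tpch.customer AS c
JOIN samples.tpch.orders AS o
  ON o.o_custkey = c.c_custkey
LIMIT 100;
```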

Although these two tables will be primarily used in the Data Integration Example, feel free to explore other datasets from within the Databricks query editor.  When ready, proceed to the Setting up Connections section to continue the Data Integration Example. 

Snowflake

Step 1: Navigate to Snowflake Workspaces from DataForge

Start by opening the Main Menu in DataForge by clicking the three horizontal lines in the top left of the application.  From the Main Menu, select the Snowflake option to launch the Snowflake account that is attached to the DataForge workspace.

Select the Projects option in the left-hand menu and select Workspaces.

Step 2: Query the SNOWFLAKE_SAMPLE_DATA database and TPCH_SF1 table data

Expand the SNOWFLAKE_SAMPLE_DATA database, then the TPCH_SF1 schema and its tables.

Click the three-dot options menu on the Customer table and select Preview Table.

The customer table will be one of two primary tables used in the Data Integration Example.  Scan the data and columns to get familiar with this data.  
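If you prefer a query over the table preview, something like the following should return comparable results. It is a minimal sketch; the LIMIT value is an arbitrary choice to keep the result set small.

```sql
-- Preview a small sample of the Customer table.
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
LIMIT 100;

-- List the table's column names and data types.
DESCRIBE TABLE SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER;
```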

Add a new query using the + button.  In this query, run the following SQL code:

SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS

This Orders data will be the second of the two primary tables used in the Data Integration Example.  Scan the data and columns to get familiar with it.

Notice that both tables have a customer key column: C_CUSTKEY in the Customer table and O_CUSTKEY in the Orders table.  These columns will be used to join the data in DataForge, which supports the new rule and validation logic later in this example workflow.
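To see how the two tables relate before building anything in DataForge, a join on the customer key can be run directly in Snowflake. This is only an illustrative sketch: it assumes the standard TPC-H column names (C_CUSTKEY on CUSTOMER, O_CUSTKEY on ORDERS), and the selected columns and LIMIT are arbitrary. It is not the join definition DataForge itself will use.

```sql
-- Illustrative join of customers to their orders on the customer key.
SELECT
  c.C_CUSTKEY,
  c.C_NAME,
  o.O_ORDERKEY,
  o.O_ORDERDATE,
  o.O_TOTALPRICE
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER AS c
JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS AS o
  ON o.O_CUSTKEY = c.C_CUSTKEY
LIMIT 100;
```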

Although these two tables will be primarily used in the Data Integration Example, feel free to explore other datasets from within the Snowflake query editor.  When ready, proceed to the Setting up Connections section to continue the Data Integration Example.

 
