Getting Familiar with the Data

This section covers the exploration of the Databricks sample data needed for the rest of the Data Integration Example.  

Logical Data Flow

Typically, the first step in getting started with DataForge is to create or identify a raw data location to ingest data from.  For ease of getting started, the Data Integration Example will use the TPCH sample dataset from Databricks Unity Catalog that is provided in every new workspace.  The steps outlined in this Getting Familiar section will demonstrate how to query and get familiar with the TPCH dataset.


Step 1: Navigate to Databricks Data Explorer from DataForge

Start by opening the Main Menu in DataForge by clicking the three horizontal lines in the top left of the application.  From the Main Menu, select the Databricks option to launch the Databricks Workspace that is attached to the DataForge environment. 


Hover over the left-hand menu in Databricks and select the Catalog option to open the Data Explorer page.

If a SQL warehouse or Databricks cluster is not already running, select the option to start the Starter Warehouse.  The warehouse may take a few minutes to start.  Once it is running, all catalogs, schemas, and tables will populate in the Data Explorer view.

Step 2: Explore the Samples Catalog and TPCH table data

Once the SQL warehouse is up and running, expand the samples catalog and then the tpch schema.  A hive_metastore catalog also exists, but it will not be used for the purposes of this example.  Click the customer table to view metadata related to the table, then select Create and choose Query.  This creates a new Databricks query with a starter SQL statement that selects data from the customer table.
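The generated starter statement will look similar to the following (the exact column list and row limit Databricks generates may differ by version):

SELECT * FROM samples.tpch.customer LIMIT 1000;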

The query name starts with "New query" and will not be saved unless you rename and save it.  If your SQL warehouse is not started yet, select the Connect drop-down next to the Schedule and Share options, then select either the warehouse that was previously started or another warehouse to use for querying the data.  Lastly, run the query to see the results of the customer table data.

The customer table will be one of two primary tables used in the Data Integration Example.  Scan the data and columns to get familiar with this table.

Add a new query using the + button.  In the second query, paste in the following SQL code:

SELECT * FROM samples.tpch.orders

Run the second query to view the data held in the orders table.  This will be the second of the two primary tables used in the Data Integration Example.  Scan its data and columns to get familiar with this table as well.

Notice that both tables have a column relating to custkey.  These columns will be used to join the data in DataForge for new rule and validation logic later in this example workflow.  Note: the screenshot below is from a notebook and is shown for visual purposes only.
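As a preview of the relationship DataForge will use later, the two tables can be joined on their custkey columns with a query like the following (column names follow the standard TPC-H naming used in the samples catalog):

SELECT c.c_name, o.o_orderkey, o.o_orderdate, o.o_totalprice
FROM samples.tpch.customer c
JOIN samples.tpch.orders o
  ON c.c_custkey = o.o_custkey
LIMIT 100;

Running this in the query editor confirms that each order row matches exactly one customer row.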

Although these two tables will be primarily used in the Data Integration Example, feel free to explore other datasets from within the Databricks query editor.  When ready, proceed to the Setting up Connections section to continue the Data Integration Example. 
