DataForge allows users to connect to external tables that are set up in Databricks as Delta or Hive tables. This provides the ability to create downstream relations and rules on sources within DataForge while the underlying data remains managed externally.
Differences in Unmanaged External Sources
Below are the main differences between Unmanaged External sources and DataForge-managed sources.
- No connection setup is necessary, as Unmanaged External sources always point to Databricks table definitions
- Data View is not provided; data should be queried in Databricks, where the table is managed (see the example query below)
- Inputs are created and ingestion/parse processes are run to parse the raw schema for use within DataForge. No other processes, including Refresh or Output, will be run.
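Because Data View is unavailable for these sources, results are checked directly in Databricks. A minimal example query, using a hypothetical fully qualified table name:

```sql
-- Hypothetical name; replace with the fully qualified Delta or Hive table
-- that the Unmanaged External source points to.
SELECT *
FROM main.sales.orders_external
LIMIT 100;
```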
Setting Up
Create a new source that will be used to connect to the unmanaged external table. Open the connection type drop-down and select the Unmanaged External option.
Provide the table name of the source table to be used. When the table name is populated, save the source settings. Upon saving, a new Input will be created to parse the source table's raw attributes.
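If the exact table name is unknown, it can be confirmed in Databricks before saving. A brief sketch, with hypothetical catalog and schema names:

```sql
-- Hypothetical catalog and schema; adjust for your workspace.
SHOW TABLES IN main.sales;

-- Confirm the columns DataForge will parse as the raw schema.
DESCRIBE TABLE main.sales.orders_external;
```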
Unmanaged External sources need a new data pull to create references to any newly added attributes in the external table. Data pulls can be run manually as needed or set on a Schedule.
Example 1: Connecting to Delta table to enhance rule logic
Create the unmanaged external source and identify the table name.
Save the source settings and allow the first input to complete the Parse process, which populates the raw schema page.
Create relation(s) between the Unmanaged Source and DataForge Source(s) to use in downstream rules.
Create rule(s) in DataForge source(s) to reference data from the unmanaged source using the new relation.
Reprocess any existing inputs by using the Recalculate All or Recalculate Changed commands. When complete, view the rule results in Data View or in Databricks notebooks/queries, as sketched below.
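To illustrate what the relation and rule accomplish, the sketch below is a rough Databricks SQL equivalent of the result you would expect when checking the output in a notebook. All table and column names (the managed hub table, the external table, customer_id, credit_limit) are hypothetical placeholders, not DataForge-generated names:

```sql
-- Rough SQL equivalent of a relation joining a DataForge-managed source
-- to the Unmanaged External table, plus a rule pulling one attribute across.
-- All table and column names below are hypothetical.
SELECT
  o.order_id,
  o.customer_id,
  ext.credit_limit        -- value the new rule would surface on the managed source
FROM managed_catalog.dataforge.orders AS o        -- DataForge-managed source data
LEFT JOIN main.sales.customers_external AS ext    -- Unmanaged External Delta table
  ON o.customer_id = ext.customer_id;             -- same condition the relation expresses
```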
Example 2: Connecting to Meta Input table (Federated Lakehouse connection to retrieve file names per input)
Create the unmanaged external source and identify the meta.input table name. In this example, the meta schema and tables within it are actually tables from Postgres that are surfaced through a Databricks connection. If not already completed, create the File connection type source to be used for uploading files.
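For reference, a federated Postgres schema like this is typically surfaced in Databricks through a Lakehouse Federation connection and foreign catalog. A minimal sketch, with placeholder host, secret scope, and catalog names (your environment will differ):

```sql
-- Placeholder names throughout; substitute your own host, secrets, and catalog.
CREATE CONNECTION IF NOT EXISTS dataforge_postgres
TYPE postgresql
OPTIONS (
  host 'postgres.example.com',
  port '5432',
  user secret('df_scope', 'pg_user'),
  password secret('df_scope', 'pg_password')
);

-- Surfaces the Postgres database (including the meta schema) as a catalog
-- that an Unmanaged External source can point to, e.g. pg_meta.meta.input.
CREATE FOREIGN CATALOG IF NOT EXISTS pg_meta
USING CONNECTION dataforge_postgres
OPTIONS (database 'dataforge');
```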
Create a relation between the Unmanaged Meta Input Source and DataForge Source(s) to use in a downstream rule. In this case, when typing the side of the relation expression that references s_input_id, DataForge will not auto-complete the field because it is a system field behind the scenes. However, if you type s_input_id manually, the parser will interpret the field correctly and allow you to save.
Create rule(s) in DataForge source(s) to reference the source_file_name attribute from the unmanaged meta input source using the new relation.
Reprocess any existing inputs by using the Recalculate All or Recalculate Changed commands. When complete, view the rule results in Data View or in Databricks notebooks/queries, as sketched below. Notice that the file_name column is now populated with the file name of the input that was run.
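The sketch below is a rough Databricks SQL equivalent of the relation and rule, useful for checking the result in a notebook. The hub table name, the meta catalog name, and the input_id key column on meta.input are hypothetical placeholders; only s_input_id, source_file_name, and file_name come from the walkthrough above:

```sql
-- Rough SQL equivalent of joining the managed source to meta.input on s_input_id
-- and surfacing source_file_name as the file_name rule. Catalog/table names and
-- the meta.input key column (input_id) are placeholders.
SELECT
  src.s_input_id,
  mi.source_file_name AS file_name   -- value the new rule exposes per row
FROM managed_catalog.dataforge.uploaded_files AS src   -- DataForge-managed file source
LEFT JOIN pg_meta.meta.input AS mi                      -- Unmanaged External meta.input source
  ON src.s_input_id = mi.input_id;                      -- same condition as the relation
```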