SQL Server Streaming

DataForge supports CDC-based streaming ingestion for SQL Server tables, allowing row-level changes to land in your lakehouse within seconds of occurring. Three areas require configuration before streaming sources are active: SQL Server itself, a Databricks Serverless SQL Warehouse for stream processing, and DataForge.

SQL Server Configuration

CDC must be enabled at both the database and table level before DataForge can stream from a table. Follow Microsoft instructions to enable CDC: https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/enable-and-disable-change-data-capture-sql-server?view=sql-server-ver17

Databricks Configuration

Streaming ingestion requires a dedicated Databricks SQL Warehouse. We recommend starting with a serverless size 2XS instance. Set the Max Scale on the warehouse based on the number of tables you plan to stream and your target latency. A useful rule of thumb:

Max serverless instance scale = MAX( {# of tables} / {target latency in seconds},  max stream threads / 8 )

If Max Scale is set too low, actual latency per replicated source will exceed your target. Monitor utilization and adjust as needed.

DataForge Configuration

Once your SQL Server and SQL Warehouse are ready, configure the agent and sources in DataForge:

  • Configure the agent connection to the SQL Server database by populating the SQL Warehouse http path in the Agent parameters.

You can create stream sources two ways. The faster option is from the connection metadata tab: select your tables, click Create Sources, and choose Stream as the source type. Alternatively, create a new source manually and set Processing Type to Stream. DataForge will automatically update the refresh type and key columns based on the index key used by the SQL Server CDC configuration. Enter the full table name in schema.table format in the source query field. Once the source is saved, click Start Stream to start streaming.

Initial Load

Stream sources automatically perform a full initial load the first time the ingestion stream starts. For very large tables (10M+ rows), the initial load can take a long time. In those cases, we recommend performing the initial load in batch mode first and then switching to streaming:

  1. Create a batch source for the table.
  2. Enable the Save LSN ingestion parameter. This records the last LSN value so the source can switch to incremental streaming after the initial load completes.
  3. Run the first ingestion in batch mode.
  4. Change the source Processing Type to Stream.
  5. In the source query field, remove the SELECT ... portion, leaving only the schema and table name in [schema].[table] format.
  6. Save and click Start Stream.
  7. Check the process logs to confirm the stream started in incremental mode.

Monitoring

Open the Agent page and navigate to the Monitoring tab to observe real-time CPU and RAM utilization. Adjust your Max Batch Threads and Max Stream Threads settings to keep utilization below 70-80%. If you see latency creeping above your target, revisit the SQL Warehouse Max Scale setting. Any change to the Agent parameters requires you to restart the agent before the changes are in effect.

Updated

Was this article helpful?

0 out of 0 found this helpful