DataForge Cloud 10.1 is here, bringing streaming CDC ingestion for SQL Server, native BigQuery connectivity, bulk source parameter management, and significant agent performance improvements!
⚠ Action Required Before Upgrading
Java 21 is now a hard requirement for DataForge Agents in version 10.1. If your agent server currently has only Java 8 installed, you must install Java 21 on that server and re-install the DataForge Agent, either before or after upgrading, to avoid disruptions. You can run multiple Java versions side by side on the same server. As long as Java 21 is installed, the agent will resume normal operation after it is re-installed. Servers that already have Java 21 installed require no additional steps.
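To check whether a server already meets the Java requirement, something like the following sketch may help. It parses the text printed by `java -version`; the helper names are illustrative and not part of DataForge. Because several Java versions can coexist on one server, the `java` binary found on the PATH is not necessarily the one the agent uses, so treat the result as a hint only.

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    major = int(match.group(1))
    # Java 8 and earlier use the legacy scheme, e.g. "1.8.0_391".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major(java_binary: str = "java") -> int:
    """Run the given java binary and return its major version."""
    # `java -version` writes its report to stderr rather than stdout.
    result = subprocess.run(
        [java_binary, "-version"], capture_output=True, text=True, check=True
    )
    return parse_java_major(result.stderr or result.stdout)
```

Calling `installed_java_major()` on the agent server and comparing the result with 21 gives a quick first check before re-installing the agent.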
Table of Contents
- SQL Server Streaming Table Ingestion
- BigQuery Source and Output
- Bulk Update Source Parameters
- Optimized Agent Processing and Monitoring
- User Experience Improvements
SQL Server Streaming Table Ingestion
DataForge now supports CDC-based streaming ingestion for SQL Server tables, allowing changes to land in your lakehouse within seconds of occurring rather than waiting for a scheduled batch run. To configure a streaming source, set Ingestion Type to Table, Refresh Type to Key, and Source Type to Stream. You supply a qualified schema.table name instead of a source_query, and a new trigger interval parameter (default 1 second) controls how frequently the agent checks for new changes.
CDC must be enabled on both the database and the table in SQL Server. See Microsoft's instructions: https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/enable-and-disable-change-data-capture-sql-server?view=sql-server-ver17
Before streaming begins, the agent validates that CDC is enabled on the SQL Server table. If CDC is not configured, ingestion fails with an error linking to Microsoft's CDC setup documentation. On first run, the agent performs a full table ingestion to establish a baseline LSN, which is stored in meta.input.cdc_status. On every subsequent interval, only net changes are fetched and written to the source hub table in Databricks. If an LSN out-of-bounds error occurs, the agent automatically triggers a full re-ingestion to resync. Expected latency for incremental changes is the trigger interval plus a few seconds for processing.
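The control flow described above (full load to set a baseline LSN, net changes on each interval, full re-ingestion on an out-of-bounds LSN) can be pictured with a small in-memory simulation. This is only a sketch under stated assumptions: `FakeCdcSource`, `sync_once`, and the `status` dictionary are hypothetical stand-ins for SQL Server CDC, the agent, and `meta.input.cdc_status`, and integers stand in for real LSN values.

```python
from dataclasses import dataclass, field


class LsnOutOfBounds(Exception):
    """The stored LSN is older than the oldest change still retained."""


@dataclass
class FakeCdcSource:
    """Hypothetical stand-in for a CDC-enabled SQL Server table."""
    rows: dict = field(default_factory=dict)     # key -> current value
    changes: list = field(default_factory=list)  # (lsn, key, value); None = delete
    min_lsn: int = 1                             # oldest LSN still retained
    max_lsn: int = 0

    def apply(self, key, value):
        self.max_lsn += 1
        if value is None:
            self.rows.pop(key, None)
        else:
            self.rows[key] = value
        self.changes.append((self.max_lsn, key, value))

    def net_changes(self, after_lsn):
        if after_lsn + 1 < self.min_lsn:
            raise LsnOutOfBounds(after_lsn)
        net = {}
        for lsn, key, value in self.changes:
            if lsn > after_lsn:
                net[key] = value  # the latest change per key wins
        return net


def sync_once(source, hub, status):
    """One trigger interval: incremental if possible, otherwise a full load."""
    last_lsn = status.get("lsn")
    if last_lsn is not None:
        try:
            for key, value in source.net_changes(last_lsn).items():
                if value is None:
                    hub.pop(key, None)
                else:
                    hub[key] = value
            status["lsn"] = source.max_lsn
            return "incremental"
        except LsnOutOfBounds:
            pass  # fall through and resync with a full load
    hub.clear()
    hub.update(source.rows)
    status["lsn"] = source.max_lsn
    return "full"
```

In this model the first call to `sync_once` always returns `"full"`, later calls return `"incremental"`, and raising `min_lsn` past the stored LSN (as a CDC retention cleanup would) forces another `"full"` run.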
For more information, visit the SQL Server Streaming documentation.
BigQuery Source and Output
DataForge 10.1 adds native BigQuery support for both source ingestion and output on Databricks-based workspaces. The new BigQuery table connection type requires three parameters for authentication and targeting. Connection tests are available from the connection settings page before attaching the connection to a source or output.
A few prerequisites apply. Only DataForge Sources in Unity Catalog format are supported; Hive is not. Databricks compute running a BigQuery output must have {"spark.serializer":"org.apache.spark.serializer.JavaSerializer"} in its Spark Conf. Schema evolution (ALTER) is not supported for output columns with complex types after the initial execution. Any parameter supported by the Spark BigQuery connector can be passed through the connection's bigquery_parameters field as JSON key/value pairs, giving you access to the full connector configuration without a custom notebook.
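As a rough illustration of these two settings, the sketch below checks a compute configuration for the required serializer entry and builds a JSON string for bigquery_parameters. The helper name is hypothetical, and the exact shape DataForge expects in that field is an assumption. `writeMethod` and `viewsEnabled` are options of the Spark BigQuery connector, with placeholder values here; confirm names and values against the connector documentation.

```python
import json

# Spark Conf entry required on compute that runs a BigQuery output.
REQUIRED_SPARK_CONF = {
    "spark.serializer": "org.apache.spark.serializer.JavaSerializer"
}


def missing_spark_conf(cluster_conf: dict) -> dict:
    """Return the required entries that are absent or set to another value."""
    return {
        key: value
        for key, value in REQUIRED_SPARK_CONF.items()
        if cluster_conf.get(key) != value
    }


# Example pass-through options for the connection's bigquery_parameters
# field (assumed here to accept a flat JSON object of option names).
bigquery_parameters = json.dumps(
    {"writeMethod": "direct", "viewsEnabled": "true"}
)
```

An empty dictionary from `missing_spark_conf` means the compute already carries the required serializer setting.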
Bulk Update Source Parameters
You can now select multiple sources from the Sources page and apply parameter changes to all of them at once. A Select All / Unselect All checkbox in the header row respects your current filter: it selects every source matching the applied filters, whether the source is currently displayed or still hidden behind the "Load more" button. A count label shows how many sources are selected. Three action buttons replace the previous Select Action dropdown: Pull Now, Recalculate, and Bulk Update. All bulk actions require confirmation before proceeding, with a dialog box showing exactly how many operations are about to run.
Bulk Update lets you change one parameter at a time across all selected sources. The input is constrained to valid values for that parameter. The following parameters are supported:
| Location | Parameter | UI Control |
| --- | --- | --- |
| Source Settings | Process Config | Config drop-down |
| Source Settings | Cleanup Config | Config drop-down |
| Source Settings | Schedule | Config drop-down |
| Source Settings > Parameters > Ingestion | Disable Initiation | True/False checkbox |
| Source Settings > Parameters > Performance and Cost | Max Retries | Integer text box |
| Source Settings > Parameters > Performance and Cost | Data Profiling | True/False checkbox |
For more information on source settings and parameters, visit the Sources Settings documentation.
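The "one parameter at a time, constrained to valid values" behavior can be modeled roughly as follows. This is a hypothetical sketch, not DataForge's implementation: the function names, the source IDs, and the rule that Max Retries must be non-negative are assumptions, and the drop-down options are supplied by the caller because the release notes do not list them.

```python
# Control type for each parameter listed in the table above.
SUPPORTED_PARAMETERS = {
    "Process Config": "dropdown",
    "Cleanup Config": "dropdown",
    "Schedule": "dropdown",
    "Disable Initiation": "bool",
    "Max Retries": "int",
    "Data Profiling": "bool",
}


def is_valid_value(parameter, value, dropdown_options=()):
    """Check a value against the control type of a supported parameter."""
    kind = SUPPORTED_PARAMETERS.get(parameter)
    if kind is None:
        raise KeyError(f"{parameter!r} is not bulk-updatable")
    if kind == "bool":
        return isinstance(value, bool)
    if kind == "int":
        # bool is a subclass of int in Python, so exclude it explicitly.
        # Requiring a non-negative value is an assumption.
        return isinstance(value, int) and not isinstance(value, bool) and value >= 0
    return value in dropdown_options


def plan_bulk_update(source_ids, parameter, value, dropdown_options=()):
    """Return one (source_id, parameter, value) operation per selected source."""
    if not is_valid_value(parameter, value, dropdown_options):
        raise ValueError(f"invalid value {value!r} for {parameter!r}")
    return [(source_id, parameter, value) for source_id in source_ids]
```

The length of the list returned by `plan_bulk_update` corresponds to the operation count shown in the confirmation dialog.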
Optimized Agent Processing and Monitoring
The MaxProcesses agent parameter has been replaced with two separate parameters:
- Max Batch Threads controls concurrent batch (file and table) ingestions.
- Max Stream Threads is a new parameter that controls concurrent stream ingestions.
Separating these prevents a high volume of streaming sources from starving batch ingestions of threads, and vice versa. Internally, batch processing has been optimized to remove redundant heartbeat communications, file watcher processes have been refactored to use Futures, and scheduled file sources now follow the same scheduling path as all other scheduled source types.
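The effect of separate thread limits can be illustrated with two independent thread pools. The sketch below is an analogy in Python, not the agent's actual code: the pool sizes and job functions are made up, and the stream jobs simply block to imitate long-running streams. Even with the stream pool saturated, batch jobs still complete.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_demo(max_batch_threads=2, max_stream_threads=1):
    batch_pool = ThreadPoolExecutor(
        max_workers=max_batch_threads, thread_name_prefix="batch"
    )
    stream_pool = ThreadPoolExecutor(
        max_workers=max_stream_threads, thread_name_prefix="stream"
    )
    release = threading.Event()

    def stream_job(name):
        release.wait(timeout=5)  # imitates a long-running stream
        return name

    def batch_job(name):
        return name

    # Queue more stream jobs than the stream pool has threads.
    stream_futures = [
        stream_pool.submit(stream_job, f"stream-{i}") for i in range(4)
    ]
    batch_futures = [
        batch_pool.submit(batch_job, f"batch-{i}") for i in range(3)
    ]

    # Batch work finishes while every stream job is still blocked or queued.
    batch_results = [f.result(timeout=2) for f in batch_futures]
    streams_still_running = not any(f.done() for f in stream_futures)

    release.set()
    stream_results = [f.result(timeout=10) for f in stream_futures]
    batch_pool.shutdown()
    stream_pool.shutdown()
    return batch_results, streams_still_running, stream_results
```

With a single shared pool of the same total size, the blocked stream jobs could occupy every worker and the batch jobs would wait behind them.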
When disable ingestion is set on an agent, the agent now pauses processing rather than restarting the loop. Processing resumes automatically when ingestion is re-enabled. A new Monitoring tab on the Agent page displays real-time metrics including file watcher latency and thread utilization, making it easier to evaluate whether your concurrency settings are appropriately sized for your workload.
For more information, visit the Agent Configuration documentation.
User Experience Improvements
DataForge 10.1 includes several improvements to reduce friction and improve day-to-day performance.
Output mapping performance
Saving a single column mapping on large outputs is now near-instant. Previously, each save triggered a full screen refresh regardless of output size, which could take up to 8 seconds on outputs with hundreds of channels and thousands of mappings. Mapping saves are now handled by a dedicated API call, and the full refresh only occurs when needed.
Default schedule separation
The default Cleanup schedule has been changed from 12:00 AM UTC to 10:00 PM UTC. Previously, both the Cleanup and deployment schedules defaulted to 12:00 AM UTC, which caused conflicts in environments where both ran at the same time. The deployment default schedule remains at 12:00 AM UTC.
Git export commit batching
Project exports to GitHub now batch all file changes into a single commit per project instead of one commit per file. This resolves rate limit timeouts that occurred on large projects and makes commit history significantly cleaner.