DataForge Cloud version 8.1 is here, providing streaming support, sub-sources for complex data, and an improved lineage user experience!
Table of contents
- Stream processing
- Sub-source rules for complex nested arrays
- JSON file support
- Improved lineage experience
- Data type handling for rules
Stream processing
Stream processing is a new DataForge feature that enables users to combine streaming and batch data seamlessly. It provides real-time data enrichment and processing, allowing developers to build dynamic, scalable data pipelines without managing streaming infrastructure themselves.
Stream processing capabilities include:
Kafka and Streaming Delta Table Integration: Seamlessly ingest data from Kafka topics and streaming Delta tables into the DataForge platform.
Batch Data Enrichment: Enrich real-time data streams with historical batch data already residing in the DataForge managed lakehouse, allowing comprehensive, up-to-the-moment insights.
Downstream Data Processing: Write enriched data back to Kafka topics and Delta tables for real-time consumption by downstream systems.
For more information and a demo of the stream processing feature, visit the DataForge stream processing blog post.
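DataForge manages this wiring for you, but as a rough mental model the pattern maps to Spark Structured Streaming. The sketch below is a hypothetical hand-rolled equivalent, not DataForge's API; the topic, broker, schema, paths, and column names are all invented for illustration.

```python
# Hypothetical Spark Structured Streaming sketch of the pattern DataForge manages:
# ingest a Kafka topic, enrich it with batch lakehouse data, and write the
# result back out as a Delta table. All names and paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("stream-enrichment-sketch").getOrCreate()

event_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
])

# Streaming input: JSON events from a Kafka topic (hypothetical topic/broker)
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Batch data already sitting in the managed lakehouse (hypothetical Delta path)
customers = spark.read.format("delta").load("/lakehouse/customers")

# Enrich the stream with historical attributes, then write to a Delta table
# for real-time consumption by downstream systems
query = (
    events.join(customers, "customer_id", "left")
    .writeStream.format("delta")
    .option("checkpointLocation", "/checkpoints/orders_enriched")
    .outputMode("append")
    .start("/lakehouse/orders_enriched")
)
```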
Sub-source rules for complex nested arrays
Sub-sources are a new type of rule, providing ultimate flexibility and ease-of-use when working with nested complex arrays (Array of type Struct) in datasets. Gone are the days of needing to manage multiple tables and explode data to normalize it and make it workable.
Developers create a sub-source by adding a rule that references the nested complex array column. This "virtual" sub-source then behaves like any other source: developers can create relations and rules against it without manipulating the data structure beforehand. Because an implicit relation to the parent source is pre-defined, rules and aggregations that span the sub-source and the parent source are straightforward. When rules are created in a sub-source, the results are written back inside the nested complex array column of the parent source.
For more information and a demo of the sub-source feature, visit the DataForge sub-source blog post.
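For intuition only, the snippet below shows in plain PySpark what a sub-source rule conceptually produces: a value computed per element of a nested Array of Struct column and written back inside that same array, with no exploding or normalizing. The schema, column, and field names are invented for the example; DataForge generates the equivalent logic from the rule definition.

```python
# Conceptual PySpark sketch: compute a per-element value ("line_total") inside a
# nested Array<Struct> column without exploding it. This mimics what a
# sub-source rule writes back into the parent source column.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("sub-source-sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, [{"sku": "A", "qty": 2, "price": 10.0},
          {"sku": "B", "qty": 1, "price": 5.0}])],
    "order_id INT, line_items ARRAY<STRUCT<sku: STRING, qty: INT, price: DOUBLE>>",
)

# The "rule": line_total = qty * price, added to each struct element in place
with_rule = orders.withColumn(
    "line_items",
    expr(
        "transform(line_items, x -> named_struct("
        "'sku', x.sku, 'qty', x.qty, 'price', x.price, "
        "'line_total', x.qty * x.price))"
    ),
)
with_rule.show(truncate=False)
```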
JSON file support
DataForge now natively supports JSON file ingestion and output. Sources with JSON files use the Spark parser to handle the file schema. Developers can simplify any sources currently set up with a custom ingestion or custom parse notebook for JSON files by switching to this native file handling option. For more information on the options and settings available for JSON files, visit the Spark JSON files docs.
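As a point of reference, the native option exposes standard Spark JSON reader settings, so a custom parse notebook doing something like the following can typically be retired in favor of the built-in source configuration. The path and option values here are purely illustrative.

```python
# Illustrative example of the Spark JSON parsing that the native JSON source
# now handles for you; the path and option values are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-ingest-sketch").getOrCreate()

df = (
    spark.read.format("json")
    .option("multiLine", "true")           # records that span multiple lines
    .option("dropFieldIfAllNull", "true")  # drop columns that are always null
    .load("/landing/events/*.json")        # hypothetical landing location
)
df.printSchema()
```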
Improved lineage experience
Lineage graphs received an overhaul with new icons, colors, and a matching graph legend, making it much easier for developers to understand the lineage flow and which icons represent which types of objects.
In addition, developers can now open objects directly from a lineage graph by right-clicking the icon and using the open option. The following objects can be opened directly from lineage graphs: source and output containers, raw attributes, rules, and output columns.
Data type handling for rules
DataForge now checks data types when developers change the data type of an existing rule or rule template. DataForge validates all downstream elements to ensure that rule and relation expressions remain valid and that output column data types are compatible. Any problematic elements are displayed, and the data type change cannot be saved until the issues are fixed. After successful validation, DataForge recursively updates data types for all downstream rules impacted by the change.
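To make the behavior concrete, here is a small, purely illustrative sketch of the validate-then-propagate idea; it is not DataForge's internal implementation, and the names and the simplified propagation step are assumptions made for the example.

```python
# Illustrative sketch (not DataForge internals) of validate-then-propagate:
# reject a data type change if any downstream element becomes invalid,
# otherwise cascade the update through every impacted downstream rule.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Rule:
    name: str
    data_type: str
    downstream: List[str] = field(default_factory=list)  # dependent rule names

def change_rule_type(
    rules: Dict[str, Rule],
    name: str,
    new_type: str,
    is_compatible: Callable[[str, str], bool],
) -> List[str]:
    """Return incompatible downstream rules; apply the change only if none."""
    problems: List[str] = []
    visited = set()
    stack = list(rules[name].downstream)
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        if not is_compatible(new_type, rules[current].data_type):
            problems.append(current)
        stack.extend(rules[current].downstream)
    if not problems:
        rules[name].data_type = new_type
        # Simplified cascade; a real system would re-derive each downstream type
        for dep in visited:
            rules[dep].data_type = new_type
    return problems
```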
Full Changelog
- Lineage Details View
- Add stream Kafka source
  - Streaming processing type can be used with the Event connection type in Source settings
- Add Kafka output for stream sources
  - Stream sources can be added as channels to an Event Kafka output
- Prevent mapping of stream sources to file outputs and non-Delta tables
- Validate virtual output functionality with mapped stream source
  - Stream sources can be added as channels to a Virtual output
- Switch to using Instant for extract datetime in sdk/sparky/agent
  - Libraries for capturing UTC time during processing have been updated
- Subsources
  - https://www.dataforgelabs.com/blog/sub-sources
- Fix Check Databricks Version message on cluster configurations
  - Cluster configs flagged for CHECK DATABRICKS VERSION no longer have their names continuously appended to
- JSON File Source
  - Added the ability to ingest and parse JSON files; all Spark JSON library configuration parameters are enabled in the DataForge UI
- JSON File Output
  - Added support for complex types in JSON output columns
- Clean unused S3 buckets and remove them from TF
  - The logging bucket has been removed and replaced with the logs bucket; the db bucket has been removed
- Move enrich, refresh, and output query generation from preceding process
  - Query generation errors now fail the enrichment or recalculation process (rather than the preceding process, as before 8.1)
- Handle data type changes for rules
  - When a user changes the data type of an existing rule or template, the system validates all downstream elements to ensure that rule and relation expressions remain valid and that output column data types are compatible. All problem elements are displayed to the user, and the rule cannot be updated until the issues are fixed. After successful validation, data types are recursively updated for all downstream rules impacted by the change
- Enable Sub-Source Rule Templates
  - Enabled sub-source rule templates
  - Added cloning of rules contained within the sub-source
- Sub-Source Output
  - When a sub-source is mapped to an output channel, the output data is written at the grain of the sub-source
  - All sub-source attributes are available for mappings, as well as any attributes in sources related via M:1 relation chains
- Auto-map columns when new output channel is added
  - When a new output channel is added to an output with existing columns, the system automatically maps source attributes to output columns
- Extend output removeOldTempTables to Delta output
  - Delta output removes any leftover temp tables when it runs
- Create Stream Output process for reading from Delta Lake and writing to Event or Delta table
  - Streaming sources can read from a Delta Lake table for stream ingestion
- Stream restart on failure and upgrades
  - Streams restart after a deployment or a failure, as long as ingestion isn't disabled
- Add table for Stream and only support Delta tables
  - Stream sources can read from a Delta Lake table, but the table must be saved as Delta
- Log current Windows user name on agent startup
  - The Windows user is logged in the Agent UI logs on startup
- Update lineage legend for SS subsource
- Add rule limitations on stream sources
  - Due to a Spark streaming query limitation, aggregates into related sources are not currently supported
  - Keep current and unique rules are also not supported for stream sources