The DataForge Cloud 8.0 release is here, providing support for complex data types, Kafka integrations, and more!
Table of Contents
- Handle Complex Data Types with ease
- Extended Schema Evolution
- Integrate Event Data using Kafka into your workspace
- Streamlined Import/Export format 2.1
- Import from dataforge-core
- Project management puts more control in your hands
- Easily duplicate Workspace level configurations
- Enhanced Processing details user experience
- Benefit from upgraded Databricks Runtime
- Postgres upgrade to 16.1
- Google Cloud Platform (GCP) support
- Talos AI Assistant
Handle Complex Data Types with ease
DataForge added support for array and struct data types. These types are common in semi-structured datasets (JSON), API results, and streaming data.
You can now use these complex types along with scalar types in rules, relations, templates, output mappings, and filters. The expression editor IntelliSense helps you navigate nested complex types by listing the child keys at each nested level. Structs and arrays display a metadata icon with further schema definitions. You can also use all available Spark built-in functions in rules.
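By way of illustration, the plain PySpark sketch below shows the same kind of nested navigation and built-in functions; the schema and column names are hypothetical, and in DataForge the equivalent logic would be written as rule expressions rather than Python code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical semi-structured input: a struct column and an array-of-struct column
df = spark.createDataFrame(
    [(("Jane", ("Austin", "78701")), [(2, 9.99)])],
    "customer struct<name:string, address:struct<city:string, zip:string>>, "
    "items array<struct<qty:int, price:double>>",
)

df.select(
    F.col("customer.address.city").alias("customer_city"),  # navigate nested struct keys
    F.size("items").alias("item_count"),                     # Spark built-in function on an array
    F.element_at("items", 1).alias("first_item"),            # returns the first struct in the array
).show(truncate=False)
```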
In output mappings, you have the option to expand mapped struct attributes. This creates new columns named <original struct column name>_<key> and maps the corresponding struct keys into them.
Expand option on output column mapping
Expanded column mapping to multiple columns
You can also do this for all structs in the channel by clicking Expand Structs in the channel options menu.
Output channel struct options
Clicking this option repeatedly expands nested structs one additional level each time.
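As a rough sketch of what expansion produces, the PySpark snippet below (with a hypothetical address struct) builds one column per struct key using the same <original struct column name>_<key> naming:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical struct column to expand
df = spark.createDataFrame(
    [(("1 Main St", "Austin", "78701"),)],
    "address struct<street:string, city:string, zip:string>",
)

# One new column per struct key, named <original struct column name>_<key>
keys = df.schema["address"].dataType.fieldNames()
expanded = df.select(
    *[F.col(f"address.{k}").alias(f"address_{k}") for k in keys]
)
expanded.printSchema()  # address_street, address_city, address_zip
```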
For working with complex nested arrays, DataForge recommends using Databricks higher-order functions. DataForge plans to introduce a new sub-source feature in a future release to fully integrate nested array of struct operations into the platform at the columnar level, which will further streamline the experience of working with complex types.
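For example, higher-order functions such as transform() and aggregate() operate on array elements without exploding them into rows; the snippet below uses plain Spark SQL expressions with hypothetical column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical array-of-struct column
df = spark.createDataFrame(
    [([(2, 9.99), (1, 24.50)],)],
    "items array<struct<qty:int, price:double>>",
)

df.select(
    # transform() maps a lambda over each array element, keeping the array shape
    F.expr("transform(items, x -> x.qty * x.price)").alias("line_totals"),
    # aggregate() folds the array down to a single scalar value
    F.expr("aggregate(items, 0D, (acc, x) -> acc + x.qty * x.price)").alias("order_total"),
).show(truncate=False)
```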
Extended Schema Evolution
The DataForge Cloud 7.x source parameters lock_schema_new_columns, fail_on_missing_columns, and lock_schema_datatype_changes have been consolidated and extended into a new Schema Evolution parameter with the following settings:
| Option | Description |
| --- | --- |
| Lock | Any schema change generates an ingestion or parsing error |
| Add | Allows adding new columns |
| Remove | Allows missing (removed) columns in inputs; missing column values are substituted with nulls |
| Upcast | Allows retaining the same column alias when a column's data type changes to one that can be safely converted to the existing type (upcast). Example: in input 1, column_a is a string; in input 2, column_a is an int. With Upcast enabled, the int value in input 2 is converted to string, retaining the original column alias column_a in the hub table. This avoids creating new column versions with _2, _3, ... suffixes when compatible data type changes happen in the source data. |
| Clone | Allows creation of a new column aliased <original column>_2 when a column's data type changes and is not compatible with the existing column. |
The Schema Evolution parameter is in the source parameters Parsing section. It contains a drop-down with 7 preselected combinations of the above settings. The default setting is Add, Remove, Upcast.
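Conceptually, the Upcast behavior resembles the plain PySpark sketch below (hypothetical inputs, not DataForge internals): the int value from the later input is cast to the existing string type, so both inputs keep the single alias column_a.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

input1 = spark.createDataFrame([("42",)], "column_a string")  # input 1: column_a arrives as string
input2 = spark.createDataFrame([(43,)], "column_a int")       # input 2: column_a arrives as int

# Safe upcast: int converts losslessly to string, so no column_a_2 version is needed
combined = input1.unionByName(input2.select(F.col("column_a").cast("string")))
combined.printSchema()  # column_a: string
```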
Legacy parameter values on existing sources will be migrated into the corresponding Schema Evolution value according to this table:
| Parameter | Value | Value | Value | Value |
| --- | --- | --- | --- | --- |
| lock_schema_new_columns | X | | | |
| fail_on_missing_columns | X | X | | |
| lock_schema_datatype_changes | X | X | X | |
| Schema Evolution | Lock | Add | Add, Remove | Add, Remove, Clone |
For more information on schema evolution and data type compatibility, visit the Source Settings documentation in the Advanced Parameters section.
Integrate Event Data using Kafka into your workspace
Event integration using Kafka is now supported for Sparky Ingestion (non-agent, cluster-based) and Output, enabling batch reads from and writes to any Kafka topic. Avro and JSON schemas and Schema Registry connections are supported using parameters defined in the Connection.
Currently, only batch reads and writes are supported, with full streaming integration coming in a future release. Starting and ending offsets are supported on Ingestion, with the ability to set the starting offsets to "deltas" to ensure each ingestion is always pulling the latest batch from the Kafka topic. Additional parameters exist within the Source ingestion parameters for further refining the data flow.
For output, key/value columns are supported; if the value needs to be a complex type, build the value column in a rule.
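For reference, a batch Kafka read and write in plain Spark has roughly the shape shown below; the broker, topic names, and JSON value construction are hypothetical, and the options DataForge actually exposes are configured through the Connection and Source/Output parameters rather than in code.

```python
from pyspark.sql import SparkSession, functions as F

# Requires the Spark Kafka connector (included in the Databricks runtime)
spark = SparkSession.builder.getOrCreate()

# Batch read: startingOffsets/endingOffsets bound each pull ("earliest"/"latest" or per-partition JSON)
batch = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
    .option("subscribe", "orders")                        # hypothetical topic
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load()
)
parsed = batch.select(F.col("key").cast("string"), F.col("value").cast("string"))

# Batch write: the Kafka sink expects key/value columns; a complex value is assembled first
# (in DataForge, that construction would live in a rule, as noted above)
(
    parsed.withColumn("value", F.to_json(F.struct("key", "value")))
    .select("key", "value")
    .write.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("topic", "orders-out")
    .save()
)
```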
Streamlined Import/Export format 2.1
New export format 2.1 removes all attributes containing default values from YAML files. This results in a significant 2-3x reduction in the size of export YAML files and less code to manage and source control.
Additionally, relations have been moved to a single standalone relations.yaml file in the root folder of the import zip. The relations format has been simplified as in this example:
- name: "[SalesOrderDetail]-sod-soh-[SalesOrderHeader]"
expression: "[This].salesorderid = [Related].salesorderid"
cardinality: "M-1"
A centralized relations file is easier to maintain and avoids duplication.
Export files created in DataForge Cloud 7.x with format 2.0 are compatible and can be imported into DataForge Cloud 8.0, but the new 2.1 format cannot be imported into DataForge Cloud 7.x workspaces.
Format version is defined in the meta.yaml file in the root folder of the import zip.
Import from dataforge-core
DataForge added support for the new core1.0 import format to enable importing projects created in DataForge Core.
The core1.0 format is supported alongside the standard DataForge Cloud format 2.1.
Project Management puts more control in your hands
Ensure Projects that are live in production are never accidentally modified by using the new Lock Project option. Easily enable or disable the project lock by updating the flag on the project settings and saving the change.
When a project is locked, only changes made via project import are allowed; any attempt to manually edit configurations or settings in the project is prevented.
Locked projects are indicated on any page of the Workspace by a lock icon displayed to the right of the project name drop-down.
Source and Output Settings pages also indicate the lock by showing a "Project is locked" message.
Easily duplicate Workspace level configurations
Need to tweak a Workspace level configuration for something specific? Gone are the days of needing to manually recreate existing Workspace level configurations when you need to keep the original settings applied to some objects. Now you can easily duplicate Workspace level configurations within the Workspace, including Process Configs, Cluster Configs, and Connections.
Use the Duplicate button on any configuration and DataForge will open a new tab with the same settings and the name "<configuration name> COPY". Update the name, make any needed settings changes, and save.
Enhanced Processing details user experience
DataForge now supports copying and sending direct hyperlinks to specific processes. Users no longer need to send a link and then describe which process to look at.
Click the Process ID hyperlink to highlight the process id and copy the new URL from the browser window. Alternatively, right-click the Process ID hyperlink and copy the link address. Available on all Processing screens (main, source, output).
Easily filter many processes at a time to narrow down your search results by filtering on start and end hours. The start and end times are based on your browser's local time setting. Change the hours and hit the Enter key or click the Refresh button to easily adjust your view. Available on all Processing screens (main, source, output).
Search for a specific process by entering the Process ID in the Process ID filter to narrow it down to a specific process. This feature was previously only available on select pages but is now available on all Processing pages.
Benefit from upgraded Databricks Runtime
The default Databricks version for cluster configurations is updated to DBR 14.3 LTS. For more information, visit Databricks 14.3 LTS documentation.
Postgres upgrade to version 16.1
DataForge is upgrading the AWS and Azure Postgres versions of the metadata database from version 14 to 16.1. This is an in-place upgrade that will take place during the deployment and will retain all settings of the current server.
PostgreSQL version 16 contains several improvements, described in the PostgreSQL 16 release notes.
PostgreSQL version 15 contains several improvements, described in the PostgreSQL 15 release notes.
Google Cloud Platform (GCP) Support
DataForge now offers support for Google Cloud. In addition to AWS and Azure, DataForge Cloud now seamlessly integrates with Google's powerful cloud services and resources.
Talos AI Assistant
DataForge is excited to introduce the Talos virtual assistant. Here is a list of things Talos can help with:
- Find Connections by name
- List & search schemas and tables within metadata for the connection
- Create source and pull data from table(s)
- Find a source and open it
- Find an attribute (raw, rule, or output column) and open lineage for it
It is available in preview for DataForge Cloud Professional and Enterprise (AWS only) customers.
Full Changelog
- Add Process ID filter and Hour filter (between a and b hours) to Main Processing page
- Added process_id filter on all process pages (global, source and output)
- Added clickable hyperlink on process_id column: this makes it easy to share links pointing to a specific process_id
- Create sparky file watcher
- Restrict use of Force Case Insensitive parameter once there is any data/raw schema loaded
- Prompt user to Reset All Parsing when they switch parser and save source settings
- Moved all source pre-save checks to the backend.
- Improved UX: more readable errors, support for multiple users updating the same source concurrently, and fixed restoring to the currently saved settings on error
- Lock Projects so people cannot accidentally make changes to a production project
- Change single file default to false for Parquet file output
- Attach sdk library to data viewer cluster, keep it up to date in deployment container
- The SDK library will be installed/updated on the Data Viewer Cluster during deployment
- Directory-style file move now throws an error if the directory is empty
- Add support for struct and array types
- Added support for struct and array types. See blog for details
- Add Kafka support
- See blog for details
- Upgrade Postgres to latest version 16.1
- See blog for details
- Upgrade to Databricks 14.3 LTS
- Clusters are auto-updated to use Databricks 14.3 LTS runtime
- Delta Lake channel overwrite no longer triggers outputs on sources that have not passed refresh or have 0 effective records
- Deprecate numeric type
- Removed redundant numeric type from attribute types. Migrated existing rule Cast Attribute Types to decimal
- Add "EXPAND" option to output column mapping for STRUCT
- Expand button on output column mapping popup expands the struct one level down, creating and auto-mapping new columns named <original_column_name>_<key>
- Expand All menu option on the channel expands all mapped structs one level down.
- Add parameter on agent for user to choose whether to launch retries after agent restart
- Add clone button to group settings page
- Capture database timeout error when ingestion fails
- Reformatted spark job error captured in DataForge UI log
- Added new "Caused by" log message containing error root cause details
- Fix job run ID hyperlink to Databricks not working for first period of time
- Opening the Databricks job link immediately after job start now waits until the link is ready and then opens job details in a new tab
- Filter out Maintenance and Cleanup schedules from Source schedule list
- Maintenance and Cleanup schedules are no longer selectable in the source schedule dropdown
- Persist column selections and filter expressions on Data View tab
- Data viewer filter and column selections are now persisted for every source and are restored when user returns to data viewer tab
- Added pull now button to Unmanaged External source
- Remove or NULL s_row_id for non-file sources
- Added new Generate Row Id source parameter
- When selected, the s_row_id column will contain a unique id for each input
- When unselected (default), s_row_id will contain null.
- We recommend keeping this off, as doing so greatly reduces the size of data files and improves performance
- Add duplicate feature to Cluster Configurations, Process Configurations, Connections
- Solved double-use of disable ingestion flag during deployment
- Deployment now adds a status_name = 'deployment' record to meta.system_status prior to deployment and no longer uses the `disable-ingestion` system_configuration parameter.
- Upgrade Azure to Postgres 16.1
- Make import/export yaml attributes optional
- Removed all values containing defaults from export yaml files. This results in a 30-40% reduction in yaml file sizes, making them easier to manage
- Build import for core1.0 format
- Enabled import of projects created in dataforge core
- Create open source CLI
- See the dataforge-core documentation
- Enable complex attribute expressions in source CDC parameters
- Added new source parameter "Select List" with default value *
- It enables users to change the raw schema by applying a select list in the format of the Spark selectExpr method: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#selectExpr(exprs:String*):org.apache.spark.sql.DataFrame
- Struct attributes can be expanded by specifying struct1.* or struct.id expressions (see the sketch after this item)
- Expanded attributes can then be referenced in CDC parameters as keys, timestamps, etc.
- Parameter does not apply to agent ingestions
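- For illustration, here is a minimal PySpark sketch of the selectExpr semantics behind the Select List (column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, ("a@example.com", "2024-01-01"))],
    "id int, struct1 struct<email:string, updated:string>",
)

# A Select List of "*, struct1.*" keeps every column and also expands the struct's keys,
# which can then be referenced directly (e.g. as CDC keys or timestamps)
reshaped = df.selectExpr("*", "struct1.*")
reshaped.printSchema()  # id, struct1, email, updated
```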
- Add two values to spark conf at runtime for the bulk reset CDC
- Fail Custom Process when SDK version <> main version
- Added a check to the SDK (custom ingest, parse, and post-output) that compares the SDK version (defined in the jar attached to the cluster executing the SDK notebook) with the workspace version. It also validates that the SDK references the correct com.dataforgelabs.sdk package.
- Note: the legacy SDK package is no longer supported in 8.0
- YAML relations updates
- Updated export file format:
- moved relations from source yaml files to a single, consolidated relations.yaml
- removed uids and streamlined the format, for example:
name: "[SalesOrderDetail]-self raw-[SalesOrderDetail]"
expression: "[This].productid = [Related].productid"
cardinality: "M-M"
- Handle hard deletes in Source for Custom Refresh
- Added new "Append Filter" optional parameter to custom refresh sources. It defines boolean filter expression that is applied to received batch (input) before appending it to hub table during refresh. This allows to ingest data from sources with delete markers, and filter out the delete marker records from the hub table. Deletes are thus propagated to both the hub and output tables.
- It also enables more advanced custom refresh configurations whenever it is required to filter ingested batch before appending it.
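- As a conceptual sketch (plain PySpark; the delete-marker column `_op` and the filter expression are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical incoming batch with a delete-marker column `_op`
batch = spark.createDataFrame(
    [(1, "upsert", 100), (2, "delete", None)],
    "id int, _op string, amount int",
)

# An Append Filter is a boolean expression applied to the batch before it is appended,
# so delete-marker rows never land in the hub table
filtered = batch.filter("_op <> 'delete'")
filtered.show()
```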
- Update Postgres Jdbc Drivers to 42.5.5
- Add start cleanup to UI in system configuration
- Add LLM integration for Connection metadata to generate sources
- Add new status "C" for sources that were skipped by cleanup for > 7 days
- Added new icon and source status to display warning for sources that have not been cleaned up for 7+ days.
- Enable generic JDBC for Agent ingestions
- Enables generic JDBC agent connections (previously only available for Spark ingestions)
- A generic JDBC connection uses a connection string in JDBC format and optional JDBC parameters in JSON key-value format
- Before using the connection, place the jar for the JDBC driver into the DataForge folder where the agent is installed
- Generic JDBC connections are not supported for local agents - please use the faster and more efficient Spark ingestion instead
- SaaS Upgrades and Maintenance scheduling (non-major)
- "Automatic Upgrade" schedule is replaced by "Maintenance", with default setting at every Wednesday @3PM UTC.
- Maintenance schedule is modifiable my customer; however it needs to be set to workday weekday, with time between 2PM and 8PM UTC.
- Automatic Upgrades to minor and patch versions are now enabled for all workspaces.
- If the automatic upgrade does not require a database deployment, the upgrade happens seamlessly without processing or UI interruption
- When a database change is required, the upgrade is auto-scheduled for the next Maintenance window.
- When the upgrade starts, the system disables new ingestions and waits 1 hour for all currently running processes to complete. If processes are still running at the 1-hour mark, the deployment stops them and captures the stopped process_ids in the deploy.upgrade table
- Once all processes are stopped, the deployment proceeds. Upon completion of the deployment, stopped processes are automatically queued for retry.
- Note: major version deployments (e.g. 8.0), as well as deployments requiring infrastructure changes, are performed during the same Maintenance window.
- Change filters for connections
- Add API and Events in Connection Type filter on Sources page
- Added schedule option to Unmanaged External sources
- Change new cluster default node type to m-fleet.xlarge for AWS workspaces