The DataForge 7.1.0 release is here with a revamped user interface, new features and developer productivity enhancements!
Any workspace currently running a 7.0.x version can upgrade to this latest version. To bring your workspace up to 7.1, follow the 7.1 upgrade guide. For assistance with the upgrade, please submit a request to the DataForge support team.
Upgrade Actions and Notes
Pre-Upgrade Action: Set the "AutoUpdate" checkbox to true on each agent in your environment. When this box is checked, agents will auto-upgrade to the 7.1 version after the workspace upgrade is complete. Agents still running 6.2.x versions will not be able to ingest data after the upgrade to 7.1 if this checkbox is not set.
Pre- or Post-Upgrade Action: DataForge no longer supports the first version of the SDK. To avoid any issues with job runs, update all notebook SDK references to the latest DataForge SDK (com.dataforgelabs.sdk). Follow the migration guide here to update any notebooks you may have.
Post-Upgrade Action: Existing rule, rule template, and output mapping expressions are automatically converted to the new syntax during the 7.1 upgrade (see the Simplified Rule Expressions topic). Some rule template conversions may fail because of existing broken rules; these conversion errors are captured in the _rule_template_conversion_7_1 table and need to be reviewed before the impacted templates are used. Users should also run the Databricks notebook "7.1-post-upgrade-enr-relation-check" found in the "dataforge-managed" folder. If any errors are listed, correct the configuration manually in the DataForge workspace. For questions or additional help, submit a support request to the DataForge team.
Note: New agent authentication protocol - use the 2.0 protocol for all new agents. At a later date, DataForge will require all 1.0 protocol agents to be updated along with an upgrade to the Java version. Using the 2.0 protocol for new agents going forward will reduce the amount of work needed when that Java upgrade happens.
Note: The project export file format has changed to a new 2.0 format. The format version is shown in the meta.yaml file in the root folder of the export zip file. Due to the changes in rule, rule template, and output mapping expression syntax (see the Simplified Rule Expressions topic), the new format is not compatible with pre-7.1 export formats. In future releases, DataForge plans to maintain backward compatibility of import file formats; for example, if the import format changes to 2.1 in a future version, the import will still accept format 2.0.
Table of Contents
Automatic Keys and Relations for Database Connections
Unity Catalog Ingestion and Output
Simplified Rule Expressions for Relation traversals
Enhanced Project Import/Export
Project variables and options improve the developer experience
Watermark Column for Custom/None Refresh Incremental Loads
New Agent Authentication Protocol (v2) and MSI download button
Optional Read Replica for Postgres Metadata Database
Revamped User Interface
DataForge workspaces have a new user interface with revamped colors and tools for developers.
Color updates guide the eye to where you need to be. Following the DataForge color palette, users will see a familiar layout on each page, but with enhanced color schemes that draw the eye toward warnings and places requiring action.
Relation graphs have been modified to enhance the user experience. When users view relations in graph form, the graph now displays bottom to top. As nodes are expanded or collapsed, the graph re-renders to optimize visibility.
Source statuses on the main Source page now sort in order of actions needed. Clicking the Status column will sort the statuses so the user is immediately drawn to any sources with failures and warnings that should be reviewed.
Processing pages now jump to the dates that processes were run, so users can quickly navigate back or forward using the date selection without wasting time scrolling through dates with no processes.
Automatic Keys and Relations for Database Connections
DataForge is committed to speeding up and easing the developer workflow. With database connections, users can choose what metadata to collect from the source system, managed by two new connection parameters:
- Metadata Refresh (includes three options):
- Tables and Keys (default) collects the most granular information for each table
- Tables collects only table/view names
- None disables metadata collection for the connection
- Metadata Schema Pattern (optional):
- Specifies a LIKE pattern used to filter which schemas are included in metadata collection
When the Tables and Keys collection option is enabled on the Metadata Refresh parameter, the Connection Metadata tab will show Primary Keys and Referenced Tables for each table. Users can bulk create sources for any tables needed, as well as referenced tables recursively, and automatically create relations between the sources in just a few clicks. When bulk creating, users have options to change the source name patterns and trigger initial data pulls on all sources.
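As a rough illustration of the Metadata Schema Pattern (the schema names below are hypothetical, and the exact metadata query DataForge issues is not shown here), a pattern of 'sales%' acts like a SQL LIKE filter on the source system's schema names, roughly equivalent to:

SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema LIKE 'sales%';  -- matches sales, sales_eu, sales_archive, etc.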
For more information, visit the Connections documentation in the User Manual section.
Unity Catalog Ingestion and Output
DataForge is proud to natively support Unity Catalog in Databricks. Users can now ingest data directly from tables stored in Unity Catalog, and data can also be output to Unity Catalog through the Delta table output type. Unity Catalog requires a new connection to be created with the catalog saved on the connection. By default, all connections will continue to use the Hive Metastore unless otherwise specified.
For more information, visit the Unity Catalog Connection documentation.
Simplified Rule Expressions for Relation traversals
Relations and rule expressions can be hard to navigate when there is a long chain of relations. DataForge removes this difficulty by simplifying rule expressions to just the target source name in brackets, with no need to write the relation chain by hand. Relations and relation chains are now expanded into an Expression Parameters section below the rule expression, making relation chains legible and easy to update with a drop-down. After typing [ into the expression, DataForge lists all sources reachable from the [This] source via active relations.
Pre 7.1 Rule Expression: [This]~{Relation Name 1}~[Source 2]~{Relation Name 2}~[Source 3].attribute
Post 7.1 Rule Expression: [Source 3].attribute (relations shown in expression parameters)
Once the user selects the destination source, DataForge will pick the best relation path and display it in the Expression Parameters section below the expression. Users can change any part of the path using the presented drop-downs. Where applicable, additional hops in the relation chain can be added with an "Add Next" button to expand the traversal. Relation paths displayed in Expression Parameters have intuitive labels formatted as [From Source Name]->relation name->[To Source Name].
Hovering over the selected relation path gives users an extra level of detail, showing the relation expression used, including primary and foreign keys. The editor also tracks the cardinality of the relation path and whether the expression is wrapped in an aggregate function, and informs the user when an aggregate is required in the expression.
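For example, using the hypothetical [Source 3] and attribute names from the expressions above, and assuming the usual SQL aggregate functions are available in rule expressions, the editor behaves roughly as follows:

One-to-many relation path without an aggregate (editor warns that an aggregate is required): [Source 3].attribute
Same path wrapped in an aggregate (accepted): SUM([Source 3].attribute)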
Users can click each attribute name in the parameters to automatically highlight where that attribute is referenced in the rule expression, which can be extremely helpful in complicated rule expressions.
Rule templates also follow the new simplified syntax, and include additional parsing to generate and save expression parameters. This achieves faster and more robust linking of templates to sources.
For more information, visit the Rules documentation in the User Manual section.
Enhanced Project Import/Export
Importing a project now includes an automatic check for whether any sources will be deleted as part of the import. If one or more sources would be deleted because they are not included in the import files, the import is paused and assigned a pause status. By opening the import logs, users can see the list of sources that will be deleted if the import proceeds. Clicking the pause status icon shows the warning message and the options to cancel the import, fail the import, or proceed with the import.
Project exports no longer include created by / updated by users and datetimes. When a project import runs, all objects that are new or changed are marked with a Created by or Updated by user of "Import" plus the import number, which can be found on the Project Imports screen. Created and Updated datetimes are set to the time the import was started. This greatly improves auditability and debugging, as users can see exactly when a configuration was changed and with which import files.
Logs for Project Imports now include detailed counts of all objects that were updated and deleted during the import.
Additional Import/Export enhancements:
- Import folder name is now listed in the Project Imports page for auditability
- Relation UIDs in export files have been replaced with human-readable [Source A]-relation-[Source B] strings
- Exhaustive validations for rule and output column mapping parameters have been added to the import process (validate relation chain points to correct source, validate parsed expression against listed parameters/aggregations, validate parameter attributes match the container source)
These project enhancements achieve the following objectives:
- Enable change tracking of objects created and modified by imports. Previously, tracking changes was difficult because create/update attributes were copied from one workspace to another.
- Reduce the size of export yaml files
- Enable better source control by removing unnecessary "noise" to allow users to focus on real configuration changes
- Allow easier and safer manual edits to export files
For more information, visit the Projects documentation in the User Manual section.
Project variables and options improve the developer experience
DataForge projects overhaul how developers manage CI/CD pipelines, with the ability to integrate DevOps best-practice tools like GitHub. In this release, projects have been matured to improve the developer experience through a number of new tools.
Project variables can now be added to a project so that configuration names are replaced by a variable name when the project is exported, and matched back to a value when the project is imported into a workspace. In practice, this allows users to manage multiple copies of the same project (e.g. Dev, Test, Prod) within the same workspace without having to worry about multiple outputs pointing to the same table across projects. Variables are added to a project before exporting from a workspace; when the project is imported back into a workspace, the variables are matched and replaced with target values the user configures. When project export files that include variables are imported into a project that does not yet have those variables configured, the import fails and the variables are automatically created as placeholders. The user only has to populate the variables with values and restart the import!
Projects now include a Disable Ingestion flag that users can toggle on or off to turn off ingestion on all sources within a project. This setting can be configured when the project is being created, or afterwards, to manage ingestions. When a project has ingestion disabled, a grey in-progress icon with a slash through it appears next to the project name in the site banner.
For more information, visit the Projects documentation in the User Manual section.
Watermark Column for Custom/None Refresh Incremental Loads
Sources with Custom or None Refresh now include additional parameters that allow users to use a <latest_watermark> token in source queries. The <latest_watermark> token is substituted with the max(watermark column) as defined in the Watermark Column parameter.
- Watermark Column: the column from source attributes whose MAX() value is substituted for the <latest_watermark> token in source queries
- Watermark Initial Value: the initial value substituted for <latest_watermark> when no data has been ingested yet (e.g. 1900-01-01)
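As an illustration (the table and column names below are hypothetical, and whether the substituted value needs quoting depends on the watermark column type), a Custom refresh source query configured with a Watermark Column of updated_at might filter on the token like this:

SELECT *
FROM sales.orders
WHERE updated_at > '<latest_watermark>'  -- replaced with MAX(updated_at) from prior inputs, or the Watermark Initial Value (e.g. 1900-01-01) on the first ingestion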
For more information, visit the Source Settings documentation in the User Manual section.
New Agent Authentication Protocol (2.0) and MSI download button
A new agent authentication method, the 2.0 protocol, has been added that removes the need for Auth0 for authentication. The 2.0 protocol uses an agent token that is stored hashed and encrypted in the agent-config.bin file. Customers should begin converting to the 2.0 protocol when possible, as DataForge will discontinue support for the 1.0 authentication protocol in the future.
Customers can convert existing agents by changing the Auth Protocol parameter on the agent UI page to 2.0 and saving the change. This temporarily stops the agent from communicating with DataForge. After saving the change, download the Agent Config file again from the Agents page and replace the existing agent-config.bin file on the machine where the agent is installed. After replacing the agent-config.bin file, restart the Agent service on that machine and the agent will begin communicating with DataForge again.
The agents page now includes a button to download the Agent MSI file directly from the UI. Users no longer need to navigate to their respective cloud storage system to download the MSI file for new installations.
For more information, visit the Installing a New Agent documentation in the User Manual section.
Optional Read Replica for Postgres Metadata Database
DataForge now supports an optional failover node for the Postgres metadata database. To deploy a second Postgres node, add a Terraform variable "read_replica_enabled" with a value of "yes". A second Postgres node provides redundancy if the cloud provider has an issue with the cluster running Postgres. Note that adding this read replica has additional cost implications. If the first Postgres node goes down for any reason, traffic and metadata are diverted to the second node.
For more information, visit the Adding Failover Node to Postgres Database documentation in the Operations section.
Full Changelog
- Automatic Keys and Relations
- See blog post for details
- Detect refresh type changes on import and force/recommend a reset all CDC
- Force a Reset All CDC if a refresh type is changed as part of an import
- Turn loopback_ingestion process type into sparky_ingestion
- Consolidated with sparky_ingestion process type. Simplified code.
- Turn Databricks SaaS code into a Module and publish on TF registry
- Databricks SaaS starter is now available on Terraform Registry as a module for both AWS and Azure
- Default project schema is "dataforge" for all new customers
- New environments will now use "dataforge" as default project schema
- Add download link for MSI to UI
- Added MSI download button to Agents page to simplify the agent installation process and facilitate agent installation for SaaS customers
- Migrate Auth0 Rules to Actions (Should be done by Oct 16, 2023)
- Auth0 Rules have now been migrated to Actions to comply with deprecation of Auth0 Rules in 2024
- Agent authentication with custom key
- Authentication protocol 2.0 is now available for Agents, this will replace Auth0 if turned on for an agent
- Add <watermark_value> column to use in source query filter with custom AGG expression
- Added 2 new parameters for sources with Custom and None refresh types:
- Watermark Column: defines a raw attribute of any type. The max value of this column is tracked for each input and then substituted for the <latest_watermark> token in the ingestion query for the next input
- Watermark Initial Value: defines the value to use for the <latest_watermark> token for the first input
- For tracking and troubleshooting purposes, min_watermark and max_watermark values are saved in the meta.input.cdc_status column
- Simplify relation expression syntax
- See blog post for details
- Use Delta Automerge on output
- Delta lake output has been optimized with new Delta Automerge feature in Databricks
- Add option to disable ingestions for an entire project (bonus for deactivate/reactivate outputs)
- New "Disable Ingestion" option added to the Project settings tab. When selected, it disables ingestion on all sources in the project (except manual Pull Now), and a Disabled Ingestion icon appears to the right of the project name in the main toolbar
- Project Overrides
- See blog post for details
- Get core running on Java 17
- Core is now built and running on Java 17
- Auto add Kryo Serializer to every cluster config in spark conf
- Kryo Serializer has been added to all cluster configs
- Get API running on Java 17
- API is now built and running on Java 17
- Add an Import Review page to import UX to avoid accidental deletion
- When import detects that any existing source in the project is not present in the import file, it will log a warning and pause import to prevent accidental data loss. After reviewing logs, click Restart to resume the import or Fail it.
- Speed up file deletes in cleanup
- Improved parallelization of file delete operations, improving speed and decreasing infrastructure costs for cleanup process
- Populate create/update attributes with import_id=123 during import
- Updates Import/Export format:
- removed create/update userids and timestamps
- userid of objects created/updated by import will be populated as Import + <import_id>
- timestamps will be populated as import start_datetime
- removed source names for [This] source parameters in rules and output mappings
- changed relation_uids format from guid to human readable [Source A]-relation name-[Source B] format
- Add ability to add read replica to Postgres in Terraform
- Customers will now have the ability to add a read replica to Postgres via the read_replica_enabled variable in Terraform.
- Remove Core Parse from SaaS version
- SaaS versions will no longer be able to use Core Parse process
- Redo "Convert Rule to Template"
- Opens a new rule template tab, pre-populated with values copied from the enrichment, with the test source copied from the enrichment
- User validates the template, saves it, and optionally applies it to the source
- Add "Save and Apply to Source" button to rule template
- New Save and Apply button saves the template and links it to the test source in one click
- Update UI Color Palette
- Fully refreshed the DataForge UI color scheme.
- Postgres failover testing with second node works successfully
- AWS Postgres RDS can now support adding a second node for resiliency
- Prevent 2 different outputs writing into the same schema.table in the same connection
- Added protection against multiple outputs writing to the same physical table or view
- Remove old postgres functions from db that are not in the project
- Clean up database code by removing all deprecated functions
- Delta output to unity catalog
- Delta lake output can now write to unity catalog
- Add rule template conversion check to post-deployment notebook
- Rule template conversion check has been added to post-deployment notebook that is deployed using Terraform
- Expose project name in post-output parameters for custom post-output to support custom loopbacks
- Added project_name to exposed session parameters
- Sparky ingestion from unity catalog
- Sparky ingestion can now read from unity catalog, with catalog specified
- Add single user access mode to all cluster configs on save
- Single user access mode has been added to all cluster configs on save to allow clusters to access Unity catalog
- Update Lineage Layout
- The relations graph has been updated with a new bottom-to-top flow. Additionally, the lineage graph "collapse all nodes" button has been updated to re-render the graph in a much more readable format.
- Project import log to show which file is imported.
- File name column added to imports tab
- Change managed Databricks JDBC connection to use Unity table type instead of Databricks JDBC
- Managed Databricks JDBC connection uses Unity table type instead of Databricks JDBC type
- Add JSON parameter to JDBC connections for custom props (sensitive and non-sensitive)
- JDBC connections can now add custom properties using JSON parameter inputs on the connection settings
- Fix Toolbar and Column UI on Output Mapping Page
- Removed loading bar and page movement
- Add single node to cluster config with defaults
- Single node format has been added to cluster config with defaults populated on save
- Salesforce ingestion fails when returned resultset has no header
- Salesforce ingestion now logs a warning indicating when the SFDC API rejects the source query due to an error.
- Change source status sort order
- Changed order of status sort to prioritize where user needs to take action
- Show loading icon or spinner while waiting for data in Data View
- The Data Viewer page has been updated with a loading wheel when waiting for data
- Delete automated testing record in system config table
- Removed deprecated parameter from metadata
- Grey out active flag during project create
- The Active flag is no longer selectable during project creation
- Re-write descriptions of archive parameters of file type source
- Archive parameters have had their descriptions updated
- Show a better Error In Data Viewer when Hub Table Doesn't Exist
- The Data Viewer page now shows a loading wheel until loading is complete, and the undefined error shown when no data is available has been fixed.
- SaaS updates for Agent to enable writing to customer datalake
- AWS/Azure Datalake access information will now need to be provided when enabling Authentication Protocol 2.0. For more information, see release blog and updated Agent install guide
- Change record_counts on meta.input to Long rather than Int
- Changed the data type of record counts on the meta.input table from Int to Long to support larger counts