7.0.0 Version Features Blog

The DataForge 7.0.0 major release is here with new features and optimized processing.

For instructions on upgrading to this 7.0.0 major release, please refer to the 7.0.8 Upgrade Guide documentation.

Note: The existing file paths for the DataForge SDK jar in Azure (dbfs:/mnt/jars/dataops-sdk.jar) and AWS (s3://<Environment>-datalake-<Client>/dataops-sdk.jar) will no longer be updated.  Moving forward, the file path for both cloud providers is dbfs:/mnt/processing-jars/dataforge-sdk/dataforge-sdk-$version.jar.  Replace the $version portion of the file path with the version number your environment is on (e.g. dataforge-sdk-7.0.8.jar).  Use this path to install the DataForge SDK on interactive clusters in order to run custom notebooks into DataForge or to query Postgres.  The old file paths will continue to work for a period of time so nothing breaks, but they will not receive updates.
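As an illustration only (this is not DataForge tooling), the sketch below shows one way to attach the new SDK jar path to an interactive cluster using the Databricks Libraries REST API; the workspace URL, token, cluster ID, and version shown are placeholders.

  # Hypothetical sketch: install the DataForge SDK jar on an interactive cluster
  # via the Databricks Libraries API.  All identifiers below are placeholders.
  import requests

  workspace_url = "https://<your-workspace>.cloud.databricks.com"   # placeholder
  token = "<databricks-personal-access-token>"                      # placeholder
  cluster_id = "<interactive-cluster-id>"                           # placeholder
  sdk_jar = "dbfs:/mnt/processing-jars/dataforge-sdk/dataforge-sdk-7.0.8.jar"

  resp = requests.post(
      f"{workspace_url}/api/2.0/libraries/install",
      headers={"Authorization": f"Bearer {token}"},
      json={"cluster_id": cluster_id, "libraries": [{"jar": sdk_jar}]},
  )
  resp.raise_for_status()  # installation completes asynchronously in the background

The headline features of this release are listed below, with details in the sections that follow.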
  1. Projects enable easier configuration management and migration
  2. Custom Refresh
  3. Connection Metadata and 1-click database source creation
  4. Unmanaged External Sources connecting DataForge to Databricks tables
  5. New Connection Types: Salesforce API, Generic JDBC, Agentless ingestions within VPC
  6. Faster validations for Rules and Enrichments (sparky removed)
  7. Updated support and defaulting to latest DBR LTS version
  8. Starter Configs - Managed clusters, schedules, connections
  9. View in Databricks option to open the table
  10. Users tab for viewing who has access to the environment
  11. SaaS Offering of Platform

 

Projects enable easier configuration management and migration

Projects are top-level containers that group sets of configurations within an environment and make it easy to migrate configurations.  Projects are now the primary vehicle for exporting and importing configurations.  For more information on how to use Projects, refer to the Project documentation.

There are two primary intended uses for Projects which are listed below.  Projects are intentionally not tied to any specific set of sources or environments so users have full control of managing the configurations in their environments.

  1. Migrating full environments between different stages of Development, Testing, and Production.  Each project contains a full set of all sources, outputs, and templates developed in DataForge.
  2. Maintaining multiple projects for independent workstreams.  This assumes users have multiple "live" projects that each contain different sources, outputs, and templates related to separate workstreams.

Users can create as many projects as necessary within an environment.  When an import is run into a Project, all of the contents within the existing project are replaced.  During the upgrade process, all existing configurations are placed in a 'Default' project folder.

Sources within Default project

New Projects Management Page

Custom Refresh

Custom Refresh provides ultimate flexibility in managing how source data is stored and processed.  Setting a source Data Refresh type to Custom Refresh allows users to set a custom Delete Query and optional Partition Columns to use during the Refresh process of all inputs.  

CDC processing is not included for Custom Refresh sources, since the Delete Query defines which records are eliminated from the hub tables when they are no longer wanted.
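As an illustration only, here is the kind of date-based retention logic a Delete Query typically expresses, written as the Delta SQL a notebook could run by hand; the table and column names are hypothetical, and in practice the query is entered in the source's Custom Refresh settings rather than executed like this.

  # Hypothetical example of Delete Query logic for a Custom Refresh source.
  # "hub.sales_orders" and "order_date" are made-up names.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # Keep only the trailing 90 days in the hub table; older records are
  # removed during the Refresh process.
  spark.sql("""
      DELETE FROM hub.sales_orders
      WHERE order_date < date_sub(current_date(), 90)
  """)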

For more information and usage examples, refer to the Custom Refresh documentation.

Custom Refresh settings on a source

Connection Metadata and 1-click database source creation

Connection Metadata lets users create new sources from source database connections with one click.  A new tab called "Connection Metadata" displays on each Connection page, where users can see all tables and views in the database that the connection user has access to.

For more information, refer to the Connection documentation.

Users can create one or more new sources directly from the Connection Metadata tab using the checkbox selections and the Create Source button.

To view sources already set up to pull from specific tables or views, click the number hyperlink in the Sources column.

Unmanaged External Sources connecting DataForge to externally managed tables

Unmanaged External sources make it possible to create relations and rules that point to tables stored in Databricks and processed/managed outside of DataForge.

Users create a new source within DataForge and set the connection type to Unmanaged External, which allows a fully qualified table name to be defined.  The source runs through an initial ingestion/parse to understand the external table schema.
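For illustration only, the snippet below shows what that fully qualified reference looks like from a Databricks notebook; the table name is hypothetical, and this is not how DataForge performs its parse internally.

  # Inspect an externally managed Databricks table that an Unmanaged External
  # source could point to.  The table name is a placeholder.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  external_table = "main.analytics.dim_customer"   # hypothetical fully qualified name
  spark.table(external_table).printSchema()        # the schema read during the initial parse
  spark.table(external_table).limit(5).show()      # sample rows, managed outside DataForge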

After relations and rules are created to reference the unmanaged external table, DataForge will look to this table during the Enrichment and Recalculate processes to make updates to existing hub tables.  Data duplication checks are in place to ensure the grain of existing sources is not changed as external data is referenced.

For more information and examples, refer to the Unmanaged External Source documentation.

New Connection Types: Salesforce API, Generic JDBC, Agentless ingestions within VPC

New source connection types have been added to DataForge to continue expanding native support.  

Agent-less Ingestions within VPC

Prior to the 7.0 release, a Local Agent was necessary to ingest data accessible within the Cloud VPC.  A Local Agent is no longer required, and users can choose whether or not to use an agent when creating connections.

Any data sources previously accessible through the Local Agent can be updated to no longer use the agent for ingestion.  Connections not using the Local Agent show as Sparky Ingestion in the UI and ingest data via a Databricks cluster that can be tuned.

Generic JDBC

Connect to any database through a JDBC connection.  Enter the connection string, driver class path (when needed), and any sensitive parameters such as tokens or passwords.  Sensitive parameters are entered/stored in a JSON key/value pair format and are encrypted.
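As an illustration only, the sketch below shows the same pieces (connection string, driver class, and sensitive parameters) as a plain Spark JDBC read; the URL, driver, table, and credentials are placeholders, and in DataForge they are entered on the connection rather than in a notebook.

  # Hypothetical generic JDBC read; all values are placeholders.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  jdbc_url = "jdbc:postgresql://db.example.com:5432/sales"             # connection string
  driver_class = "org.postgresql.Driver"                                # driver class (jar attached to the cluster)
  sensitive_params = {"user": "reporting_ro", "password": "<secret>"}   # stored encrypted as JSON key/value pairs

  df = (
      spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("driver", driver_class)
      .option("dbtable", "public.orders")                               # placeholder table
      .options(**sensitive_params)
      .load()
  )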

For more information, refer to the Generic JDBC Connection documentation.

Salesforce API

Connect directly to a Salesforce environment object by creating a new Salesforce API connection.  The connection will authenticate through client credentials, JWT Bearer, or username-password.

Users type the SOQL statement into the SQL Query setting on any source that uses a Salesforce API connection.
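For illustration only, a hypothetical SOQL statement of the kind that would go into the SQL Query setting (the object and field names are examples):

  # Example SOQL for a source that uses a Salesforce API connection.
  soql_query = """
      SELECT Id, Name, Industry, LastModifiedDate
      FROM Account
      WHERE LastModifiedDate >= 2024-01-01T00:00:00Z
  """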

For more information, refer to the Salesforce Connection documentation.

Example Salesforce API connection

Example source settings

Faster validations for Rules and Enrichments (sparky removed)

Relation and Rule validation no longer requires a mini-sparky Databricks cluster to be running to validate syntax and allow users to save.  This drastically speeds up the creation and validation process as users can quickly rewrite logic and save in seconds.  

Whenever validations occur, a local Spark session running inside the API container validates the relations and rules.
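For illustration only, the general idea behind cluster-free validation can be sketched as parsing a rule expression against an empty DataFrame in a local Spark session; the schema and expression below are made up, and DataForge's actual validation logic inside the API container is not shown here.

  # Sketch of validating a rule expression with a local Spark session.
  from pyspark.sql import SparkSession
  from pyspark.sql.types import StructType, StructField, StringType, DoubleType

  spark = SparkSession.builder.master("local[1]").appName("rule-validation").getOrCreate()

  schema = StructType([
      StructField("quantity", DoubleType()),
      StructField("status", StringType()),
  ])
  empty = spark.createDataFrame([], schema)

  rule_expression = "CASE WHEN status = 'CLOSED' THEN quantity * 1.1 ELSE quantity END"
  empty.selectExpr(rule_expression)   # raises an exception if the expression is invalid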

Updated support and defaulting to latest DBR LTS version

The default Databricks Runtime for all job cluster configurations is upgraded to 11.3 LTS with this release.  Any custom pools created in Databricks and referenced in DataForge clusters should be recreated and re-referenced to use this updated runtime version. DataForge recommends switching pool cluster configurations to be job cluster configurations using the Spot with Fallback settings.  For more information on the recommended cluster configurations, refer to the production cluster guidance documentation.

The mini-sparky cluster and pool are no longer needed for any processing within DataForge.  Due to this, the upgrade process will also convert all existing cluster configurations that use the "Job from Pool" setting along with the Sparky pool ID to now be a job cluster with the same instance type as was defined on the cluster pool. 

DataForge will leave all existing Sparky cluster pools in Databricks that were previously set up to avoid complications with custom notebooks created outside of DataForge that may still reference the compute. 

Starter Configurations - Managed clusters, schedules, connections

DataForge now creates and manages select cluster configurations, schedules, and connections in every environment.  

Cluster Configurations

Infrastructure can be hard to troubleshoot, and processing compute isn't always easy to scale up or down quickly for those unfamiliar with the process.  To alleviate this, DataForge automatically creates a set of t-shirt-sized clusters that users can attach to Process Configurations.

The following cluster configurations are DataForge Managed and are not editable by users.  Please note that larger clusters use more expensive instance types and more workers, which can increase cloud infrastructure costs.  Typically, these larger clusters also process data more quickly, so the additional cloud costs can be offset.

In addition, the following cluster configurations are automatically created for use in the platform.  DataForge reads these cluster configurations by name for certain processes so users are unable to edit the names, but can edit settings and parameters within the cluster configuration.

  • Cleanup: Used every time the Cleanup process runs.  
  • Connection Test: Used every time a new connection is created and saved.  Runs once a day to retest connection status and connection metadata.
  • Databricks JDBC: Set up to use when ingesting data into sources through the Databricks JDBC Samples connection.  Includes Databricks JDBC library in parameters.
  • Developer: Optional use.  This is configured with an extended Idle Time setting of 30 minutes for users who are doing extensive development or don't want to wait for a cluster to launch.

Connections

Since Databricks is a commonly used connection, DataForge creates two connections during deployment.

  • Databricks Hive Metastore: allows sources to pull source data from any tables/views stored in the Hive Metastore catalog within Databricks.  Authenticates through implicit credentials.

  • Databricks JDBC Samples: allows sources to pull data from any tables/views stored in the Samples catalog within Databricks.  This connection uses the Generic JDBC option, with the connection string auto-populated to use a Databricks SQL endpoint to query the data.  Users need to initially provide credentials in this connection by generating a Databricks personal access token and saving it in the Connection JDBC Sensitive Parameters setting in the format:  {"PWD":"<personal-access-token>"}

 

Schedules

DataForge relies on certain schedules to perform maintenance and upgrades.  The following schedules are created by DataForge and read by name.  Users can change the schedule values to adjust frequency and timing as needed, but cannot change the schedule names.

  • Cleanup:  Identifies the day and time that the DataForge System Cleanup process should begin.  For more information, refer to the Cleanup Configuration documentation.
  • Automatic Upgrade: Identifies the day and time that DataForge should try to auto-upgrade to the latest available release version.  Once the schedule time starts, DataForge will wait up to one hour for a time when there is a safe window of no processing occurring to run available upgrades. For more information, refer to the Automatic Upgrades documentation.  

View in Databricks option to view and query source data

Private deployments: Within the Data View tab on a source, a new "View in Databricks" button will appear near the top.  Clicking this button will open a new browser tab directly to the hub table of the source in Databricks.

SaaS deployments:  The Data View tab on any source will open a new browser tab directly to the hub table of the source in Databricks.  When navigating back to DataForge after the Databricks tab is opened, users will see an empty screen on the Data View tab and can navigate back to any other source tab.

After the Databricks Data Explorer tab opens, users can view metadata or sample data directly from the hub table.  Query the data by using the Create button to create a new SQL Editor or Notebook query. 

Users tab for viewing who has access to the environment

A new Users tab is available in the main navigation menu of DataForge.  Depending on the type of deployment, users will experience different options.

Private Deployment: Users who have accounts that can access DataForge through Auth0 are displayed along with created dates and last login date/times.  To update user access, Admins need to manage users through Auth0.

Users tab for private deployments

SaaS Deployment: DataForge users can add new users directly from the Users page if they are given Admin access in the same settings.  Non-admin users can see the list of environment users but cannot make changes.  Admins can manage other users' access by enabling or disabling Admin access or by deleting accounts to remove access.

Users Tab for SaaS deployment

Managing an existing or new user as an Admin

SaaS Offering of Platform

DataForge is now available as a SaaS offering! This allows customers to manage everything they want without having to worry about the platform's infrastructure.

Customers need to have a Cloud provider and a Databricks workspace set up.  A new DataForge environment is deployed based on the Databricks workspace identified. Customers continue to host all of their own data and compute infrastructure.  DataForge manages the platform infrastructure and metadata to host the services.

If you are interested in a new SaaS environment deployment or switching from an existing private deployment, please reach out to the DataForge team either directly or through a support request to get the process started!

Full Changelog

  • Create source from connection
    • Add Pull Now button
    • Removed auto-scheduling from sources created from connection metadata
  • Move All Containers Instances to Container Apps
    • Added ability to move from Container Instances to Container Apps in Azure to improve stability of the platform
  • Handle inactive objects during export
    • Added all objects to export/import scope. 
  • Added billing notifications for users 
  • Add "View in Databricks" button to open hive table
    • Added button to open databricks table view
  • Cleanup additions: hub views, output views, imports
    • Added cleanup of unreferenced views and imports
  • Highlight or add column to show managed clusters to make it easy to sort/identify them
    • Added column to view managed clusters
  • Default schedule
    • Added default flag for schedule. All new sources are automatically scheduled with default schedule
  • Create UI for adding users to IDO workspace, UI/API path to reset password and save in metastore
    • Added new "users" tab to the main menu. 
    • Non-SaaS functionality:
      • Users page is read-only. User list is auto-populated after login, and last_login_datetime is updated for each user
      • Users can open their record details and reset password
    • SaaS functionality:
      • In addition to the non-SaaS functionality, admin users can create new users, inactivate others, delete accounts, and reset passwords
  • Change relations visual to not show one level down-grain
    • Added option to show/hide "cousin" relations. This helps de-clutter screen for complex configurations
  • Partition environment by Project
    • Added project as a top level partition for the environment
  • Add Retry option to input reprocess options
    • Added Retry Failed Process option at the source level
  • Remove Plugins from Agent
    • Agent plugins are no longer supported. Oracle and SAP Hana JDBC connections are now added to main Agent, and are also usable in Sparky Ingestion
  • Move hub views to project-specific db
    • all hub views now reside in the database(schema) defined at the project level
  • Add Archive File option for file ingestions and change archive behavior to read this check
    • Archive File is now a toggleable parameter for file ingestions
  • Update Azure storage libraries to minimum safe (ideally latest) versions
    • Azure storage libraries updated to latest versions
  • Move AWS Topic Alert logic into Core, with same logic for Emails
    • Backend changes for Alerts
  • Build ACA API In Terraform to replace the current ACI API In Azure 
    • Built a new Container App API for Azure environments to improve stability of the platform
  • Add failed ingestions to cleanup processing
    • Clean up failed ingestions
  • Agent-less source & output connection status validation
    • Connection status and metadata load for connections not using the agent
  • Create starter configs on environment deploy
    • default and example configs are now added to customer environments
  • Create Source level status for disabled initiation
    • disabled initiation is now visible in the source UI header
  • Create schedule file ingestion sparky
    • Enable sparky ingestion for files. Supports ingestion from AWS, Azure. 
    • Supports AWS file ingestion from private and public buckets
  • Move Sparky-Core routes to API
    • Enabled external sparky-core API routing to support SaaS clients
  • Add environment name to browser tab text
    • Environment name is now visible on the browser tab
  • Salesforce Ingest
    • Full support for querying and ingesting Salesforce Objects without the need for a custom notebook. Also now supports the modern OAuth 2.0 authentication flows to follow enterprise security best practices.
  • Scoped import/export + git integration
    • Import/export 2.0
  • Create postgres login for Sparky/SDK & configure DB permissions
    • moved all sparky postgres database calls to distinct sparky user with limited set of permissions
  • Custom source refresh type
    • New refresh type allows users to configure advanced data refresh scenarios. Uses delta lake engine, supports custom partitioning
  • Create new unmanaged source type to allow lookups to existing databricks tables
    • New source type enables additional integration with existing datasets
  • Improve performance of bulk reset all CDC + delete input
    • Optimized bulk reset CDC and delete operations
  • All environments should use condensed secrets, in the form of "system-configuration-<environment_id>"
    • Public and private secrets in Secrets Manager/Key Vault have been merged to one secret, called "system-configuration-<environment_id>"
  • Remove cleanup concurrency multiplier parameter
    • Remove redundant parameter, Cleanup concurrency/performance is now fully controlled by # of driver cores in Cleanup cluster configuration
  • Remove raw attribute case fix from db post-deploy post 6.2
    • Removed migration code
  • Kill mini-sparky in API
    • Removed mini-sparky interactive cluster. Replaced with built-in spark cluster running in API container.
  • Remove sparky heartbeat
    • Removed redundant process to minimize locking contention.
  • Update output process to wrap the Delete/Insert statement pair within a transaction
    • Snowflake Outputs now leverage a transaction for the delete/update/insert statements to avoid temporary periods where data was deleted, but not yet inserted - causing potential incorrect downstream reporting queries to be executed.
  • Update Source List Filter to retain last top level selection
    • Source search UX improvements
  • Create Generic Spark JDBC source type/connection
    • Spark ingestion supports connection using generic JDBC drivers. User needs to select generic_jdbc as Driver in connection, enter valid driver-specific jdbc connection string, add additional sensitive parameters as required, and attach jdbc driver jar to the cluster configured for sparky ingestion.
  • Publish sparky and sdk to s3, create notebook to init mount in customer workspace
    • Sparky and SDK are now accessed via mount in Databricks workspace
  • Sparky-Core SaaS infrastructure
    • Sparky talks to Core directly in SaaS version
  • Add system_flag boolean to config tables to denote a config created by IDO processes + managed_flag for cluster configs
    • System and managed flag will discern platform created configs in UI
  • Remove and Change UI Elements for SaaS Environments
    • UI updates for SaaS
  • Update spark version in TF
    • updated sparkVersion to 11.3.x-scala2.12
  • Enable support and default to latest databricks runtime (10.4 or 11.3)
    • Upgrade default databricks runtime version (11.3)
  • Connection Metadata & 1 click database data ingestion
    • User can create multiple sources with one click from connection metadata page
