6.0.0 Major Release Blog

The DataForge 6.0.0 major release is here with several bug fixes, new features, and Beta Lineage.

  1. Persist Processing Unit weights and calculations and display in UI 
  2. Detailed tracking and automated management of raw schema 
  3. Add "Max Parallel Clusters" parameter to cluster configuration 
  4. Stop idling sparky job runs when core is not running 
  5. Upgrade Postgres version of metadata database to 14 
  6. Ingestion and scheduling refactor for all ingestion types 
  7. Global dataflow visualization: Lineage 

Persist Processing Unit weights and calculations and display in UI 

By: Vadim Orlov 

DataForge now provides transparent Processing Unit (PU) usage, with data persistence, collection, and visualization in the DataForge user interface (UI). Centi-PU (cPU, or 1/100th of a PU) values are now visible in various DataForge tables: sources, rules, outputs, output mappings, and processes. 

cPU values are persisted in the following tables: meta.enrichment, meta.source (sum of enrichment cPUs), meta.output_source_column, and meta.output_source (sum of mapping cPUs). 

Users will also find the total PU and a detailed breakdown in history.process, calculated with a formula that combines these cPUs with data volume factors. 

Usage history can now be used in queries and reports via the meta.process_history table in Databricks. Here is an example query returning the total usage for each process record: 

select from_json(ipu_usage, 'base decimal(7,2), cdc_volume decimal(7,2), 
refresh_type decimal(7,2), rules decimal(7,2), refresh_recalc decimal(7,2),
output decimal(7,2), total decimal(7,2)').total from meta.process_history 
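
Building on the example above, usage can also be aggregated for reporting. The sketch below sums total PU per day; it is a hedged illustration that assumes a hypothetical start_datetime column on meta.process_history (only the ipu_usage column is confirmed by the example above), so adjust the column name to match your environment:

select date(start_datetime) as run_date, -- start_datetime is a hypothetical column name
       sum(from_json(ipu_usage, 'total decimal(7,2)').total) as total_pu
from meta.process_history
group by date(start_datetime)
order by run_date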

 

Detailed tracking and automated management of raw schema 

By: Vadim Orlov 

Users will find that the previous pain points associated with stale raw attributes have been greatly mitigated with DataForge version 6.0.0. 

For instance, detailed tracking of raw attributes for each input in the input.raw_attribute_ids column has been added to DataForge.  

In addition, two new flags have been added to the Raw Schema table to help visualize raw attribute usage. The first is the Target flag, which indicates whether a raw attribute is referenced in either a rule or a mapping. The second is the Inputs flag, which shows whether a raw attribute exists in at least one input. 

If at least one raw attribute has a value of “false” for both flags, the option to delete the orphaned attribute becomes available. Simply click “Delete Orphan Attributes” to remove them from the Raw Schema table. 

And finally, on the Inputs tab, a new option titled “View Raw Schema” is available within the triple dot menu. Clicking “View Raw Schema” opens a new tab showing the Raw Schema table filtered to the attributes included in the current input. 
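
For readers who prefer to inspect orphans outside the UI, the flag logic can be sketched in SQL. This is illustrative only: meta.raw_attribute, raw_attribute_id, is_target, and is_input are hypothetical names standing in for the actual metadata schema, which is not documented here.

-- Hypothetical sketch of the orphan check behind "Delete Orphan Attributes"
select raw_attribute_id            -- hypothetical identifier column
from meta.raw_attribute            -- hypothetical metadata table
where is_target = false            -- Target flag: not referenced in any rule or mapping
  and is_input = false             -- Inputs flag: not present in any input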

 

Add "Max Parallel Clusters" parameter to cluster configuration 

By: Vadim Orlov 

The new “Max Parallel Clusters” parameter limits the number of concurrent Job Runs for each cluster configuration. When active, queued jobs stay in the process queue until an active Job Run is available. This allows for more efficient processing since redundant startups are minimized.  

However, if the “Max Parallel Clusters” parameter is set too low, users might see lower speeds because all work is forced through a small number of Job Runs. The default value is 1,000, which mimics the previous, unconstrained DataForge cluster behavior. 

As a best practice, environments with multiple sources ingesting data on the same schedule should use a cluster configuration with a “Max Parallel Clusters” value of one half the number of sources; for example, an environment with 20 sources sharing a schedule would start at a value of 10. If users are seeing reduced speeds due to idle job starts, reduce the parameter value. If processes have long queue times, increase the parameter value to reduce time spent waiting in the queue. 

 

Stop idling sparky job runs when core is not running 

By: Vadim Orlov 

Previously, there was an issue with spark jobs idling indefinitely when the core was stopped. In DataForge v6.0.0, this issue has been resolved. This change primarily affects non-production environments, such as development environments, that are not always up and running. 

 

Upgrade Postgres version of metadata database to 14 

By: Vadim Orlov 

Previously, in DataForge v5.2.0, Postgres version 10 was used in both Azure and AWS environments. Now, after upgrading to the AWS Aurora version 2 engine, DataForge is able to run Postgres version 14 in both environments. 

These version upgrades are exciting for DataForge users for multiple reasons: 

  • Additional granular scaling options 
    • Aurora version 1 (v1) was able to scale in geometric proportion, from 2 to 4, 8, 16, etc. ACUs 
    • Aurora v2 scales in 0.5 ACU increments and has a lower minimum of 0.5 ACUs (for example, stepping from 8 to 8.5 ACUs rather than doubling to 16) 
  • Faster scaling 
    • Aurora v1 moved the workload to different compute nodes as part of a scale operation, which prevented scaling from occurring while active transactions were present 
    • Aurora v2 scales up and down in place, allowing transactions to continue to run. 
      • As a result, v2 scale operations typically take just a few seconds and are virtually unnoticeable 
  • Improved User Experience (UX) 
    • Processing resiliency 
    • Cost savings for AWS customers 

Azure deployments have also been upgraded to run Postgres version 14 with Azure Database for PostgreSQL - Flexible Server. 

 

Ingestion and scheduling refactor for all ingestion types 

By: Joe Swanson 

Scheduling has been refactored to use two new tables: meta.ingestion_queue and history.ingestion_queue. This affects all ingestion types: custom, sparky, loopback, table, and file. 

Now, when users attach a schedule to a source, a record is inserted into the Ingestion Queue table with the next time the Ingestion process should run. Core checks this table every 30 seconds and creates a process if the scheduled time is ready to run. If the process is run by Sparky, a Databricks job is launched and run in a Databricks workflow. If the process is run by an Agent instead, the Agent picks up the queued process on a configurable heartbeat interval and runs it. 
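
Conceptually, Core's 30-second check amounts to selecting queue entries whose scheduled time has arrived, roughly as in the sketch below. Only the meta.ingestion_queue table name is confirmed above; source_id and next_run_datetime are hypothetical column names used for illustration.

-- Hypothetical sketch of the scheduling check Core performs every 30 seconds
select source_id, next_run_datetime        -- hypothetical column names
from meta.ingestion_queue
where next_run_datetime <= current_timestamp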

“Pull now” has also been reworked; it now creates an Ingestion process immediately when clicked. The file watcher in the Agent has been refactored too. It now picks up all files it finds on its heartbeat interval, creates processes for them, and then immediately begins processing as well. 

These updates optimize file ingestion to run faster and more efficiently during large file uploads and uploads of multiple files. 

Agents on 5.2.0 will be able to auto-update to 6.0.0 using the normal process, but they will no longer be able to run Ingestion processes until they update to 6.0.0. 

 

Global Dataflow Visualization: Lineage  

By: Mary Adams EC  

The dataflow visualization in DataForge has arrived. Known as Lineage, this interactive graph allows users to see full dataflows, from a source container to its output, from an output column back to its original source node, and much more. Due to the technology behind this feature, it is currently in beta and will receive many updates in future releases.

lineage_large.png

Lineage starts with a node. Users can kick off a lineage graph by navigating to a DataForge table with lineage nodes. Container nodes, or parent nodes, can be found in the following locations: Sources, Relations, Outputs, and Output Source Channels. Child nodes are found in the following views in DataForge: Raw Schema, Rules, Data Viewer, and Output Mapping Columns. 

lineage_menu_outputs.png

Users have the option, when clicking on a node, to either start a new lineage instance or add the node to their current session. Users can access their current session by clicking on the Beta Lineage option located in the main menu. 

A lineage session will always start with a single node. As stated earlier, the node may either be a parent container or child node. The selected node will be highlighted on the graph. Users can navigate the dataflow related to any node visible on the graph by right-clicking on the desired node. A submenu will appear with the option to Add or Remove nodes in the dataflow. For child nodes, the option to Isolate will also be available. 

lineage_menu_origin.png

The different types of nodes are represented by combinations of colors and symbols, as are the dataflow arrows, or edges. Users will find the legend key in the upper right corner of the lineage screen; it can be toggled at any time to help differentiate between the node and edge types. 

Edges show the dataflow from left (origin) to right (destination). Edges that are thicker and colored magenta represent relations. Users can left-click these relation edges to view the relation graph. Note that the relation graph opened from lineage does not currently support editing relations and displays relations in the source_id order returned by the database. 

lineage_relation_view.png

To navigate lineage dataflows, users can select origin or destination. Origin shows the previous node in the dataflow; destination shows the next node. Both have a recursive option, which shows the start node for origin and the final node for destination. Users can also remove origins and destinations to simplify the dataflow. 

Users can expand or collapse parent nodes by clicking the left-hand corner of a container. Since some dataflows can have hundreds of edges and nodes, the option to Collapse All is also available. Expand All is an option as well, though its use is recommended for smaller graphs. 

lineage_collapsed_parent.png

Currently, lineage does not support link sharing between user accounts. Instead, users may export a PNG of their lineage session by clicking the Export button available at the top of the screen. Users can look forward to link sharing in a future release.

As a starting point for learning lineage, users can find the legend key below.

Colors_in_Action2-horiz.png

Full Changelog

  • Add SNS alerts to Custom SDK
    • SNS alerts are now available for Custom Parse, Ingest, and Post Output Tasks
  • Remove API reload of auth0 public keys on every authenticated request route
    • API performance optimization: removed redundant fetch of Auth0 public keys
  • Add agent code to parameters json on process record, add "fail ingestion" mechanism to core
    • Agent code is now included in the parameters field on the process record. Core will now fail ingestions that are queued or in progress when their Agent is not heartbeating.
  • Export YAMLs should have the same order of elements when exporting multiple times
    • Export YAMLs will now have consistent ordering to make them easier to compare to each other
  • Add java security changes to Spark Conf for Azure environment
    • Azure Databricks jobs are updated and defaulted on create to add two Java security environment variables to Spark Conf so that they can work with the new Postgres Flexible Server.
  • Display scheduled times in UI on main sources page
    • Added "Next Ingestion" column to source list page. It shows next scheduled ingestion datetime for schedule-driven sources.
  • Display more information on Cluster Config list page
    • Added 3 new columns to cluster list tab:
      • Databricks version
      • Pool ID
      • Number of workers
    • New columns are sortable and filterable
  • Getting a new instance pool to show up in the UI no longer requires API restart
    • Switched instance pool id back to a free text attribute. This enables configuring clusters for newly created pools.
  • Disallow users from saving spaces in column names for Delta Lake outputs
    • Spaces cannot be saved in column names of Delta Lake outputs
  • Stop idling sparky job runs when core is not running
    • Fixed the issue with spark jobs idling indefinitely when core is stopped. Primarily affects non-production environments which are not 100% up all the time.
  • Azure AD provider upgrade
    • Azure AD terraform provider has been upgraded to version 2.25. Global admin privileges are no longer needed for the main terraform principal
  • Persist PU weights and calculations and show in UI
    • Fully transparent PU usage data persistence, visualization and collection
    • Persist cPUs (1/100ths of PU) values in tables: meta.enrichment, meta.source (sum of enrichment cPUs), meta.output_source_column, meta.output_source (sum of mapping cPUs)
    • Persist DataForge version, total PU and detailed breakdown in history.process via formula using above cPUs and data volume factors
    • Visualize PU information on source list, enrichment list, output list, output mapping and process screens
  • Deploy blank windows image to appstream, create fleet+stack that have network access to postgres
    • An AppStream fleet has been deployed and can be used to download and install pgAdmin to connect to AWS RDS v2
  • Update TF & deployment app to support Postgres 14 upgrade
    • DataForge is now able to support Postgres version 14 in both AWS and Azure. With this version upgrade DataForge users will experience benefits including additional granular scaling options, faster scaling, and an improved user experience.
  • Remove source retention settings that are no longer used (cleanup config)
    • Removed legacy cleanup source retention settings from source settings tab parameters
  • Standardize s_row_id to Long datatype
    • All s_row_id fields will now be long type to ensure consistency across Source types
  • Deleting all inputs OR the only input for a source with the single "Delete Input" button breaks stuff
    • Deleting all inputs on a source now triggers the "Delete Source Data" process, which is much faster.
  • Timestamp/Sequence, execute recalculate after input delete
    • When performing an input delete for a Timeseries or Sequence source, all enrichment rules will be recalculated to avoid an edge case where values would become incorrect.
  • Remove connection parameters api call in checkForFile function in agent for file system sources
    • File watcher location checking will now make fewer calls and generate significantly fewer AWS temp credentials.
  • Build lineage UI
    • Lineage View
      • Access points for new sessions are Sources, Outputs, Output Sources, Raw Schema, Data Viewer, Relations, Rules, Output Mapping Columns
      • Main menu opens current lineage session
      • Zoom controls with either scroll wheel or pan zoom slider located on left-hand side
      • Legend can be toggled with the map icon in the upper right corner
    • Nodes
      • Selected node highlighted
      • Navigate dataflow by right-clicking node and selecting flow direction (origin for left, destination for right)
      • Graphical flow goes left to right
      • Export lineage graph using export
      • Expand All and Collapse All parent node buttons
      • Remove nodes by right-clicking and select remove
    • Edges
      • Dataflow arrows between nodes go left to right
      • Show relation dataflow by left-clicking thick magenta colored edge
    • Relation View
      • Ordered by first relation from the list of returned relations
      • Shows relation and source names
      • Can open a new tab to view Source or Relation
  • Lineage entry points - UI
    • Added lineage entry point (Lineage icon) on:
      • Source list
      • Source → Raw Schema
      • Source → Relations List
      • Source → Rules
      • Outputs list
      • Output → Mapping under triple dot menu on channel
      • Output column triple dot menu
  • Add "Max Parallel Clusters" parameter to cluster configuration
    • New “Max Parallel Clusters” parameter limits # of concurrent job runs for each cluster configuration
      • queued jobs will stay in process queue and wait until active job runs free up
      • more efficient as it minimizes redundant starts
      • downside (when set too low) - forcing everything to go through a few job runs and losing speed
      • default value 1000 to mimic current state (unconstrained)
      • may be used to help manage cloud compute limits, e.g. max concurrent # of nodes
  • Change Save Rule Template to jump to Linked Sources page
    • After saving a Rule Template, users are now directed to the Linked Sources tab
  • Update Timestamp/Sequence Refresh to read data from hub tables instead of enriched files when effective ranges change
    • Fixed an edge case where enrichment values would be lost upon ingesting new data when dealing with Timeseries or Sequence data that overlaps between inputs.
  • Remove "Reset All Enrichment" everywhere
    • Previously there was an option to "Reset All Enrichment" in the top level triple dot menu. This has now been removed. Resetting enrichments can be done individually per input by selecting it from the respective triple dot menu.
  • Refactor Agent scheduling
    • Agent, scheduling, and Ingestions have been refactored to reduce load on the API and improve stability.
  • Terraform: Add parameter for standard or premium databricks in Azure
    • "databricksWorkspaceSku" variable has been added as optional for Terraform workspaces. Variable is defaulted to "standard" but can be set to "premium"
  • Inability to right click a related source name to a new tab on Relations page
    • Related source names are now right-clickable on the Relations tab
  • Delete raw metadata column(s) by deleting an input that contains that attribute
    • Added detailed tracking of raw attributes for each input in input.raw_attribute_ids column
    • Added 2 flags to raw attribute tab:
      • Targets Flag indicates whether each raw attribute is referenced in rules/mappings
      • Inputs Flag indicates whether each raw attribute exists in at least 1 input
    • When at least one raw attribute exists with both flags = false, "Delete Orphan Attributes" button appears on raw attribute tab. It deletes all orphan attributes from the raw schema.
    • Deleting input also triggers automatic check and deletion of orphan attributes.
    • Added new option "View Raw Schema" to the Input tab, under the triple dot menu. It opens the raw attribute tab filtered to attributes included in the current input
