DataForge Cloud 8.2 Release Features

DataForge Cloud version 8.2 is here, bringing GitHub project integration, role-based access control, a dependency recommendation tool, user experience improvements, a new Extend schema evolution option, and XML file parsing support.

Table of contents

  1. Project integration with GitHub
  2. Role-based access control
  3. Dependency recommendation tool
  4. User experience improvements
  5. Schema evolution option - Extend
  6. XML file parse support

Project integration with GitHub

Integrate your projects with GitHub and use DevOps best practices to manage your CI/CD process. DataForge supports connecting projects to a repository, branch, and path of your choice, giving you flexibility in managing your development process while providing the tools to make promotions reliable and easy. Each workspace is tied to a specific repository but can be reset as needed.

Use the push and pull features in your project settings to automatically send your latest changes to GitHub for review and pull request merging. Options for regular and force pushes and pulls cover both everyday syncing and recovery scenarios. User information is included in each commit message sent to GitHub, so changes can be traced back to individual DataForge users.

Track the progress of your pushes and pulls directly in the Imports tab of your project to confirm success or dig into the details of any failures.

Role-based access control

DataForge now supports role-based access control to fine-tune how users interact with your data models. Set workspace-wide roles for each user in your User settings and refine project-specific access within each project to provide a customized experience.

Workspace-level permissions are as follows:

Admin
  • Manage users
  • Manage roles
  • Full access to all projects

Power User
  • Manage connections, schedules, clusters, processing, cleanup configurations, and system configurations
  • Create new projects
  • Automatically added to the Editor role in each existing and new project; access can be removed manually by a project owner or admin
  • Project roles reset to Read-Only when the Power User role is removed

User
  • No workspace-level access; access is granted only within specific projects

In addition to setting the workspace access, Admins can see a list of all project-level access on each user's page.

Project-level permissions are as follows:

Owner
  • Delete and rename projects
  • Add or remove users on a project and change project access levels

Editor
  • Make any changes to the objects contained in the project
  • Lock and unlock the project
  • Import

Operator
  • Pull data and reset or retry processes
  • Enable or disable ingestion for the project
  • No configuration changes allowed

Read-Only
  • Read-only access and export

Project owners can see a holistic picture of who has access to the project within the Project Users page and add or remove access as required.
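
To make the model concrete, here is a minimal sketch of how workspace roles and project access levels could combine into a user's effective project access. This is an illustration based on the rules above, not DataForge's actual implementation; all names are hypothetical.

    # Illustrative sketch of the access model described above; not
    # DataForge's actual implementation.

    def effective_project_access(workspace_role: str, project_level: str | None) -> str | None:
        """Resolve a user's effective access to a single project."""
        if workspace_role == "Admin":
            return "Owner"  # Admins have full access to all projects
        if project_level is not None:
            return project_level  # an explicit project grant applies as-is
        if workspace_role == "Power User":
            return "Editor"  # Power Users are auto-added as Editors
        return None  # plain Users have no access without a project grant

    # A Power User with no explicit grant gets Editor access:
    print(effective_project_access("Power User", None))  # Editor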

Dependency recommendation tool

Talos is getting even smarter! You can now ask Talos to help you identify hard dependencies that should be in place so your data is enriched in the correct order every time. A link shows you a list of recommended hard dependencies that ensure rules and output mappings use the latest data when processes run.

Quickly add the hard dependencies from this page to your sources by using the checkboxes to select the dependencies you want. After the dependencies are added, DataForge will wait to run the Capture Data Change and Enrichment processes until the dependent source data is refreshed.

In addition to recommending dependencies, the list provides information about circular paths, where rules from two different sources point at each other. In these situations, you can choose between adding hard dependencies and keeping your current rules, whichever satisfies your needs.
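
To illustrate what a circular path is, here is a small sketch of cycle detection over a source-dependency graph. DataForge's actual detection logic is not published; this is illustrative only, and the source names are hypothetical.

    def find_cycle(deps: dict[str, list[str]]) -> list[str] | None:
        """Return one dependency cycle as a list of sources, or None."""
        visiting: set[str] = set()
        visited: set[str] = set()

        def dfs(node: str, path: list[str]) -> list[str] | None:
            visiting.add(node)
            for nxt in deps.get(node, []):
                if nxt in visiting:  # back edge: we looped onto the current path
                    return path[path.index(nxt):] + [nxt]
                if nxt not in visited:
                    found = dfs(nxt, path + [nxt])
                    if found:
                        return found
            visiting.discard(node)
            visited.add(node)
            return None

        for source in deps:
            if source not in visited:
                cycle = dfs(source, [source])
                if cycle:
                    return cycle
        return None

    # Two sources with rules pointing at each other form a circular path:
    print(find_cycle({"orders": ["customers"], "customers": ["orders"]}))
    # ['orders', 'customers', 'orders']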

[GIF: recommending dependencies]

User experience improvements

DataForge has released several user experience improvements to reduce complexity and make users more efficient.

Output mapping automap related columns

Adding a new source mapping to an output will automatically map any related fields in the new source channel to the existing output columns, based on other channel mappings. For example, you may have an existing source channel (Source A) with a mapped field pointing to a related source (Source B). When you add a new source channel (Source C) that also relates to Source B, all fields from the existing mappings that point to Source B will be copied to the new channel.
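
As a rough illustration of the idea (the data structures here are hypothetical, not DataForge's internals), the automap step can be thought of as copying existing output-column mappings whose fields resolve through the shared related source:

    # Hypothetical sketch: when a new channel (Source C) relates to Source B,
    # copy existing output-column mappings that already resolve through Source B.

    existing_mappings = [
        # (output column, source the mapped field points to)
        ("customer_name", "Source B"),
        ("order_total", "Source A"),
    ]

    def automap(new_channel_relations: set[str]) -> dict[str, str]:
        """Copy mappings whose fields come from a source the new channel relates to."""
        return {
            column: source
            for column, source in existing_mappings
            if source in new_channel_relations
        }

    # Source C relates to Source B, so the customer_name mapping is copied:
    print(automap({"Source B"}))  # {'customer_name': 'Source B'}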

[GIF: automap related fields in channels]

Cancel job run from process

Easily cancel job runs directly from a process within DataForge rather than opening the job run in Databricks and canceling it there.

[GIF: cancelling job run]

Relation traversal drop-down search

There may be a long list of relations to select from when building rules pointing to other sources. You can quickly search for specific relations using the search tool when editing relation paths. 
[GIF: relation traversal search]

Recalculate or Reset Output for all channels at once

With two clicks, you can recalculate all sources mapped to channels on an output or reset output for all channels simultaneously without resetting each channel individually.

[GIF: reset all output channels]

Clean output tables for deleted channels

Output processes now automatically clean up published tables by deleting orphaned output channel data that is no longer mapped to the output. Previously, when source channels were deleted, their data remained in the output table; now it is removed. This protects users who accidentally delete channels and re-add them later from ending up with duplicate sets of the same data. The delete queries include additional clauses that identify and delete records in your output tables where the DataForge-managed column, s_output_source_id, is not null and contains a channel ID that is no longer mapped.
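
The cleanup roughly corresponds to a delete of the following shape. This is a sketch with a hypothetical table name and channel IDs, run against a Spark session as on Databricks; the exact queries DataForge generates may differ.

    # Sketch of the orphaned-channel cleanup described above, expressed as a
    # Delta/Spark SQL delete. Assumes an active SparkSession named `spark`
    # (as in a Databricks notebook); table name and channel IDs are hypothetical.

    mapped_channel_ids = [101, 102]  # channel IDs still mapped to the output
    id_list = ", ".join(str(i) for i in mapped_channel_ids)

    spark.sql(f"""
        DELETE FROM analytics.published_output
        WHERE s_output_source_id IS NOT NULL
          AND s_output_source_id NOT IN ({id_list})
    """)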

Filter outputs

Quickly narrow your search results by filtering outputs using various filter options, similar to filter options for sources.

[GIF: filtering outputs]

Schema evolution option - Extend

Schema evolution is even easier to manage in DataForge with the Extend, Extend Complex, and Extend All options.

Extend

The new Extend feature allows you to convert data types of existing hub table attributes to match the data type of an incoming attribute where possible. When an incoming data type is different from the current hub table type, and the current hub table type can be upcast into the incoming type, DataForge will check all downstream rules, relations, and output mappings for the raw attribute being considered for extension. Once all checks have passed, the hub table data type will be extended, and data types will be updated for all downstream rules. Below is an example:

                               Column Name   Data Type
  Current hub table data type  amount        Integer
  Incoming data type           amount        Decimal
  New hub table data type      amount        Decimal
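
A minimal sketch of the extension decision described above, assuming a simple upcast-compatibility table (the table and the downstream-check flag are illustrative; DataForge's real checks cover rules, relations, and output mappings as described):

    # Illustrative sketch of the Extend decision; the upcast table and the
    # downstream-check flag are assumptions, not DataForge internals.

    # Types each hub table type can safely be upcast (extended) to:
    SAFE_UPCASTS = {
        "Integer": {"Long", "Decimal", "Double", "String"},
        "Long": {"Decimal", "Double", "String"},
        "Decimal": {"Double", "String"},
    }

    def can_extend(hub_type: str, incoming_type: str, downstream_checks_pass: bool) -> bool:
        """Extend only when the hub type upcasts safely to the incoming type
        and all downstream rules, relations, and mappings check out."""
        if incoming_type == hub_type:
            return False  # types already match; nothing to extend
        return incoming_type in SAFE_UPCASTS.get(hub_type, set()) and downstream_checks_pass

    # The example above: Integer can be extended to Decimal
    print(can_extend("Integer", "Decimal", downstream_checks_pass=True))  # True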

Extend Complex

Extend Complex applies this concept to complex data types such as structs and arrays: data types found inside the complex type are converted to match incoming data types with the same column name. With the Extend Complex option, complex types can adapt to new incoming sub-fields, while simple types are not allowed to extend.

                               Column Name   Data Type
  Current hub table data type  parameters    Struct{a int, b string}
  Incoming data type           parameters    Struct{a int, b string, c boolean}
  New hub table data type      parameters    Struct{a int, b string, c boolean}

Extend All

Extend All will allow DataForge to extend any column in the hub table, whether it is a complex type or not.

The updated list of Schema Evolution options is below:

  • Lock: Any schema change will generate an ingestion or parsing error.
  • Add: Allows adding new columns.
  • Remove: Allows missing (removed) columns in inputs; missing column values will be substituted with nulls.
  • Upcast: Allows retaining the same column alias when a column's data type changes to a type that can be safely converted to the existing type (upcast). For example, if column_a is a string in input 1 and an int in input 2, enabling Upcast converts the int values in input 2 to string, retaining the original column alias column_a in the hub table. This avoids creating new column versions with _2, _3... suffixes when compatible data type changes happen in the source data.
  • Extend All, Extend Complex: Allows converting a hub table column's data type to match the data type of the same column from the latest set of refreshed data.
  • Clone: Allows creation of a new column aliased <original column>_2 when a column's data type changes and is not compatible with the existing column.

 

XML file parse support

DataForge now supports XML as a file type for parsing in source settings. Users are required to enter the RowTag identifier, which tells the parser which element marks each record. Additional XML parse settings can be adjusted in the parsing parameters.
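
For context, the RowTag setting mirrors how Spark's XML reader (the spark-xml / Databricks XML integration) identifies the repeating element that becomes a row. A hedged sketch, assuming a Spark session with XML support and a hypothetical file path:

    # Illustration of the rowTag concept using Spark's XML reader. Assumes an
    # active SparkSession `spark` with XML support (e.g., a Databricks cluster);
    # the path and tag are hypothetical. DataForge configures this for you via
    # the RowTag parse setting.

    df = (
        spark.read.format("xml")
        .option("rowTag", "order")  # each <order>...</order> element becomes a row
        .load("/mnt/raw/orders.xml")
    )
    df.printSchema()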

 
