Complex Data Types

DataForge supports complex data types such as arrays and structs, which are common in semi-structured datasets (JSON), API results, and streaming data. These data types are treated the same as all other data types during standard processing: complex types can be used alongside scalar types in rules, relations, templates, output mappings, and filters. Additional parameters and options are available to make them easier to manage.

Source Settings

Additional parameters are available for use with complex data types.

Select List

The Select List parameter is found in Source Ingestion Parameters. It provides the ability to further refine the raw attributes that are stored for an input by applying the specified select list to ingested data, allowing you to expand or hide columns that exist in the source connection dataset. This parameter does not apply to Agent ingestions. The default value is "*".

Example Select List    Description
*                      Ingest every attribute.
*, value.*             Ingest every attribute. Create new columns for every attribute within the struct field named "value".
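
Conceptually, a Select List behaves like a Spark select expression applied to the raw data before it is stored. The PySpark sketch below illustrates the "*, value.*" example; the DataFrame, column names, and sample values are hypothetical and only show the equivalent Spark behavior, not DataForge's internal implementation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw input: one scalar column plus a struct column named "value".
raw = spark.createDataFrame(
    [(1, (9.99, "USD"))],
    "id INT, value STRUCT<amount: DOUBLE, currency: STRING>",
)

# Select List "*"          -> ingest every attribute as-is.
# Select List "*, value.*" -> ingest every attribute and also create a top-level
#                             column for each key inside the "value" struct.
expanded = raw.selectExpr("*", "value.*")
expanded.printSchema()  # columns: id, value, amount, currency
```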

Schema Evolution

The Schema Evolution parameter is found in Source Parsing Parameters. It is a drop-down containing preset combinations of the options below. The default option is Add, Remove, Upcast.

Option    Description
Lock      Any schema change generates an ingestion or parsing error.
Add       Allows adding new columns.
Remove    Allows missing (removed) columns in inputs. Missing column values are substituted with nulls.
Upcast    Allows retaining the same column alias when a column's data type changes to a type that can be safely converted to the existing type (upcast). Example: column_a is a string in input 1 and an int in input 2. With Upcast enabled, the int value in input 2 is converted to string, retaining the original column alias column_a in the hub table. This avoids creating new column versions with _2, _3, ... suffixes when compatible data type changes occur in the source data (a conceptual sketch follows this table).
Clone     Allows creation of a new column aliased <original column>_2 when a column's data type changes and is not compatible with the existing column.
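
The PySpark sketch below shows, under hypothetical column names and sample values, roughly what an upcast amounts to when two inputs with compatible types are merged; the actual merge happens inside DataForge's hub table processing.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Input 1: column_a arrives as a string.
input_1 = spark.createDataFrame([("abc",)], "column_a STRING")

# Input 2: the same column arrives as an int, which can be safely upcast to string.
input_2 = spark.createDataFrame([(42,)], "column_a INT")

# With Upcast enabled, the new value is converted to the existing type, so both
# inputs keep the single alias column_a instead of forking off column_a_2.
hub = input_1.unionByName(
    input_2.withColumn("column_a", F.col("column_a").cast("string"))
)
hub.printSchema()  # column_a: string
```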


Rules

Struct and array data types can be used in rules and templates. Structs are navigated within a rule using dot notation, such as struct.key. Arrays are navigated using a zero-based index, such as array[0]. The expression editor's intelli-sense helps you navigate nested complex types by listing the child keys at each nesting level. All Spark built-in functions can be used in rules. For working with complex nested arrays, DataForge recommends using Databricks higher-order functions.
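
The PySpark sketch below mirrors the navigation syntax described above using hypothetical column names; in DataForge you would type only the expression itself (for example name.first or scores[0]) into the rule editor.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source with one struct column and one array column.
df = spark.createDataFrame(
    [(("Jane", "Doe"), [10, 20, 30])],
    "name STRUCT<first: STRING, last: STRING>, scores ARRAY<INT>",
)

df.select(
    F.expr("name.first").alias("first_name"),   # struct navigation: struct.key
    F.expr("scores[0]").alias("first_score"),   # array navigation: zero-based index
    F.expr("transform(scores, x -> x * 2)").alias("doubled"),            # higher-order function
    F.expr("aggregate(scores, 0, (acc, x) -> acc + x)").alias("total"),  # higher-order function
).show()
```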


Structs and arrays also display a metadata icon that provides further schema details.


Output Mappings

Dot notation can be used to navigate struct fields directly in Output column mappings.

Output column mappings for struct data types include a button to expand mapped struct attributes. Expanding creates new columns named <original struct column name>_<key> and maps the corresponding struct keys into the new columns.
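
The PySpark sketch below illustrates, with a hypothetical "address" struct, the naming convention the expand button applies: each struct key becomes a new column named <struct>_<key>.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical mapped struct column "address".
df = spark.createDataFrame(
    [(("Springfield", "12345"),)],
    "address STRUCT<city: STRING, zip: STRING>",
)

# Expand the struct: one new column per key, named <original struct column name>_<key>.
expanded = df.select(
    *[
        F.col(f"address.{field.name}").alias(f"address_{field.name}")
        for field in df.schema["address"].dataType.fields
    ]
)
expanded.printSchema()  # columns: address_city, address_zip
```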

Expand option on output column mapping

Expanded column mapping to multiple columns

All struct data types mapped in a channel can be expanded by clicking Expand Structs in the channel options menu. 

Output channel struct options

This option can be clicked repeatedly to expand nested structs one additional level at a time.
