Legacy Pricing: Processing Units Overview

This section provides an overview of the legacy Processing Unit (PU) pricing model and how PUs are calculated to track platform usage for licensing. For current pricing, visit www.dataforgelabs.com/pricing.

Processing Units Introduction 

In order to best align with the licensing and cost model of the underlying services leveraged by DataForge (e.g. Databricks, AWS/Azure, etc.), the product also adheres to a usage-based cost model that seeks to accurately align the value and efficiency gained with the price paid for the services.
 
Cloud providers such as Azure, AWS, Databricks, and Snowflake charge customers based on the CPU and/or memory of the underlying infrastructure multiplied by the duration those resources are allocated to the customer's use. This aligns well with customer value, as their core services focus on simplifying the management and operations of those infrastructure resources, i.e. the IT Ops work of keeping those resources running reliably.
 
Although DataForge also provides services to help customers manage their infrastructure resources, its primary value is derived from data engineering development and operations workflow efficiency rather than from improving or optimizing the infrastructure the platform runs on.
 
Because accurately tracking engineering time saved versus an alternative data architecture, methodology, or toolset is not feasible, a surrogate measurement must be used, built from as many input data elements as possible, to ensure clear alignment between product pricing and value.
 
This measurement of data engineering efficiency for DataForge is called a Processing Unit (PU). It leverages the highly detailed metadata on user configurations and process tracking to estimate the level of automation and the number of processes managed by DataForge.
 

Components of Calculation

All components of the PU calculation are based on DataForge processes.
 
A DataForge process is an atomic unit of operation or calculation within the DataForge meta-structure; processes are generated automatically by the system based on the configurations and logic defined by end users.
 
Please see the Data Processing Engine section for more information on processes.
 
The four major components which influence PU consumption in order of impact are:
  1. Flat base weight of each process, with different base weights by process type
  2. Added flat weight for Refresh and Output process types by Source Refresh Type
  3. Added calculated complexity weight for processes leveraging DataForge Rules or Output Mappings
  4. Logarithmic volume weight for Change Data Capture and Refresh process types
PUs are only charged for successful processes. Failed processes have zero PU cost.
The total PU for a single process is calculated as follows:
 
(<Base Weight> of Process Type) +
(IF Process Type IN (Refresh, Output) THEN <Refresh Type Weight> ELSE 0) +
(IF Process Type IN (Enrichment, Recalculate) THEN <Rules Weight> ELSE 0) +
(IF Process Type IN (Output) THEN <Output Mapping Weight> ELSE 0) +
(IF Process Type = Capture Data Changes THEN <Input Volume Weight> ELSE 0) +
(IF Process Type = Refresh THEN <Hub Table Volume Weight> ELSE 0)
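 
To make the arithmetic concrete, here is a minimal Python sketch of the formula above. The weight lookups are drawn from the tables later in this article, and all function and argument names are illustrative only, not the actual DataForge implementation.
 
# Minimal sketch of the per-process PU formula above.
# Process type names follow the Base Weight table below; all names here are illustrative.

BASE_WEIGHTS = {
    "capture_data_changes": 2,
    "enrichment": 1,
    "attribute_recalculation": 1,
    "refresh": 1,
    "output": 1,
}
REFRESH_TYPE_WEIGHTS = {"Key": 1, "Timestamp": 0.5, "Sequence": 0.5, "Full": 0.2, "None": 0.1}

def process_pu(process_type, refresh_type=None, rules_weight=0.0,
               output_mapping_weight=0.0, input_volume_weight=0.0,
               hub_table_volume_weight=0.0):
    pu = BASE_WEIGHTS[process_type]                       # 1. flat base weight
    if process_type in ("refresh", "output"):
        pu += REFRESH_TYPE_WEIGHTS[refresh_type]          # 2. refresh type weight
    if process_type in ("enrichment", "attribute_recalculation"):
        pu += rules_weight                                # 3. rules complexity weight
    if process_type == "output":
        pu += output_mapping_weight                       # 3. output mapping weight
    if process_type == "capture_data_changes":
        pu += input_volume_weight                         # 4. input volume weight
    if process_type == "refresh":
        pu += hub_table_volume_weight                     # 4. hub table volume weight
    return pu

# Example: a refresh of a Keyed source with a ~100 MB hub table
# 1 (base) + 1 (Key) + 1.28 (volume) = 3.28 PUs
print(process_pu("refresh", refresh_type="Key", hub_table_volume_weight=1.28))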
 

Controlling PU Consumption

In practical terms, users will see increased PU usage when they:
  1. Increase the number of runs/refreshes the system processes monthly
  2. Add more Sources, Rules, Outputs, and Output Mappings
  3. Use primarily the Key Refresh Type
  4. Switch Rules from Snapshot to Keep Current
  5. Pull all data from source systems rather than ingesting incrementally
Conversely, users can decrease their PU usage by:
  1. Moving tables which change slowly to a less frequent refresh cadence
  2. Leveraging the Full, None, Sequence, or Timestamp Refresh Type for small-volume tables or tables which can support these alternative Refresh Types
  3. Leveraging Hard Dependencies rather than Keep Current to avoid Recalculation processes
  4. Using dynamically injectable tokens in the Source Query configuration to pull only incremental data, rather than the full table, for each Input
  5. Disabling optional processes, such as Data Profiling, for deployed/finalized Sources where they are not actively used outside of initial development workflows

Base Weight

The following table provides an overview of the base weights for each process type:
 
Process Name                               Weight
capture_data_changes                       2
manual_reset_all_capture_data_changes      2
manual_reset_all_processing_from_cdc       20
manual_reset_capture_data_changes          2
custom_ingestion                           5
custom_parse                               5
custom_post_output                         5
manual_reset_custom_parse                  5
input_delete                               3
enrichment                                 1
manual_reset_all_enrichment                1
manual_reset_enrichment                    1
import                                     10
ingestion                                  1
loopback_ingestion                         1
sparky_ingestion                           1
cleanup                                    0.5
meta_monitor_refresh                       0.5
manual_reset_all_output                    1
manual_reset_output                        1
output                                     1
manual_reset_parse                         2
manual_reset_sparky_parse                  2
parse                                      2
sparky_parse                               2
data_profile                               1
attribute_recalculation                    1
manual_attribute_recalculation             1
refresh                                    1
 
These weights will remain largely static over time as long as the definition/scope of the process does not change.
 
An example of this type of change is a 2.5.0 feature enhancement: the rollback process was eliminated, and manual reset of change data capture for Keyed sources was completely refactored after feedback about long run times.
 
The same logical operation that used to generate 2-4 processes per Input to reset CDC for an entire source (resulting in potentially hundreds or even thousands of processes) now generates just one intelligent, processing-intensive process.
 
As a result, this large, manual, and complex operation (manual_reset_all_processing_from_cdc) was given a very high base cost (20) to reflect the change. Despite this high base cost, the new approach universally results in an overall reduction in PU consumption and operational complexity versus the old version.
 

Refresh Type Weight

DataForge generates different workflows, process chains, and process sub-types based on the Refresh Type configuration.
 
Specifically, Sources configured as Keyed can generate the most complex and challenging workflows. To account for this, additional PUs are generated based on which style of refresh is configured and are allocated to both the Refresh and Output processes.
 
These weights are applied to every Refresh and Output operation:
 
Refresh Type   Weight
Key            1
Timestamp      0.5
Sequence       0.5
Full           0.2
None           0.1
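 
Because the Refresh Type weight is added to both the Refresh and the Output process, a Keyed source picks up noticeably more PUs per run than the alternatives. A quick back-of-the-envelope illustration, assuming one Refresh and one Output per run:
 
# Extra PUs added per run, assuming one Refresh and one Output process each run.
REFRESH_TYPE_WEIGHTS = {"Key": 1, "Timestamp": 0.5, "Sequence": 0.5, "Full": 0.2, "None": 0.1}

for refresh_type, weight in REFRESH_TYPE_WEIGHTS.items():
    print(f"{refresh_type:<10} adds {2 * weight:.1f} extra PUs per Refresh/Output pair")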
 

Rules Weight

For processes which leverage configured DataForge rules, PUs are calculated based on the number and complexity of the rules compiled and executed as part of the process.
The following weights are applied to all Enrichment and Attribute Recalculation processes:
  • Rule with compiled length <= 250 characters = +0.03 weight
  • Rule with compiled length > 250 characters = +0.08 weight
    • See compiled expression under meta.enrichment -> expression_parsed
    • Compiled expressions are used to normalize against any non-primary relation traversal syntax differences and source name lengths.
    • Compiled expressions ensure users are not punished for using descriptive object names or long-form syntax for their business logic
  • Expressions that include an aggregate function over a MANY relation traversal = +0.05 weight
  • Expressions that include a window function = +0.05 weight
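 
As a hedged sketch, the per-rule weights above can be tallied as follows. The rule attributes (compiled expression length, aggregation over a MANY relation, window functions) come from the bullets above; the function and its inputs are hypothetical, not part of the product.
 
# Hypothetical tally of rule weights for an Enrichment or Attribute Recalculation process.
def rule_weight(compiled_length, aggregates_over_many=False, uses_window_function=False):
    weight = 0.03 if compiled_length <= 250 else 0.08   # compiled expression length tier
    if aggregates_over_many:
        weight += 0.05                                   # aggregate over a MANY relation
    if uses_window_function:
        weight += 0.05                                   # window function
    return weight

# Example: 40 short rules plus 5 long rules that each aggregate over a MANY relation
rules = [rule_weight(120)] * 40 + [rule_weight(600, aggregates_over_many=True)] * 5
print(round(sum(rules), 2))   # 40*0.03 + 5*(0.08 + 0.05) = 1.85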

Output Mapping Weight

Similar to rules, PUs are calculated based on the number and complexity of the mappings configured and executed as part of the process.
The following weights are applied to all Output processes:
  • Base mapping (e.g. [This].mycolumn) = +0.01 weight
  • Mappings including a traversal through a relation = +0.03 weight
  • Aggregate Function Mappings = +0.05 weight
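 
The same tallying idea applies to Output Mappings. A small sketch with hypothetical mapping counts:
 
# Hypothetical Output Mapping weight for a single Output process.
base_mappings = 30        # e.g. [This].mycolumn           -> +0.01 each
relation_mappings = 10    # mappings traversing a relation -> +0.03 each
aggregate_mappings = 5    # aggregate function mappings    -> +0.05 each

mapping_weight = (base_mappings * 0.01
                  + relation_mappings * 0.03
                  + aggregate_mappings * 0.05)
print(mapping_weight)     # 0.3 + 0.3 + 0.25 = 0.85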

Input and Hub Table Volume Weight

Because cluster tuning and performance optimization become a major focus and feature set of DataForge once volumes move into the 100 MB+ range per source, DataForge also allocates PUs based on the volume of data processed.
 
Because DataForge does not process the data itself (Databricks processes the data), linear weight scaling does not align with business value or feature usage.
 
After analyzing numerous models, a logarithmic curve was chosen to correlate volume to feature value: the volume weight doubles for each order of magnitude of data.
 
To illustrate how this formula materializes, the following table shows sample weights across various data volumes:
 
Data Volume   Weight
1 KB          0.04
1 MB          0.32
10 MB         0.64
100 MB        1.28
1 GB          2.56
10 GB         5.12

As you can see, the weight for the PU calculation doubles for each order of magnitude of data.

This results in a negligible PU weight at lower volumes, and an increasing, but not exponential weight as the processing moves into the Big Data territory where infrastructure optimization becomes a critical operational consideration and feature.
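 
The exact volume formula is not reproduced in this legacy article, but the sample table is consistent with a curve of roughly weight = 0.04 x 2^(log10(volume in KB)), i.e. doubling per order of magnitude, anchored at 0.04 for 1 KB. The sketch below reproduces the sample values under that assumption and should be treated as an approximation, not the official formula.
 
import math

# Approximation of the volume weight curve, inferred from the sample table above:
# anchored at 0.04 for 1 KB and doubling for each order of magnitude of data.
# This is an illustration, not DataForge's official formula.
def volume_weight(volume_kb):
    return 0.04 * 2 ** math.log10(volume_kb)

for label, kb in [("1 KB", 1), ("1 MB", 1_000), ("10 MB", 10_000),
                  ("100 MB", 100_000), ("1 GB", 1_000_000), ("10 GB", 10_000_000)]:
    print(f"{label:>7}: {volume_weight(kb):.2f}")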
 
This weight is applied to each Input File Size and attached to the Change Data Capture process.
This weight is also applied to the Hub Table Size and attached to the Refresh process.
 
While Hub Table Size is often difficult to optimize, Input File Size can often be reduced by applying upstream filtering and limiting the data ingested into DataForge with each Input generated.
 

Single Day Duplicate Source-Process Discount

In some scenarios, data must be refreshed multiple times per day. Additionally, with complex or circular Keep Current rules, a single new Input can cause a cascade of refreshes to be generated to guarantee data accuracy.

To account for these common scenarios, a count of all Refresh and Attribute Recalculation processes is summarized per source, per day, and then used in a discount formula to reduce the total PUs for each of those sources.

 
Here is the formula:
[discount formula image]
 

Example:

This example shows the set of processes generated for a single source with one refresh daily and a Keep Current rule which triggers an attribute_recalculation process:
Source Name   Process Type              Standard PU
SourceA       ingest                    1
SourceA       capture_data_changes      2.5
SourceA       enrichment                3.5
SourceA       refresh                   2
SourceA       output                    4
SourceA       attribute_recalculation   2
SourceA       output                    4
 
Calculating the variables of the formula:
SUM(PU_source) = 19 PUs for the day
Run Count_refresh = 1
Run Count_attribute_recalculation = 1
Plugging these variables into the formula results in an adjusted PU of 14.6, reduced from 19, or a discount of roughly 23%.
Said differently, the second refresh is discounted by approximately 47%.
 
This discount increases as you perform more refreshes per source, per day.
 

Common Scenarios:

  • There is no discount applied for Sources refreshed less than once per day
  • Refreshing a source 2 times per day costs ~1.5x once per day
    • ~25% discounted
  • Refreshing a source 10 times per day (once per hour during business hours) costs ~5x once per day
    • ~50% discounted
  • Refreshing a source 24 times per day costs ~10x once per day
    • ~58% discounted
  • Refreshing a source 96 times per day (once every 15 minutes) costs ~32x once per day
    • ~66% discounted
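
These scenarios can be restated as effective cost multipliers. The snippet below simply recomputes the approximate per-refresh discounts from the multipliers listed above and checks the worked example's 19 to 14.6 adjustment; it does not reproduce the underlying discount formula.
 
# Restating the published scenarios: refreshes per day -> approximate effective multiplier.
scenarios = {2: 1.5, 10: 5, 24: 10, 96: 32}

for runs_per_day, multiplier in scenarios.items():
    discount = 1 - multiplier / runs_per_day
    print(f"{runs_per_day:>3} refreshes/day costs ~{multiplier}x once/day "
          f"(~{discount:.1%} average discount per refresh)")

# Worked example above: 19 standard PUs adjusted down to 14.6
print(f"Example discount: {1 - 14.6 / 19:.0%}")   # ~23%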

This discount structure was introduced to provide significant discounts up to 10 times per day refresh cadences, with smaller additional discounts beyond.

Refresh cadences above 10 times per day are supported but often require substantial additional operational overhead, resulting in additional product support and platform complexity needs.

NOTE: These discounts are not calculated as part of the process_history.ipu_usage field. Clients seeking to analyze usage by querying this field may need to account for this discount manually.

 


Summary

While intimidating at first, the PU weighting system is necessarily complex to capture all the various ways Data Engineers may leverage the features within DataForge, while also avoiding overly burdensome costs when specific corner cases or scale-out solutions are implemented.
 
Any changes to the weighting system will be tested thoroughly to avoid any changes to expected monthly costs for existing customers, and will primarily be implemented to reduce PU costs for existing inefficient workflows or processes, or as part of a net-new feature release that will not impact existing Source configurations. 
