(Databricks only) Production Workload Compute Configuration Recommendations

This article will guide users on best practices for setting up Compute Configurations in a production DataForge environment.  It is recommended to read through the Spark and Databricks Overview first to understand how the DataForge, Databricks, and Cloud Provider services work together for processing. 

Overview

There are many ways to set up compute configurations in DataForge. These configurations can be tweaked and tuned to match the performance needed to meet processing SLAs.  An example of how to use these compute configurations to improve user experience is by tweaking the idle settings to optimize compute while in development to reduce wait times.  However, in Production environments, compute configurations should be tuned for both stability and cost optimization.  As source data grows, this tuning activity should be periodically revisited based on recent performance.

Compute Configuration Types

Job: Job compute uses the configurations and settings in the Compute Configuration within DataForge to know when to start up, how much compute is needed, what to process, and when to shut down.  These compute configurations start and shut down entirely based on the DataForge configuration settings.  When a new compute is needed, there is typically a 3-5 minute wait time for the compute to be provisioned before it begins processing.  Compute and instances are shut down after processing is complete and the compute waited the Max Idle Time of the Compute Configuration.

Job from Pool:  Job from Pool compute uses the configurations and settings from both the Compute Configuration in DataForge and the Pool configuration in Databricks.  When a new compute is needed, DataForge sends the request for compute to Databricks.  If the Pool has an idle instance ready, then it is used as a part of the compute requested.  This can save 1-2 minutes of startup time if instances are ready.  If not, DataForge waits for the compute to be provisioned like a job compute.  Instances are shut down only after the Pool in Databricks tells them to.

Compute Type Guidance

In general, users should always aim to use the Job compute type and "Spot with Fallback" setting available for production workloads.  This minimizes cloud costs by using spot instances when available, and  switches to on-demand instances when spot instances are unavailable.  This fallback setting is only available with Job-type compute configurations and not Job from Pool configurations. This is a Databricks limitation. It is set in the Cloud Configuration section of the Compute Configuration Parameters. DataForge recommends to set the First On Demand parameter to 1 so the Driver node of the compute is always on-demand.  In general, if spot worker instances are lost, the compute can recover and complete the job.  However, if the driver node is spot and is reclaimed, the compute is shut down and the process is marked failed. Skip to the Compute Configuration Example below to see a screenshot of a Spot with Fallback compute configuration.

Example AWS Settings on Cloud Configuration

Pools can be a helpful tool because they ensure the Cloud Provider keeps compute instances warm and ready to go when DataForge requests a new compute with those instance types.  This typically saves anywhere from one to two minutes of compute launch wait times as needed.  However, Pools have two major downsides - they can only be set to either fully Spot or fully On-Demand and there are additional cloud costs while instances are sitting idle and warm.  Instances created from a Pool drive up the cost of Infrastructure as the Pool waits for the amount of time set in the Instance Idle Auto Termination setting to shut down the instances if they are not used. 

For these reasons, it is highly recommended to use Job-type compute configurations with the Spot with Fallback option.  If users are experiencing issues with compute launch times taking too long and not meeting data processing SLAs, the next step is to tweak the Idle Time and Max Idle Compute settings within the Compute Configuration in DataForge

The Max Idle Compute setting ensures that X number of full compute are kept idle by DataForge to be used again by whichever source requests a new compute matching the configuration next.  The Idle Time sets the amount of time that a compute can be idle before it is shut down.  With these two settings, users can keep spot with fallback compute warm from one job to the next (similar to pools) so the next job does not need to wait for a compute to be provisioned.  This removes the typical 3-5 minute wait time for a compute to be provisioned but accrues cloud and databricks costs while the compute are idle.

Scale Mode Guidance

There are two ways to handle number of worker scale: Automatic and Fixed.  With fixed scale, users set the exact number of worker nodes to request for the compute.  With automatic scale, users set a minimum and maximum number of worker nodes for the compute and Databricks automatically scales the compute up or down within the min and max workers depending on how many it thinks the job needs to complete. 

When the job size is known and fairly stable, it is recommended to use fixed scale compute.  In this case, job size is the size of the input or hub table, based on the process type.  When the compute size for the job is unknown or varies widely between inputs, use the automatic scale option. This way Databricks can scale up to the lowest amount of workers it thinks are needed and scale down the workers when they are nearing completion or done with their portion of the processing. As jobs become more stable and the required workers are known, users can switch back to fixed scale.  This approach provides the best performance and reduces costs.

Instance Type and Number of Workers Guidance

Unfamiliar users tend to use the lowest price instance type available and increase the number of workers to match the job size needed.  However, this is not usually the best way to scale the environment and job processing.

Whenever compute performance needs to be improved, it is always best to think about how to best scale up and scale out.  Scaling up is upsizing the instance type to a large instance with more CPU and Memory available (e.g. going from xlarge to 2xlarge).  Scaling out is increasing the number of worker nodes in a compute but keeping the instance type the same. 

Every compute configuration contains a Driver Node and a certain number of Worker Nodes of a certain instance type.  Every driver and worker node will have a certain number of vCPUs, also known as Cores, and Memory to work with.  For example: in AWS, an m5n.xlarge instance has 4 vCPUs and 16 GiB of Memory and in Azure, a Standard_VS3_v2 instance has 4 vCPUs and 14 GiB of Memory. 

The total number of nodes times the number of vCPUs per node based on the instance type is the total amount of compute available for the job.  For example, a compute with 10 workers using m5n.xlarge would have 44 vCPUs.  This is 4 vCPUs per worker node plus 4 vCPUs for the driver node.

We recommend aiming for 5-10 workers in a compute and adjusting the instance type to match the total number of vCPUs and Memory needed for the job. 

Configuring less nodes and larger instance types can improve the performance and cost-efficiency of Spark in many scenarios, but must be balanced with overall compute stability when using spot instances. 

The Cloud providers have a limited number of spot instances available in any given Region and Availability Zone.  When using spot availability, there are times when the cloud provider may reclaim spot instances which are then removed from your running compute.  If too many spot instances are removed or if the driver node is spot and removed, the DataForge job will fail and attempt to retry based on the number of retries defined in each Source setting.  When deciding which instance type to request in a compute, use the cloud provider spot instance advisor links below to identify the instance type that will provide the number of vCPUs and Memory needed and ideally has the lowest interruption frequency.

AWS Instance Advisor: https://aws.amazon.com/ec2/spot/instance-advisor/

Azure Spot Instance Advisor: https://azure.microsoft.com/en-us/pricing/spot-advisor/

Google Cloud Spot VMs: https://cloud.google.com/compute/docs/instances/spot

Be sure to select the correct Region matching the Region your DataForge environment is deployed in.

Below is an illustration of a compute with many workers and how it can be converted to a larger instance type with fewer workers to reduce spot availability interruptions.

AWS

  Instance Type vCPU per Node Memory per Node Num Workers Total Num Nodes (incl Driver) Total vCPUs Total Memory
Compute 1 m5n.large 2 8 40 41 82 328
Compute 2 m5n.2xlarge 8 32 10 11 88 352

Azure

  Instance Type vCPU per Node Memory per Node Num Workers

Total Num Nodes

(incl Driver)

Total vCPUs Total Memory
Compute 1 Standard_DS3_v2 4 14 20 21 84 294
Compute 2 Standard_DS4_v2 8 28 10 11 88 308

GCP

  Instance Type vCPU per Node Memory per Node Num Workers

Total Num Nodes

(incl Driver)

Total vCPUs Total Memory
Compute 1 n4-standard-4 4 16 20 21 84 294
Compute 2 n4-standard-8 8 32 10 11 88 308
 

Tuning Compute Configurations

The best tools to use when tuning Compute Configurations are the Spark UI and Metrics that are provided by Databricks for every process that is run in DataForge.  To find these two options, navigate to the Process tab of a Source within DataForge (or the main Processing page), click the Job Run ID hyperlink for the process you want to investigate, and select the Databricks option. 

Navigating to Databricks from DataForge Process link

Selecting Spark UI or Metrics within Databricks

With the Spark UI open on either the Jobs tab or the Stages tab, focus on the number of total tasks needed for the job and the number of tasks running at a time in the Active and Completed Jobs sections.  The number of tasks able to run at a time is equal to the total number of vCPUs available in the compute. The larger the total number of tasks needed, the more opportunity users have to increase the parallelism of compute processing by adding more vCPUs via larger instance types or more workers.

 

With the Metrics UI open on the Main tab, focus on the Compute Load, Compute Memory, Compute CPU, and Compute Network charts to understand overall performance across all Driver and Worker nodes in the compute. View the bottom half of the metrics page to see a stacked graph for each individual node and the health of the node represented in colors.  Users should aim for the node colors to be either orange or red as this means that the workers are being fully utilized and cost is not being wasted with idle workers. 

Databricks Metrics UI

Common Challenges and Recommendations

1) The processing is taking too long to complete, affecting SLAs. What should I look for and take action on?

Check the following information on the Databricks Job Run page using the Spark UI or Metrics pages.

Compute Load Chart: If the 'Procs' line is consistently above the 'CPUs' line, users may want to add more total vCPUs to the compute configuration through larger instance types or more workers.

Compute Memory Chart: If the 'Use' line is consistently at or above the 'Total line', users may want to add more total memory to the compute configuration through larger instance types or more workers.

Compute Network Chart: If there are many spikes in the network while the compute is running, it is likely that workers are needing to re-shuffle data partitions across other worker nodes to complete their tasks.  Users may want to upsize the instance type and reduce the number of workers, which in turn reduces the amount of re-shuffling required.

Individual Node Charts (All Red): If the driver and worker nodes are showing all red, it means the nodes are all being fully utilized.  Users may want to upsize the instance type or add more workers to reduce the load on each node in the compute.

2) The processing is very quick but my costs are too high.  What should I look for and take action on?

Check the following information on the Databricks Job Run page using the Spark UI or Metrics pages.

Individual Node Charts (Green or Blue): If some or all of the worker nodes are showing Green or Blue colors, it is likely that the compute is oversized and users could benefit from downsizing the instance type or reducing the number of workers.

3) Some of the processes are very quick while others are very slow using the same compute configuration.  What should I do?

Use process configuration overrides!  Navigate to the Process Configuration documentation to get familiar with the Process Override concept.  It is common that some processes can be quicker than others.  In these cases, it's best to set the Default Compute Configuration to match the CPU and Memory needs of the majority of the processes.  Then add individual process configuration overrides to utilize a larger or smaller compute configuration for specific processes.

An example of when this can happen is if the source contains a large amount of data but there are very few or no relations or rules to process.  This would result in the Refresh process taking longer than the Enrichment process so the user could add a process configuration override to use a different compute configuration for Refresh.

Example of Processing where Process Override helps

Process Config with Override for Refresh Process

4) Running data outputs from a Key Refresh Type Source to Snowflake or SQL Server is taking longer than expected, even with a large compute configuration in DataForge.  What should I do?

Check the output process logs in DataForge for a completed output run and look for the messages "Running key delete query" and "Delete query complete".  Compare the time it took between these two messages to the overall process time.  If the majority of the time is spent waiting between these messages, then users could speed the output process up by increasing the Snowflake Warehouse.  Similarly, users could increase the compute size in SQL Server to speed the process up.  This is because DataForge handles the process up to the point the temp table is created, and then relies on Snowflake and SQL Server compute to complete the key deletes to update the existing table.  When Snowflake/SQL Server finishes, they send the 'all clear' back to DataForge to finish the process and mark it as a success or failure.

5) Periodically, processes are failing because of spot availability with jobs frequently running on 30-minute and hourly cadences.  What should I do?

The first settings to think about tweaking in a production environment that has jobs constantly running are the compute configuration Idle Time and Max Idle Compute parameters.  If compute is frequently shut down just before a new input starts processing, users can increase the Idle Time of the compute to stay idle long enough that the next time an input runs, it picks the idle compute up rather than trying to launch a new compute.  If there are multiple sources starting at the same time with the same compute configuration, increasing the Max Idle Compute may help as well to keep multiple compute ready to go when the next set of inputs begin processing.

Tweaked Idle Time of 5 Minutes and Max Idle Compute of 5

6) A process failed and the DataForge logs show "Job run terminated or failed to launch. Retrying process."  After opening the Job Run Databricks link, I see "AWS_MAX_SPOT_INSTANCE_COUNT_EXCEEDED_FAILURE" error. What does this mean and what can I do?

AWS sets maximum concurrent spot vCPU limits for each account.  If users receive this error, it means that the maximum number of vCPUs were running at one time on Spot Availability that AWS is allowing.  AWS dynamically adjusts these spot limits based on how many spot instances are used.  Users can also request increases in these Spot instance limits in AWS on the Limits service rather than relying on the dynamic adjustments. For more information, visit Amazon EC2 Service Quota documentation

Example of Spot Limit Error

Viewing Spot Quotas in AWS and Requesting More

Full Example Compute Configuration of Spot with Fallback

Below are full screenshots of compute configurations set up to use the Spot with Fallback option in both AWS and Azure.  The Compute Type needs to be set to Job and the Availability needs to be set to the "Spot with Fallback".  All other settings can be adjusted as needed. 

AWS Compute Configuration

 

Azure Compute Configuration

Updated

Was this article helpful?

0 out of 0 found this helpful