Production Workload Cluster Configuration Recommendations

This article guides users through best practices for setting up Cluster Configurations in a production DataForge environment. It is recommended to read the Spark and Databricks Overview first to understand how the DataForge, Databricks, and Cloud Provider services work together for processing.

Overview

There are many ways to set up cluster configurations in DataForge, and these configurations can be tweaked and tuned to deliver the performance needed to meet processing SLAs. For example, in development environments, idle settings can be tweaked to keep clusters warm and reduce wait times, improving the user experience. In Production environments, however, cluster configurations should be tuned for both stability and cost optimization. As source data grows, this tuning should be revisited periodically based on recent performance.

Cluster Configuration Types

Job: Job clusters use the configurations and settings in the Cluster Configuration within DataForge to determine when to start up, how much compute is needed, what to process, and when to shut down. These clusters start and shut down entirely based on the DataForge configuration settings. When a new cluster is needed, there is typically a 3-5 minute wait for the cluster to be provisioned before it begins processing. Clusters and instances are shut down after processing is complete and the cluster has remained idle for the Max Idle Time of the Cluster Configuration.

Job from Pool: Job from Pool clusters use the configurations and settings from both the Cluster Configuration in DataForge and the Pool configuration in Databricks. When a new cluster is needed, DataForge sends the cluster request to Databricks. If the Pool has an idle instance ready, it is used as part of the requested cluster, which can save 1-2 minutes of startup time. If not, DataForge waits for the cluster to be provisioned just like a Job cluster. Instances are shut down only when the Pool in Databricks terminates them.
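
The two types differ mainly in how the cluster request that reaches Databricks is shaped. The sketch below is illustrative only (the runtime version, instance type, pool ID, and worker counts are assumed values, not DataForge's actual request): a Job cluster names an instance type directly, while a Job from Pool cluster references a pool and inherits its instance type from the pool.

    # Illustrative Databricks cluster specs (placeholder values, not actual DataForge output).

    job_cluster = {
        "spark_version": "13.3.x-scala2.12",   # example runtime version
        "node_type_id": "m5n.xlarge",          # instance type from the DataForge Cluster Configuration
        "num_workers": 8,
        # Provisioned from scratch (typically 3-5 minutes) and shut down per the DataForge idle settings.
    }

    job_from_pool_cluster = {
        "spark_version": "13.3.x-scala2.12",
        "instance_pool_id": "pool-1234-abcd",  # hypothetical pool ID; instance type comes from the pool
        "num_workers": 8,
        # Idle pool instances can shave roughly 1-2 minutes off startup; the pool controls instance shutdown.
    }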

Cluster Type Guidance

In general, users should aim to use the Job cluster type with the "Spot with Fallback" availability setting for production workloads. This minimizes cloud costs by using spot instances when available and switching to on-demand instances when spot instances are unavailable. This fallback setting is only available with Job-type cluster configurations, not Job from Pool configurations; this is a Databricks limitation. It is set in the Cloud Configuration section of the Cluster Configuration Parameters. DataForge recommends setting the First On Demand parameter to 1 so the driver node of the cluster is always on-demand. In general, if spot worker instances are lost, the cluster can recover and complete the job. However, if the driver node is spot and is reclaimed, the cluster is shut down and the process is marked failed. Skip to the Full Example Cluster Configuration of Spot with Fallback section below to see a screenshot of this setup.
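
For reference, the Spot with Fallback and First On Demand parameters correspond to the availability attributes on the underlying Databricks cluster specification. The sketch below is a hedged illustration of the AWS and Azure equivalents (field values are assumptions, not a DataForge export):

    # Availability attributes behind "Spot with Fallback" and "First On Demand = 1" (illustrative values).

    aws_attributes = {
        "availability": "SPOT_WITH_FALLBACK",  # use spot instances, fall back to on-demand when unavailable
        "first_on_demand": 1,                  # the first node (the driver) is always on-demand
        "zone_id": "auto",                     # let Databricks choose the availability zone
    }

    azure_attributes = {
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "first_on_demand": 1,
        "spot_bid_max_price": -1,              # -1 means pay up to the current on-demand price
    }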

Example AWS Settings on Cloud Configuration

Pools can be a helpful tool because they ensure the Cloud Provider keeps compute instances warm and ready to go when DataForge requests a new cluster with those instance types. This typically saves one to two minutes of cluster launch wait time. However, Pools have two major downsides: they can only be set to either fully Spot or fully On-Demand, and there are additional cloud costs while instances sit idle and warm. Idle instances created from a Pool continue to accrue infrastructure costs until the time set in the Instance Idle Auto Termination setting elapses and the Pool shuts them down.
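
For context, a Databricks Pool definition looks roughly like the sketch below (names and values are assumed for illustration). Note the single availability setting and the idle-instance settings that drive the extra cost described above:

    # Rough sketch of a Databricks instance pool definition (illustrative values).
    instance_pool = {
        "instance_pool_name": "dataforge-prod-pool",    # hypothetical name
        "node_type_id": "m5n.xlarge",
        "min_idle_instances": 2,                         # instances kept warm and billed while idle
        "idle_instance_autotermination_minutes": 15,     # how long unused instances stay warm before shutdown
        "aws_attributes": {"availability": "SPOT"},      # a pool is either all spot or all on-demand
    }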

For these reasons, it is highly recommended to use Job-type cluster configurations with the Spot with Fallback option. If cluster launch times are taking too long and data processing SLAs are not being met, the next step is to tweak the Idle Time and Max Idle Clusters settings within the Cluster Configuration in DataForge.

The Max Idle Clusters setting ensures that a set number of full clusters are kept idle by DataForge to be reused by whichever source next requests a cluster matching the configuration. The Idle Time setting controls how long a cluster can remain idle before it is shut down. With these two settings, users can keep Spot with Fallback clusters warm from one job to the next (similar to pools) so the next job does not need to wait for a cluster to be provisioned. This removes the typical 3-5 minute provisioning wait but accrues cloud and Databricks costs while the clusters sit idle.

Scale Mode Guidance

There are two ways to scale the number of workers: Automatic and Fixed. With fixed scale, users set the exact number of worker nodes to request for the cluster. With automatic scale, users set a minimum and maximum number of worker nodes, and Databricks automatically scales the cluster up or down within that range based on how many workers it estimates the job needs.

When the job size is known and fairly stable, it is recommended to use fixed scale clusters. In this case, job size is the size of the input or hub table, depending on the process type. When the cluster size for the job is unknown or varies widely between inputs, use the automatic scale option. This way, Databricks can scale up to the fewest workers it estimates the job needs and scale workers back down as they finish their portion of the processing. As jobs become more stable and the required worker count is known, users can switch back to fixed scale. This approach provides the best performance and reduces costs.
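
In Databricks cluster terms, the two scale modes map to a fixed worker count versus an autoscale range, roughly as sketched below (worker counts are illustrative):

    # Fixed scale: request an exact number of workers.
    fixed_scale = {"num_workers": 10}

    # Automatic scale: Databricks scales between the bounds based on what the running job needs.
    automatic_scale = {"autoscale": {"min_workers": 2, "max_workers": 10}}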

Instance Type and Number of Workers Guidance

Users unfamiliar with Spark tend to pick the lowest-priced instance type available and increase the number of workers to match the job size. However, this is not usually the best way to scale the environment and job processing.

Whenever cluster performance needs to be improved, consider both scaling up and scaling out. Scaling up means upsizing the instance type to a larger instance with more CPU and memory available (e.g., going from xlarge to 2xlarge). Scaling out means increasing the number of worker nodes in a cluster while keeping the instance type the same.

Every cluster configuration contains a Driver node and a certain number of Worker nodes of a given instance type. Every driver and worker node has a certain number of vCPUs, also known as cores, and a certain amount of memory to work with. For example, in AWS, an m5n.xlarge instance has 4 vCPUs and 16 GiB of memory, and in Azure, a Standard_DS3_v2 instance has 4 vCPUs and 14 GiB of memory.

The total number of nodes multiplied by the number of vCPUs per node (based on the instance type) gives the total compute available for the job. For example, a cluster with 10 m5n.xlarge workers has 44 vCPUs: 4 vCPUs for each of the 10 worker nodes (40 total) plus 4 vCPUs for the driver node.
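
A quick way to sanity-check cluster sizing is to multiply the per-node resources by the total node count (workers plus driver), as in this small helper (the instance specs passed in are examples):

    # Total up cluster compute: (workers + 1 driver) x per-node resources.
    def cluster_totals(vcpus_per_node, memory_gib_per_node, num_workers):
        total_nodes = num_workers + 1  # include the driver node
        return {
            "total_nodes": total_nodes,
            "total_vcpus": total_nodes * vcpus_per_node,
            "total_memory_gib": total_nodes * memory_gib_per_node,
        }

    # m5n.xlarge (4 vCPUs, 16 GiB) with 10 workers -> 11 nodes, 44 vCPUs, 176 GiB
    print(cluster_totals(4, 16, 10))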

We recommend aiming for 5-10 workers in a cluster and adjusting the instance type to match the total number of vCPUs and Memory needed for the job. 

Configuring fewer nodes with larger instance types can improve the performance and cost-efficiency of Spark in many scenarios, but this must be balanced against overall cluster stability when using spot instances.

The cloud providers have a limited number of spot instances available in any given Region and Availability Zone. When using spot availability, the cloud provider may reclaim spot instances, which are then removed from your running cluster. If too many spot instances are removed, or if the driver node is spot and is reclaimed, the DataForge job will fail and retry based on the number of retries defined in each Source's settings. When deciding which instance type to request for a cluster, use the cloud provider spot instance advisor links below to identify an instance type that provides the needed vCPUs and memory and, ideally, has the lowest interruption frequency.

AWS Instance Advisor: https://aws.amazon.com/ec2/spot/instance-advisor/

Azure Spot Instance Advisor: https://azure.microsoft.com/en-us/pricing/spot-advisor/

Google Cloud Spot VMs: https://cloud.google.com/compute/docs/instances/spot

Be sure to select the correct Region matching the Region your DataForge environment is deployed in.

Below is an illustration of a cluster with many workers and how it can be converted to a larger instance type with fewer workers to reduce spot availability interruptions.

AWS

Cluster | Instance Type | vCPU per Node | Memory per Node (GiB) | Num Workers | Total Num Nodes (incl Driver) | Total vCPUs | Total Memory (GiB)
Cluster 1 | m5n.large | 2 | 8 | 40 | 41 | 82 | 328
Cluster 2 | m5n.2xlarge | 8 | 32 | 10 | 11 | 88 | 352

Azure

Cluster | Instance Type | vCPU per Node | Memory per Node (GiB) | Num Workers | Total Num Nodes (incl Driver) | Total vCPUs | Total Memory (GiB)
Cluster 1 | Standard_DS3_v2 | 4 | 14 | 20 | 21 | 84 | 294
Cluster 2 | Standard_DS4_v2 | 8 | 28 | 10 | 11 | 88 | 308

GCP

Cluster | Instance Type | vCPU per Node | Memory per Node (GiB) | Num Workers | Total Num Nodes (incl Driver) | Total vCPUs | Total Memory (GiB)
Cluster 1 | n4-standard-4 | 4 | 16 | 20 | 21 | 84 | 336
Cluster 2 | n4-standard-8 | 8 | 32 | 10 | 11 | 88 | 352

Tuning Cluster Configurations

The best tools for tuning Cluster Configurations are the Spark UI and the Metrics page that Databricks provides for every process run in DataForge. To find these two options, navigate to the Process tab of a Source within DataForge (or the main Processing page), click the Job Run ID hyperlink for the process you want to investigate, and select the Databricks option.

Navigating to Databricks from DataForge Process link

Selecting Spark UI or Metrics within Databricks

With the Spark UI open on either the Jobs tab or the Stages tab, focus on the total number of tasks needed for the job and the number of tasks running at a time in the Active and Completed Jobs sections. The number of tasks that can run at once is roughly equal to the total number of worker vCPUs (cores) in the cluster; for example, a job with 400 tasks on a cluster with 40 worker vCPUs runs at most 40 tasks at a time. The larger the total number of tasks, the more opportunity there is to increase the parallelism of cluster processing by adding vCPUs via larger instance types or more workers.

With the Metrics UI open on the Main tab, focus on the Cluster Load, Cluster Memory, Cluster CPU, and Cluster Network charts to understand overall performance across all Driver and Worker nodes in the cluster. View the bottom half of the metrics page to see a stacked graph for each individual node and the health of the node represented in colors.  Users should aim for the node colors to be either orange or red as this means that the workers are being fully utilized and cost is not being wasted with idle workers. 

Databricks Metrics UI

Common Challenges and Recommendations

1) The processing is taking too long to complete, affecting SLAs. What should I look for and take action on?

Check the following information on the Databricks Job Run page using the Spark UI or Metrics pages.

Cluster Load Chart: If the 'Procs' line is consistently above the 'CPUs' line, users may want to add more total vCPUs to the cluster configuration through larger instance types or more workers.

Cluster Memory Chart: If the 'Use' line is consistently at or above the 'Total' line, users may want to add more total memory to the cluster configuration through larger instance types or more workers.

Cluster Network Chart: If there are many spikes in network traffic while the cluster is running, it is likely that workers need to shuffle data partitions across other worker nodes to complete their tasks. Users may want to upsize the instance type and reduce the number of workers, which in turn reduces the amount of re-shuffling required.

Individual Node Charts (All Red): If the driver and worker nodes are showing all red, it means the nodes are all being fully utilized.  Users may want to upsize the instance type or add more workers to reduce the load on each node in the cluster.

2) The processing is very quick but my costs are too high.  What should I look for and take action on?

Check the following information on the Databricks Job Run page using the Spark UI or Metrics pages.

Individual Node Charts (Green or Blue): If some or all of the worker nodes are showing Green or Blue colors, it is likely that the cluster is oversized and users could benefit from downsizing the instance type or reducing the number of workers.

3) Some of the processes are very quick while others are very slow using the same cluster configuration.  What should I do?

Use process configuration overrides! Navigate to the Process Configuration documentation to get familiar with the Process Override concept. It is common for some processes to be quicker than others. In these cases, it's best to set the Default Cluster Configuration to match the CPU and memory needs of the majority of the processes, then add individual process configuration overrides that use a larger or smaller cluster configuration for specific processes.

An example of when this can happen is when a source contains a large amount of data but has very few or no relations or rules to process. This would result in the Refresh process taking longer than the Enrichment process, so the user could add a process configuration override to use a different cluster configuration for Refresh.

Example of Processing where Process Override helps

Process Config with Override for Refresh Process

4) Running data outputs from a Key Refresh Type Source to Snowflake or SQL Server is taking longer than expected, even with a large cluster configuration in DataForge.  What should I do?

Check the output process logs in DataForge for a completed output run and look for the messages "Running key delete query" and "Delete query complete". Compare the time elapsed between these two messages to the overall process time. If the majority of the time is spent between these messages, users can speed the output process up by increasing the Snowflake Warehouse size (or, similarly, the compute size in SQL Server). This is because DataForge handles the process up to the point the temp table is created, and then relies on Snowflake or SQL Server compute to complete the key deletes that update the existing table. When Snowflake or SQL Server finishes, it sends the 'all clear' back to DataForge to finish the process and mark it as a success or failure.

5) Processes that run frequently (on 30-minute or hourly cadences) periodically fail because of spot availability. What should I do?

The first settings to consider tweaking in a production environment with jobs constantly running are the cluster configuration Idle Time and Max Idle Clusters parameters. If clusters are frequently shut down just before a new input starts processing, users can increase the Idle Time so the cluster stays idle long enough for the next input to pick it up rather than launching a new cluster. If multiple sources start at the same time with the same cluster configuration, increasing Max Idle Clusters can also help by keeping multiple clusters ready when the next set of inputs begins processing.

Tweaked Idle Time of 5 Minutes and Max Idle Clusters of 5

6) A process failed and the DataForge logs show "Job run terminated or failed to launch. Retrying process."  After opening the Job Run Databricks link, I see "AWS_MAX_SPOT_INSTANCE_COUNT_EXCEEDED_FAILURE" error. What does this mean and what can I do?

AWS sets a maximum number of concurrent spot vCPUs for each account. If users receive this error, it means the account hit the maximum number of vCPUs AWS allows to run on spot availability at one time. AWS dynamically adjusts these spot limits based on how many spot instances are used. Users can also request increases to these spot instance limits through the AWS Service Quotas (Limits) console rather than relying on the dynamic adjustments. For more information, visit the Amazon EC2 Service Quotas documentation.
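
As a hedged example, the current spot vCPU quota can also be checked programmatically with the boto3 Service Quotas client. The quota code below is assumed to be the "All Standard Spot Instance Requests" quota; confirm it in the Service Quotas console before relying on it:

    import boto3

    # Look up the account's spot vCPU quota (quota code is an assumption -- verify it in the console).
    client = boto3.client("service-quotas", region_name="us-east-1")  # use the Region DataForge runs in
    quota = client.get_service_quota(ServiceCode="ec2", QuotaCode="L-34B43A08")
    print(quota["Quota"]["QuotaName"], quota["Quota"]["Value"])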

Example of Spot Limit Error

Viewing Spot Quotas in AWS and Requesting More

Full Example Cluster Configuration of Spot with Fallback

Below are full screenshots of cluster configurations set up to use the Spot with Fallback option in both AWS and Azure. The Cluster Type needs to be set to Job and the Availability needs to be set to "Spot with Fallback". All other settings can be adjusted as needed.

AWS Cluster Configuration

 

Azure Cluster Configuration
