Recommended Cluster Configurations during Development
For environments undergoing daily development, it can be beneficial to keep fully-warm clusters available to reduce startup times. By adjusting the cluster idle settings, you can configure your clusters to stay warm all day.
Basic Infrastructure Overview
Before we detail the recommendations, it is important to have a basic understanding of the layers of infrastructure involved in cluster provisioning.
Using the default configurations, DataForge requests a specific cluster from Databricks, which in turn requests the associated virtual machines (VMs) from the cloud provider.
Databricks waits for the cloud provider to provision the VMs before installing and booting up the Databricks Runtime on those newly provisioned machines.
Finally, Databricks installs the DataForge Sparky codebase on top of its runtime, at which point DataForge begins coordinating data processing.
This provisioning and installation process can take 1-5 minutes depending on a number of factors, during which you will see the "Launching Cluster" status in the DataForge user interface.
Databricks Pools
To avoid or reduce the time spent waiting for the cloud provider to provision virtual machines every time a new cluster is requested, Databricks offers a feature called Pools.
Pools allow users to configure a set number of VMs to remain provisioned even after a job has finished running. For more details, see the Databricks Documentation.
These waiting "warm" VMs consume cloud provider credits, but do not consume Databricks Credits (DBUs) or Processing Units.
Unfortunately, Pools only eliminate the cloud provider's VM provisioning time. They do not reduce or eliminate the Databricks Runtime and DataForge Sparky installation steps required before DataForge processes can run.
DataForge Idle Clusters
To fully eliminate cluster launch wait times, DataForge includes a configuration option called "Idle Clusters". These settings let users specify how many fully-warm clusters DataForge keeps ready to execute DataForge processes, and how long they stay alive before auto-shutdown.
These waiting "warm" clusters consume cloud provider credits and DBUs, but do not consume Processing Units while not executing a DataForge process.
The following guide details a common DataForge Idle Cluster setup for a development environment that keeps developers from constantly waiting for cluster provisioning.
Warnings
Never specify a Cluster Configuration to point to an "all-purpose cluster" in Databricks.
DataForge does not support running DataForge processes on all-purpose clusters.
Building a Cluster Configuration for Development
Start by navigating to the Cluster Configurations page within your DataForge environment (Menu -> System Configuration -> Cluster Configurations). A cluster configuration of this type, called "Developer", is automatically created in every environment. You can use this cluster configuration or create a new, similar one. The Developer Cluster Configuration has an idle timeout of 30 minutes, meaning it shuts down after a 30-minute period of inactivity and starts up again when a new process is initiated.
Creating a New Developer Cluster
To create a new cluster configuration that you'll be able to reuse, click the New + button on the Cluster Configurations page. The name and description can be whatever makes sense for your team. In this example, we've named the cluster "AlwaysOnDev" to make it clear that it is meant for development purposes.
When you set up the cluster config, you will need to select the Cluster Type, Scale Mode, and Job Task Type. Set the Cluster Type to Job, the Scale Mode to Automatic, and the Job Task Type to DataForge jar. The Job Task Type should always stay on DataForge jar unless the cluster is being used for Custom Ingest, Custom Parse, or Custom Post-Output notebooks.
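As a reference, the base selections described above can be summarized in a short sketch. This is purely illustrative Python with hypothetical field names; in DataForge these values are chosen from dropdowns on the Cluster Configuration page rather than written as code.

```python
# Illustrative summary of the recommended base settings for a development
# cluster configuration (hypothetical field names; set these in the DataForge UI).
dev_cluster_config = {
    "name": "AlwaysOnDev",             # example name used in this guide
    "cluster_type": "Job",             # never point at a Databricks all-purpose cluster
    "scale_mode": "Automatic",         # "Single Node" is an option for very small datasets
    "job_task_type": "DataForge jar",  # change only for Custom Ingest/Parse/Post-Output notebooks
}
```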
You may optionally specify "Job from pool", but because we plan to keep these clusters fully warm, a Databricks Pool adds very little benefit; we recommend not using Pools for this configuration.
Once the basics are established, you'll begin to tweak the Parameters of the cluster config. For this configuration, we recommend keeping the default Node Types and settings. The default settings combined with Scale Mode set to Automatic will work well for almost all development and testing processes.
If working with small datasets (<10 MB), it may be beneficial to set up a Single Node cluster to improve cluster efficiency and performance while limiting cloud costs, since only one node (the driver node) is used with no worker nodes. To make any cluster a Single Node cluster, change the Scale Mode to "Single Node".
The last section of Parameters is the DataForge Configuration. This is where you can adjust the idle settings of the cluster. For our purposes, we will set the Idle Time to 28800. You can arrive at this number by converting the length of time you'd like to keep the cluster warm into seconds: 28,800 represents 8 hours in a work day * 60 minutes per hour * 60 seconds per minute. This effectively keeps the cluster(s) alive for the entire work day, with developers only having to wait for a cluster launch once in the morning across the entire environment.
To set an Idle Time greater than 10 hours (36,000 seconds), you will also need to adjust the "Timeout Seconds" parameter under the Job Configuration section of the Cluster Configuration Parameters. This Databricks-level setting protects against long-running rogue processes; however, it will also terminate any idle cluster that has been alive longer than this value.
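To double-check the arithmetic behind these two values, here is a small illustrative Python sketch; the resulting numbers are simply typed into the Idle Time and Timeout Seconds fields in the UI.

```python
# Convert the desired warm window (an 8-hour work day) into seconds for Idle Time,
# then confirm the Databricks-level Timeout Seconds will not kill the idle cluster.
hours_warm = 8
idle_time_seconds = hours_warm * 60 * 60   # 8 * 60 * 60 = 28800

timeout_seconds = 36_000                   # the 10-hour threshold mentioned above (assumed value)
if idle_time_seconds > timeout_seconds:
    print("Raise Timeout Seconds above your Idle Time, or the idle cluster will be terminated early.")
```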
The other setting that is important to tweak in this heavy development use case is Max Idle Clusters. A good way to think about this setting is how many developers will be working in this environment at the same time, and how many sources they will be kicking off simultaneously. The rule of thumb is ~1.75 clusters per developer, rounding up for a quicker experience or down if you are okay with occasionally waiting while clusters are stood up. Rounding up, we set Max Idle Clusters to 2 in our example since we only expect one developer to be active in the environment at a time.
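As a quick illustration of the rule of thumb, the sizing works out like this (illustrative Python; the result is entered into the Max Idle Clusters field):

```python
import math

# Rule of thumb from this guide: roughly 1.75 idle clusters per concurrent developer,
# rounded up for a snappier experience (or down to trade cost for occasional waits).
concurrent_developers = 1
max_idle_clusters = math.ceil(concurrent_developers * 1.75)
print(max_idle_clusters)  # 2
```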
These are the two main settings for a good cluster configuration during a heavy development phase. You can also experiment with Max Parallel Clusters Per Source and Cluster Launch Rate Per Source to find what works best for you. Below is a screenshot of the completed configuration covering everything we've discussed so far. Be sure to save the cluster configuration when you are done making changes.
The last steps are to attach this Cluster Configuration to a Process Configuration (also under Menu -> System Configuration) and associate the Process Configuration with the sources you will be developing on. Below are screenshots of what each of those steps looks like.
Create new Process Configuration for Cluster Config
Attach Process Configuration to Sources in Development
Finally, be sure to change the associated Cluster Configuration to a more cost-effective alternative when promoting to Production or testing daily runs prior to promotion. While useful for efficient development cycles, keeping the clusters alive/idle while not in use can cause significant unexpected cloud and Databricks costs.