Cleanup Configuration

Cleanup Configuration defines retention settings for data lake objects and metadata. It is accessible from the main menu under System Configuration -> Cleanup Configurations.
 
 
A default configuration object is created automatically and is assigned to all existing and new sources.
 

Cleanup Parameters

  • Hub delete type: "Keep latest version only" removes all non-current files and folders containing hub table data from the data lake. "Keep all versions" disables cleanup of hub data objects.
  • Inputs No Effective Range Period: Retention period for batches (inputs) of data with no effective range. Applies to sources with the Key, Timestamp, and Sequence refresh types.
  • Full Refresh Inputs Period: Retention period for non-current (not latest) batches (inputs) of data. Applies to sources with the Full refresh type.
  • Zero Record Inputs Period: Retention period for inputs (batches) containing zero records. Applies to all source refresh types.
  • Failed Ingestion Inputs Period: Retention period for inputs that failed ingestion.
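For illustration only, a minimal Python sketch of how these retention settings could be represented is shown below. The field names and day-based units are assumptions for the example, not the actual DataForge data model.

```python
from dataclasses import dataclass

# Hypothetical representation of a cleanup configuration.
# Field names and units (days) are illustrative assumptions,
# not the actual DataForge schema.
@dataclass
class CleanupConfiguration:
    hub_delete_type: str = "Keep latest version only"  # or "Keep all versions"
    inputs_no_effective_range_days: int = 30   # Key/Timestamp/Sequence refresh types
    full_refresh_inputs_days: int = 14         # non-current Full refresh inputs
    zero_record_inputs_days: int = 7           # inputs that ingested zero records
    failed_ingestion_inputs_days: int = 7      # inputs whose ingestion failed

default_config = CleanupConfiguration()
```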

 

The cleanup process deletes all inputs from the metadata store according to these settings. It then removes any orphaned objects from the data lake, including objects belonging to deleted inputs and sources.
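As a rough mental model only (not the actual implementation), the process can be thought of as two passes: first expire inputs past their retention window, then drop data lake objects that no longer map to a live input.

```python
from datetime import datetime, timedelta, timezone

# Conceptual sketch under assumed names and a single retention window.
# Phase 1: drop inputs whose retention period has expired.
# Phase 2: drop data lake objects that no longer belong to a live input.
def run_cleanup(inputs, lake_objects, retention_days):
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)

    # Phase 1: metadata cleanup
    live_inputs = [i for i in inputs if i["created"] >= cutoff]

    # Phase 2: remove orphaned data lake objects
    live_ids = {i["input_id"] for i in live_inputs}
    kept_objects = [o for o in lake_objects if o["input_id"] in live_ids]
    return live_inputs, kept_objects
```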
 

Configuring Cleanup for a Source

When a new source is created, it is automatically assigned the default cleanup configuration. To change it, open the source settings and select the desired cleanup configuration:
 
 

Customizing Cleanup Run Schedule

Cleanup schedules are maintained on the Schedules page. A default cleanup schedule named "Cleanup" is created in every environment as part of the upgrade and is set to run nightly at 12 PM UTC. Do not rename this schedule; the process looks for a schedule with this exact name and spelling in order to recognize it and launch. To adjust the run times, open the Cleanup schedule as you would any other schedule, change the cron values to the desired start times, and save the changes. Schedule changes take effect after the next cleanup process run completes or when the Core service is restarted.
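For reference, a few example cron values are sketched below using the standard five-field format (minute hour day month weekday). These are illustrative only; confirm the exact cron dialect accepted on the Schedules page in your environment.

```python
# Illustrative cron expressions and their meanings.
# The exact format DataForge expects may differ; check the Schedules page.
example_schedules = {
    "0 12 * * *": "every day at 12:00 UTC (similar to the default)",
    "0 2 * * *":  "every day at 02:00 UTC",
    "0 2 * * 0":  "once a week, Sundays at 02:00 UTC",
}
```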

Default Cleanup Schedule

Since Cleanup also helps control cloud costs by removing unneeded data lake objects, DataForge recommends running Cleanup at least once per week. If Cleanup has not run on a Source within the last 7 days, the Source status will show a garbage can icon.

Cleanup can be manually started from the Service Configurations settings found in the System Configuration menu.

Customizing Cluster Configuration for Cleanup

It is possible to customize the cluster configuration used for cleanup operations. Every DataForge workspace has a cluster configuration named "Cleanup". This cluster configuration can be edited, and users can change any setting other than the cluster name "Cleanup".

NOTE: There can only be one custom cleanup cluster at any point in time, because DataForge looks for a cluster named "Cleanup" in the background to process the cleanup operation.

If cleanup operations run for a very long time in your environment, you may be able to reduce processing time and potentially save costs by customizing the Cleanup cluster and switching it to a Single Node cluster.

Suggested settings to use for this configuration:

  • Name: Cleanup
  • Description: Cleanup cluster config
  • Cluster Type: Job
  • Scale Mode: Single Node
  • Job Task Type: DataForge jar
  • Spark Version: 11.3.x-scala2.12
  • Node Type: m-fleet.2xlarge (may want to scale up or down depending on size of environment)
  • Enable Elastic Disk: True
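To make the mapping concrete, below is a rough sketch of how these settings might look as a Databricks-style single-node jobs cluster specification. The keys follow the public Databricks clusters API; the exact structure DataForge stores for the "Cleanup" cluster configuration may differ, so treat this as an illustration of the Single Node settings rather than something to paste in verbatim.

```python
# Illustrative Databricks-style single-node cluster spec for the suggested settings.
# Keys follow the public Databricks clusters API; DataForge's stored format may differ.
cleanup_cluster_sketch = {
    "cluster_name": "Cleanup",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "m-fleet.2xlarge",   # scale up or down per environment size
    "num_workers": 0,                    # Single Node: driver only, no workers
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
    "enable_elastic_disk": True,
}
```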
