Cleanup Configuration

Cleanup Configuration defines retention settings for data lake objects and metadata. It is accessible from the main menu under System Configuration -> Cleanup Configurations.
 
 
A default configuration object is created automatically and is assigned to all existing and new sources.
 

Cleanup Parameters

  • Hub delete type: "Keep latest version only" removes all non-current files and folders containing hub table data from the data lake. "Keep all versions" disables cleanup of hub data objects.
  • Inputs No Effective Range Period: Retention period for batches (inputs) of data with no effective range. Applies to sources with the Key, Timestamp, and Sequence refresh types.
  • Full Refresh Inputs Period: Retention period for non-current (not latest) batches (inputs) of data. Applies to sources with the Full refresh type.
  • Zero Record Inputs Period: Retention period for inputs (batches) containing zero records. Applies to all source refresh types.
  • Failed Ingestion Inputs Period: Retention period for inputs that failed ingestion.
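For illustration only, a minimal Python sketch of how these retention settings could be represented is shown below. The field names and day-based units are assumptions for the example, not the actual DataForge data model.

```python
from dataclasses import dataclass

# Hypothetical representation of a cleanup configuration.
# Field names and units (days) are illustrative assumptions,
# not the actual DataForge schema.
@dataclass
class CleanupConfiguration:
    hub_delete_type: str = "Keep latest version only"  # or "Keep all versions"
    inputs_no_effective_range_days: int = 30   # Key/Timestamp/Sequence refresh types
    full_refresh_inputs_days: int = 14         # non-current Full refresh inputs
    zero_record_inputs_days: int = 7           # inputs that ingested zero records
    failed_ingestion_inputs_days: int = 7      # inputs whose ingestion failed

default_config = CleanupConfiguration()
```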

 

The cleanup process deletes all inputs from the metadata store according to these settings. It then removes any orphaned objects from the data lake, including objects belonging to deleted inputs and sources.
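As a rough mental model only (not the actual implementation), the process can be thought of as two passes: first expire inputs past their retention window, then drop data lake objects that no longer map to a live input.

```python
from datetime import datetime, timedelta, timezone

# Conceptual sketch under assumed names and a single retention window.
# Phase 1: drop inputs whose retention period has expired.
# Phase 2: drop data lake objects that no longer belong to a live input.
def run_cleanup(inputs, lake_objects, retention_days):
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)

    # Phase 1: metadata cleanup
    live_inputs = [i for i in inputs if i["created"] >= cutoff]

    # Phase 2: remove orphaned data lake objects
    live_ids = {i["input_id"] for i in live_inputs}
    kept_objects = [o for o in lake_objects if o["input_id"] in live_ids]
    return live_inputs, kept_objects
```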
 

Configuring Cleanup for a Source

When a new source is created, it is automatically assigned the default cleanup configuration. To change it, open the source settings and select the desired cleanup configuration:
 
 

Customizing Cleanup Run Schedule

Cleanup schedules are maintained on the Schedules page. A default cleanup schedule named "Cleanup" is created in every environment as part of the upgrade and is set to run nightly at 12 PM UTC. Do not rename this schedule; the process looks for a schedule with this exact name and spelling in order to recognize it and launch. To adjust the run times, open the Cleanup schedule as you would any other schedule, change the cron values to the desired start times, and save the changes. Schedule changes take effect after the next cleanup process run completes or when the Core service is restarted.
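For reference, a few example cron values are sketched below using the standard five-field format (minute hour day month weekday). These are illustrative only; confirm the exact cron dialect accepted on the Schedules page in your environment.

```python
# Illustrative cron expressions and their meanings.
# The exact format DataForge expects may differ; check the Schedules page.
example_schedules = {
    "0 12 * * *": "every day at 12:00 UTC (similar to the default)",
    "0 2 * * *":  "every day at 02:00 UTC",
    "0 2 * * 0":  "once a week, Sundays at 02:00 UTC",
}
```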

Default Cleanup Schedule

Since Cleanup also helps control cloud costs by removing unneeded data lake objects, DataForge recommends running Cleanup at least once per week. If Cleanup has not run on a Source within the last 7 days, the Source status will show a garbage can icon.

Cleanup can be manually started from the Service Configurations settings found in the System Configuration menu.

Customizing Cluster Configuration for Cleanup

It is possible to customize the cluster configuration used for cleanup operations. Every DataForge workspace has a cluster configuration named "Cleanup". This cluster configuration can be edited, and users can change any setting other than the cluster name "Cleanup".

NOTE: There can only be one custom cleanup cluster at any point in time, because DataForge looks for a cluster named "Cleanup" in the background to process the cleanup operation.

If cleanup operations run for a very long time in your environment, you may be able to reduce processing time and potentially save costs by customizing the Cleanup cluster and switching it to a Single Node cluster.

Suggested settings to use for this configuration:

  • Name: Cleanup
  • Description: Cleanup cluster config
  • Cluster Type: Job
  • Scale Mode: Single Node
  • Job Task Type: DataForge jar
  • Spark Version: 11.3.x-scala2.12
  • Node Type: m-fleet.2xlarge (may want to scale up or down depending on size of environment)
  • Enable Elastic Disk: True
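To make the mapping concrete, below is a rough sketch of how these settings might look as a Databricks-style single-node jobs cluster specification. The keys follow the public Databricks clusters API; the exact structure DataForge stores for the "Cleanup" cluster configuration may differ, so treat this as an illustration of the Single Node settings rather than something to paste in verbatim.

```python
# Illustrative Databricks-style single-node cluster spec for the suggested settings.
# Keys follow the public Databricks clusters API; DataForge's stored format may differ.
cleanup_cluster_sketch = {
    "cluster_name": "Cleanup",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "m-fleet.2xlarge",   # scale up or down per environment size
    "num_workers": 0,                    # Single Node: driver only, no workers
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
    "enable_elastic_disk": True,
}
```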
