Cleanup Parameters
| Parameter | Description |
| --- | --- |
| Hub delete type | "Keep latest version only" removes all non-current files and folders containing hub table data from the data lake. "Keep all versions" disables cleanup of hub data objects. |
| Inputs No Effective Range Period | Retention period for batches (inputs) of data with no effective range. Applies to sources with key, timestamp, and sequence refresh types. |
| Full Refresh Inputs Period | Retention period for non-current (not latest) batches (inputs) of data. Applies to sources with the full refresh type. |
| Zero Record Inputs Period | Retention period for inputs (batches) containing zero records. Applies to all source refresh types. |
| Failed Ingestion Inputs Period | Retention period for inputs that failed ingestion. |
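For illustration, a source's cleanup parameters might be filled in as follows. The values below are hypothetical examples, not product defaults; choose retention periods that match your own recovery and audit requirements:

```
Hub delete type:                   Keep latest version only
Inputs No Effective Range Period:  30 days
Full Refresh Inputs Period:        14 days
Zero Record Inputs Period:         7 days
Failed Ingestion Inputs Period:    7 days
```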
Configuring Cleanup for the Source
Customizing Cleanup Run Schedule
Cleanup schedules are maintained on the Schedules page. As part of the upgrade, a default cleanup schedule named "Cleanup" is created in every environment and set to run nightly at 12 PM UTC. Do not rename this schedule: the process looks for a schedule with this exact name and spelling in order to recognize it and launch. To adjust the run times, open the Cleanup schedule as you would any other schedule, change the cron values to set the desired start times, and save the changes. Any schedule change takes effect after the next cleanup process run completes or when the Core service is restarted.
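For example, to move the nightly run to 2:00 AM UTC, the cron value might look like the following. This is a sketch using standard five-field cron syntax (minute, hour, day of month, month, day of week); adjust it to whatever cron format the Schedules page accepts in your environment:

```
0 2 * * *
```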
Default Cleanup Schedule
Since Cleanup also helps control cloud costs by removing unneeded data lake objects, DataForge recommends running Cleanup at least once per week. If Cleanup has not run on a Source within the last 7 days, the Source status will show a garbage can icon.
Cleanup can also be started manually from the Service Configurations settings in the System Configuration menu.
Customizing Cluster Configuration for Cleanup
The cluster configuration used for cleanup operations can be customized. Every DataForge workspace includes a cluster configuration named "Cleanup". This configuration can be edited, and users can change any setting other than the cluster name "Cleanup".
NOTE: There can only be one custom cleanup cluster at any point in time, because DataForge looks for a cluster named "Cleanup" in the background to process the cleanup operation.
If cleanup operations run for a very long time in your environment, you can reduce processing time and potentially save costs by customizing the Cleanup cluster and switching it to a Single Node cluster.
Suggested settings for this configuration (a sketch of the equivalent Databricks cluster definition follows the list):
- Name: Cleanup
- Description: Cleanup cluster config
- Cluster Type: Job
- Scale Mode: Single Node
- Job Task Type: DataForge jar
- Spark Version: 11.3.x-scala2.12
- Node Type: m-fleet.2xlarge (scale up or down depending on the size of your environment)
- Enable Elastic Disk: True
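For reference, these settings roughly correspond to the following Databricks job-cluster definition. This is a hedged sketch of the equivalent Databricks Clusters API payload, shown only to clarify what the UI settings map to; the Cleanup cluster itself is managed through the DataForge cluster configuration page, not by submitting JSON directly:

```json
{
  "cluster_name": "Cleanup",
  "spark_version": "11.3.x-scala2.12",
  "node_type_id": "m-fleet.2xlarge",
  "num_workers": 0,
  "enable_elastic_disk": true,
  "spark_conf": {
    "spark.databricks.cluster.profile": "singleNode",
    "spark.master": "local[*]"
  },
  "custom_tags": {
    "ResourceClass": "SingleNode"
  }
}
```

Setting `num_workers` to 0 together with the `singleNode` profile and `ResourceClass` tag is the standard Databricks pattern for a Single Node cluster, which matches the Scale Mode setting above.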