When the Cleanup process starts, it locks all Source hub tables to avoid data issues.
If a Source is running a process when Cleanup starts, Cleanup skips that Source for the current run and tries again at the next scheduled run. If the same Sources are skipped at every scheduled Cleanup run, Cleanup is never able to remove their unnecessary data partition files, which can lead to ever-increasing storage costs from your cloud provider. This most often becomes an issue when a Source's schedule overlaps with the Cleanup schedule.
When Cleanup skips one or more Sources, their Source IDs are listed in the Cleanup process logs. Use this information to identify which Sources consistently overlap with the Cleanup schedule so you can adjust their schedules.
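As a rough illustration of spotting Sources that are skipped consistently, the sketch below counts how often each Source ID appears across several Cleanup runs. It assumes you have already copied the skipped Source IDs out of the Cleanup process logs by hand, one list per run; the function name `consistently_skipped` and the input shape are hypothetical and not part of DataForge.

```python
from collections import Counter


def consistently_skipped(skipped_per_run, min_runs=None):
    """Return the Source IDs skipped in at least `min_runs` Cleanup runs.

    skipped_per_run: one list of skipped Source IDs per Cleanup run
                     (hypothetical input, gathered manually from the logs).
    min_runs:        threshold; defaults to "skipped in every run".
    """
    if min_runs is None:
        min_runs = len(skipped_per_run)

    counts = Counter()
    for skipped in skipped_per_run:
        # Use a set so an ID repeated within one run is counted once.
        counts.update(set(skipped))

    return sorted(sid for sid, n in counts.items() if n >= min_runs)


# Three Cleanup runs; Source 12 was skipped every time.
runs = [[12, 40], [12], [12, 7]]
print(consistently_skipped(runs))
```

Sources returned by a check like this are the first candidates for a schedule change.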
DataForge recommends keeping a daily window of at least 1 hour in which no Sources are scheduled to run, so that Cleanup can run in it, and scheduling Sources to start no later than 30 minutes before the scheduled Cleanup run. The 30-minute buffer gives a Source that runs long time to finish processing before Cleanup starts.
You can manage this by adjusting Schedules so that Source start times do not overlap with the Cleanup start time.
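The scheduling recommendation can be checked with a few lines of arithmetic. The sketch below is a minimal illustration, not a DataForge feature: it assumes daily schedules, looks only at start times (not at how long a Source actually runs), and treats a Source as conflicting if it starts less than 30 minutes before Cleanup or during the 1-hour Cleanup window. The function `conflicting_sources` and the Source names are hypothetical.

```python
from datetime import time

BUFFER_MIN = 30  # latest allowed Source start before Cleanup, from the text
WINDOW_MIN = 60  # minimum Source-free window for Cleanup, from the text
DAY_MIN = 24 * 60


def minutes(t):
    """Minutes since midnight for a datetime.time value."""
    return t.hour * 60 + t.minute


def conflicting_sources(cleanup_start, source_starts):
    """Return names of Sources whose daily start time breaks the guideline.

    cleanup_start: datetime.time of the scheduled Cleanup run.
    source_starts: dict mapping a Source name to its datetime.time start.
    """
    buffer_start = minutes(cleanup_start) - BUFFER_MIN
    conflicts = []
    for name, start in source_starts.items():
        # Minutes elapsed from the start of the buffer to the Source start,
        # wrapping around midnight.
        offset = (minutes(start) - buffer_start) % DAY_MIN
        # Starting exactly 30 minutes before Cleanup (offset 0) is allowed.
        if 0 < offset < BUFFER_MIN + WINDOW_MIN:
            conflicts.append(name)
    return sorted(conflicts)


starts = {
    "orders": time(1, 30),     # exactly 30 minutes before Cleanup: allowed
    "customers": time(1, 45),  # inside the 30-minute buffer
    "inventory": time(2, 30),  # inside the Cleanup window
    "billing": time(3, 0),     # after the window ends: allowed
}
print(conflicting_sources(time(2, 0), starts))
```

With Cleanup scheduled at 02:00, this flags `customers` and `inventory` as the schedules to move.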
If schedules cannot be adjusted to leave an open window because of SLAs, you can run Cleanup manually at any time by visiting the Service Configurations page and using the menu to start the Cleanup process.
Because Cleanup also helps control cloud costs by removing unneeded data lake objects, DataForge recommends running Cleanup at least once per week. If Cleanup has not run for a Source within the last 7 days, the Source status shows a garbage can icon.
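The 7-day rule amounts to a simple age check. The sketch below only illustrates that arithmetic and is not how DataForge computes the icon: the function `cleanup_overdue` is hypothetical, and treating a Source with no recorded Cleanup as overdue is an assumption.

```python
from datetime import datetime, timedelta, timezone

MAX_CLEANUP_AGE = timedelta(days=7)  # weekly recommendation from the text


def cleanup_overdue(last_cleanup, now=None):
    """Return True if the last Cleanup for a Source is more than 7 days old.

    last_cleanup: timezone-aware datetime of the last Cleanup, or None.
    now:          timezone-aware datetime to compare against (defaults to UTC now).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if last_cleanup is None:
        # Assumption: a Source that has never been cleaned counts as overdue.
        return True
    return now - last_cleanup > MAX_CLEANUP_AGE


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
print(cleanup_overdue(datetime(2024, 1, 9, tzinfo=timezone.utc), now))  # 6 days old
print(cleanup_overdue(datetime(2024, 1, 7, tzinfo=timezone.utc), now))  # 8 days old
```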