-
How can I view the data for a DataForge source?
- Every Source has a managed hub table and source view that can be queried directly in Databricks using either the SQL Editor or a notebook, provided you have permission. The hub table is named "hub_<source id>". If a view name is defined in the Source settings, that view can be queried instead of the hub table (see the example query below).
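For example, a query against the hub table for a hypothetical Source with id 123 might look like the sketch below; substitute your own source id (or the view name defined in the Source settings):

```sql
-- Minimal sketch: querying a Source's hub table from the Databricks SQL Editor.
-- "hub_123" is a hypothetical table name; replace 123 with your Source's id,
-- or query the view name defined in the Source settings instead.
SELECT *
FROM hub_123
LIMIT 100;
```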
-
How do I fix inputs that have failed?
- The steps to resolution may differ depending on the type of failure. Each process (Ingestion, Parse, CDC, etc.) has associated logs that can be found in the Process tab for each Source or Output. These logs provide error messages to help identify what went wrong during the job run. After determining the cause of the failure and taking any steps needed to resolve it, you can reset the failed processes for an individual input, or for the entire Source using the triple-dot menu on the header row. For more information, please visit Resetting Processes. You may also see newer inputs Queued, waiting on the failed older input to run. This happens with Sources that use certain Data Refresh Types, such as Key refresh, which must process earlier inputs before moving on.
-
My input processed successfully, but some of the rule columns are still null. How do I get the rules to calculate and show in my table?
- If a rule is created before an Input is processed, the rule will be calculated and its results shown in the hub table. If a rule is created after the input has run, it will create the new column in the hub table, but will not automatically calculate the results for previous inputs. If you can see the rule columns in the result table but all of the rows are null, you need to reset enrichment on the Input using the options on its triple-dot menu. To recalculate rules for all Inputs, use the Recalculate All or Changed options at the Source level (header-row triple-dot menu). A quick way to check whether a rule column is populated is shown in the sketch below.
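To confirm whether a rule column has been populated before and after resetting enrichment, a simple count query can help. This is a sketch assuming a hypothetical hub table hub_123 and a hypothetical rule column rule_total_amount:

```sql
-- Sketch: count how many rows still have a NULL value in a rule column.
-- hub_123 and rule_total_amount are placeholders for your own table and column.
SELECT
  COUNT(*) AS total_rows,
  COUNT(CASE WHEN rule_total_amount IS NULL THEN 1 END) AS null_rule_rows
FROM hub_123;
```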
-
I've noticed some of my process steps are taking longer than I'd like. How can I tweak the Cluster settings so there is more compute power available during Refresh for example?
- You can easily set up or adjust Process Configurations and Cluster Configurations in DataForge. If you find that you need more compute for certain processes, you can create a Process Override to use a different Cluster Configuration for a specific step. For more information, please visit the Process Overrides documentation. For help troubleshooting cluster performance, please visit the Production Workload Cluster documentation.
-
I'm getting errors that say "Insufficient Instance Capacity Failure". How do I fix this?
- This means that your cloud provider (AWS/Azure/GCP) does not have enough capacity in your availability zone to fulfill your cluster request at the moment. You can resolve this by changing your Cluster Configuration (in DataForge) or Cluster Pool (in Databricks) to use a different instance type. For example, if you are running all jobs on m5a but you see that m5n has better availability, you can change your Cluster Configuration to use m5n. Any jobs launched after making this change will pick up the new Cluster Instance Type. DataForge recommends using fleets for cluster node types for the best availability and lowest cost.
-
Is it possible to use JOINS in the Source Query?
- It is possible to use JOINs in the Source Query, but we recommend keeping them to simple JOINs for performance reasons. If the logic is more complicated or contains subqueries, we recommend building the data and logic into DataForge using new Sources and Relations to simplify the Source Query. A simple example is sketched below.
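As an illustration, a Source Query with a single simple JOIN might look like the sketch below; the table and column names are hypothetical and would come from your source database:

```sql
-- Sketch of a simple Source Query with one JOIN.
-- orders and customers are hypothetical tables in the source database.
SELECT o.order_id,
       o.order_date,
       o.amount,
       c.customer_name
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;
```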
-
I have duplicate fields in my Raw Schema for a Source. How did this happen and how do I resolve it?
- This can happen when schemas are inferred as newer inputs are ingested, and it typically affects file-based Sources rather than tables. As new inputs are ingested, DataForge will try to infer the schema (if it is not provided in the metadata), and if the data types are ambiguous it can read a column as a different data type. This results in a new raw attribute with "_2" appended to its name. To prevent this, you can declare the schema to DataForge so that it does not attempt to infer it for each Input. To resolve the duplicate column, either use a rule with Coalesce() to read between the two columns (see the sketch below), or delete the Input that is listed in the Raw Schema next to the column you want to remove. Deleting the input that created the extra column will automatically remove that column from the Source.
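For illustration, a coalescing rule expression might look like the sketch below; order_date and order_date_2 are hypothetical attribute names, and the exact way your rules reference attributes may differ in your environment:

```sql
-- Sketch: read between the original column and its "_2" duplicate,
-- returning the first non-null value. Attribute names are hypothetical.
COALESCE(order_date, order_date_2)
```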