Projects represent a group of configurations in DataForge that users control and are the primary vehicle for migrating changes.
Projects are an efficient and easy way for users to manage configuration changes and provide the ability to integrate DevOps best practices using GitHub. DataForge has a direct integration with GitHub, allowing you to connect projects to branches and repo paths. For more information on incorporating GitHub, visit the article Managing Projects with GitHub.
When project configurations are imported into another project, all objects within the target project are fully replaced.
Every project consists of the following objects:
- Sources
- Outputs
- Templates (including all sub-categories of Templates such as Tokens and Groups)
- Project variables
- Relations
The objects that are global to an environment are the following:
- Connections
- Schedules
- Agents
- Cleanup, Cluster, and Process Configurations
Note: All global objects referenced by a project must exist in the target DataForge environment before the project is imported. For example, if a connection named "Sql Server" is in use in the exported project, a connection named "Sql Server" must already exist (or be created) in the Connections of the DataForge environment being imported into.
A Default project is created automatically by DataForge. Environments upgrading from DataForge 6.2.x will automatically have all sources, outputs, and templates moved into the Default project.
Common Project Use Cases
There are two primary intended uses for Projects which are listed below. Projects are intentionally not tied to any specific set of sources or environments so users have full control of managing the configurations in their environments.
- Migrating full environments between different stages of Development, Testing, and Production. Each project contains a full set of all sources, outputs, and templates developed in DataForge.
- Maintaining multiple projects for independent workstreams. This assumes users have multiple "live" projects that each contain different sources, outputs, and templates related to separate workstreams.
Project UI
The Project drop-down in the top-left corner near the main menu shows which project is currently active; users can view or change it while navigating any of the Sources, Outputs, or Templates pages.
Example drop-down from Sources page
All existing projects are visible to users by navigating to the Main Menu and selecting the Projects option.
Projects Page
ID: Unique ID associated with each project in the workspace
Project Name: Name of the project
Status Columns: Disabled ingestion is indicated by a cycle icon with a strike through it. Locked project is indicated by a lock icon.
Default: Indicates the default project. Initially this is the project created by DataForge infrastructure; a project can be made the default by opening it and selecting the Default option. Only one project can be the default at a time.
Last Update Time: Shows the most recent time the project was created, updated, or imported into.
Active: Indicates which project the user is currently viewing in the UI. Also indicated by the Project drop-down in the UI header next to the main menu.
Actions: Triple-dot menus provide the ability to roll back configurations to a previously successful import or retry a previously failed import.
Creating a New Project
From the Projects UI page, select the New + button to create a new project.
New Project Creation
The following fields are displayed for users to populate:
Project Name: User-defined project name that can be edited as needed.
Description: User-defined description for the project that can be edited as needed.
Max Number of Imports: Maximum number of imports allowed for the project. Populated with a default value that can be edited as needed.
Hub View Schema: User-defined schema name for the hub views of sources and outputs within the project. This defines which schema users query in Databricks to view data. Please note this cannot be changed after it is saved to a project.
Default Project: Dictates which project users view when first logging into DataForge. Also used in custom ingestion and custom post-output sessions when the user does not define the project name in the notebook.
Active: Indicates which project the user is currently viewing in the UI. Changes as users select different projects in the project drop-down of the UI header.
Disable Ingestion: Toggle to disable ingestion from running on all sources within the project. Can be turned on or off at any time.
Lock Project: Toggle to only allow configuration changes within the project via Project Import. Removes the ability to edit sources, outputs, or templates manually.
After filling in the fields on the new Project creation page, select the Save button to officially create the project. By default, the user creating the project is granted access to it.
Managing Project User Access
User access can be controlled at the project-level using project roles. For more information on user access, visit Manage Users.
Project roles:
Each project access level includes the permissions of the level below it. To change a user's access to a project, open the project and navigate to the Users tab. Use the actions options to edit user access.
Owner: Write access to change project user access, plus the ability to delete and rename the project. Includes Editor-level access to change configurations and operate processing.
Editor: Write access to configurations, plus the ability to lock the project and import into it. Can enable/disable ingestion and operate processing.
Operator: Read-only access to configurations. Has ability to enable/disable ingestions and operate processing.
Read-only: Read-only access to configurations. No ability to operate processing.
Project Variables
Variables can be added to projects so that object names exported from one project are substituted with different object names when imported into another project. Variables persist on the project unless they are deleted.
Variables can be added for the following object types:
- Custom Cluster Configurations
- Process Configurations
- Output Connections
- Output Schema Names
- Output Table Names
- Schedules
- Source Connections
Variables that do not exist in the target project will be created automatically as placeholders when the Project Import runs. The import will be marked as failed, with a message indicating that the variables need to be populated with values. After populating the variables, the import can be restarted.
To add a Variable, open the Project from the Menu->Projects page and navigate to the Variables tab. Select the New + button in the top right to open the variable creation modal.
Choose the variable type and give the variable a name. The variable name is what appears in the export files as $<variable_name> when the project is exported.
Depending on the variable type, you will then either choose an existing configuration or type in a value for the object name. For Output Schema Name and Output Table Name types, the value should be the name of the schema or table in the Output that you want to substitute.
Example:
There are two projects, Dev and Prod. Each project has the same output using the same output connection and schema.
Both projects should not output to the same table, so a Project Variable of type Output Table Name should be added to substitute the table names. This is the variable that would exist in the Dev project.
When the Dev project is exported, the YAML file for this output will have the output table name replaced with the variable, as shown below.
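For illustration, the relevant portion of the exported output file might look like the following minimal sketch. The field name is an assumption, not the exact DataForge export schema; only the variable name and table names come from this example.

```yaml
# Illustrative excerpt of the exported output's YAML file.
# The field name below is assumed; the actual export schema may differ.
table_name: "$obt_output"  # was "dev_obt_output" before variable substitution
```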
To finish this setup, a variable is created in the Prod project to match the variable name existing in the Dev project.
In summary, when the Dev project is exported, the output table name "dev_obt_output" is replaced with "$obt_output" in the export files, and when the export files are imported into the Prod project, the variable "$obt_output" is replaced with "prod_obt_output".
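Tying it together, a minimal sketch of the matching variable values in each project (the key/value layout shown is an assumption; the names and values come from this example):

```yaml
# Dev project variable (type: Output Table Name) - illustrative layout
obt_output: dev_obt_output
```

```yaml
# Prod project variable (type: Output Table Name) - illustrative layout
obt_output: prod_obt_output
```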
Exporting a Project
Exporting a project is the first step in migrating objects and changes to a different project. To export, select the triple-dot menu on the right of the Projects UI page and select Export.
Export Project Option
The export provides a compressed zip folder of all the objects, organized in a standard folder structure. Within each folder, there is a YAML file for each individual object. Outside of the folders is a standard set of YAML files:
- defaults.yaml: contains all default values for settings that were not changed from product defaults. This simplifies YAML file contents by removing default values from every file.
- meta.yaml: indicates the export format version for DataForge to read.
- variables.yaml: exists if project variables were created and used in the exported project.
- relations.yaml: contains all source relation definitions in the following form:

```yaml
- name: "[Source]-name-[Source]"
  expression: "[This].attribute = [Related].attribute"
  cardinality: "M-M"
```
Project Folder Structure
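As a rough sketch of that structure (the folder names here are assumed from the object types above and may differ by version):

```
project-export/
├── defaults.yaml
├── meta.yaml
├── variables.yaml   # present only if project variables are used
├── relations.yaml
├── sources/         # one YAML file per source
├── outputs/         # one YAML file per output
└── templates/       # one YAML file per template
```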
Importing a Project
As previously mentioned, Project imports replace ALL contents of the project being imported into. When ready to import, navigate back to the triple-dot menu where projects are exported and select the Import option instead.
Import Project Option
After selecting Import, click the Choose File option and upload the compressed zip folder containing the project being imported.
A warning is presented letting users know that ALL objects of the project will be replaced by the contents of the zip folder that was uploaded. This includes deleting any objects that previously existed in the project but are not in the zip folder.
Import Replaces All Objects Warning
To confirm the import is still wanted, select Import one more time to launch the process. Users are then brought to the Imports tab of the project to see the status of the import.
Example Failed Import
The import metadata shows on the Imports tab of the Project including the file name that was used for the import.
In the event the Import process fails, select the Log icon for the import process record to view the error that needs to be corrected before rerunning. If additional help is needed to understand the error, please submit a support ticket to the DataForge Support team.
Once the import completes successfully, users can view all objects attached to the project by navigating to the Contained Objects tab.
Project Contained Objects Tab
After a successful import, Project Import logs show detailed counts of all objects that were updated or deleted during the import.
Import Warning when Sources will be deleted
When an import is run that no longer contains a file for a source that currently exists, the import will delete that source when it runs. As a safeguard against unwanted data deletion, imports that will delete sources are paused and show a warning symbol for the user to review and confirm that the sources are okay to delete.
The import will be marked with a Pause icon that can be clicked to see why the import was paused. The message explains that the import will delete sources if continued. Three options exist: Cancel, Proceed, or Fail Import.
- Cancel: cancels out of the modal and keeps the import on pause
- Proceed: resumes the import that will result in sources being deleted
- Fail Import: fails the import without committing any changes to the project
To see a list of the sources that will be deleted in the import, select the Cancel button and open the logs by clicking the Log icon.
The sources to be deleted are listed in the logs. Click the status icon again to choose how to handle the import.
Debugging Project Updates
When configurations are added or changed during an import, each object's created by, updated by, and timestamp information is updated accordingly.
When new objects are created, the object will show "Created by import <import number>" and a timestamp matching the start time of the import. To find the import number that matches, open the Project Imports tab within Menu->Projects to see the list of imports.
Similarly, when objects are updated or changed during an import, the object will show "Updated by import <import number>" and a timestamp matching the start time of the import.
Use this information to trace back when configurations were changed or updated for debugging and troubleshooting.
Making Changes or Fixes on a Project
It's recommended to define a project management and configuration change migration strategy before beginning development work on a project.
With the Project concept, users can incorporate DevOps best practices by establishing a project repo in GitHub to use in conjunction with DataForge exports and imports for change management. For more information and guidance on this process, please visit the article Managing Project Changes with GitHub.
At its simplest, the best practice for developing changes and migrating them to an existing project is the following:
- Export the latest version of the project that needs configuration changes
- Import the project export folder from step 1 into a new or existing project where the changes can be made and tested
- Export the project contents from the development/test project from step 2 into a compressed zip folder
- Import the project export folder from step 3 back into the original project from step 1 that needs updating
Simplified visual of change process
Merging Changes from Multiple Projects
Depending on how Project development is being managed, there may be times when multiple developers have configuration changes in separate projects that need to be merged into one project. This can be controlled through GitHub branch and merge strategies for customers who have set up this workflow.
Customers who do not utilize GitHub to manage Project changes will need to manually migrate the changes into one of the projects before following the standard export/import process to the final project. The recommended method is to manually recreate the configuration changes from one of the two projects in the other. However, this can also be done by carefully replacing targeted objects in the export folder before re-importing.
Visual of manual changes needed from multiple projects with GitHub
Deleting a Project
Projects can be deleted by opening the project from the Projects UI page. Deleting a project deletes all objects, metadata, and data within it. Given the severity, there are multiple warnings and checks as users go through the delete option.
Select the Delete option, which displays a warning message telling users all data will be deleted. Select the check box to confirm the message is understood and select the Delete option a second time.
A final confirmation box is then shown, displaying counts of objects from the project to be deleted and a final warning. To delete, type the project name exactly as it appears into the confirmation box and select the Yes button. The project is now deleted.
Initial Delete Selection
First Warning and Second Delete Selection
Second Warning and Confirmation Box
Typing Project name to Confirm and Select Yes
Best Practices for Migrating Changes in Projects (export/import)
Plan ahead for new version upgrades
Exporting from one project and importing into another is only possible when both project workspaces are on the same major version. If one workspace is on 5.2.0 and the other is on 6.1.1, exporting and importing between them will not work because one is major version 5 and the other is major version 6. For this reason, plan ahead for when to upgrade and when to migrate objects.
Communicate with the other developers in the DataForge workspaces
Be sure to communicate with the other developers in your DataForge workspaces about when an export/import is happening to avoid migrating changes that are not ready to be migrated.
Start with a backup of the target project
Before importing from one project to another, take an export of the target project's configurations that are going to be replaced or updated. Consider incorporating DevOps best practices by integrating a tool like GitHub to use repositories and branches for project backups. For more information, refer to the User Manual's Managing Projects with GitHub documentation.
Create connections, cluster configurations, process configurations, and schedules in the target
(Only applicable if importing across workspaces)
When the import runs in the target workspace, the first thing DataForge checks is whether matching workspace-wide objects exist in the target workspace. These objects can be created after getting the initial error on import, but creating them ahead of the import makes for a smoother, error-free experience.
Use a repository in GitHub to validate changes from one set of configurations to the next
Utilize a tool like GitHub to compare and contrast configuration changes between sets of configurations. For more information, refer to the User Manual's Managing Projects with GitHub documentation.
Keep a copy of the imported zipped project folder
Similar to taking a backup of the target project, save a copy of the imported project zipped folder. If there is any issue or unexpected behavior, this folder copy will jumpstart the troubleshooting process to see what configurations were and weren't included in the import. Every project export includes a standard structure of object level folders and then individual object YAML files per object within the folders.
Migrate any updated Databricks objects
(Only applicable if importing across workspaces)
When migrating any custom notebook sources or outputs in DataForge, be sure to migrate the updated notebook code to the target Databricks workspace, or recreate the notebook there if it is new. Since DataForge migration is separate from Databricks migration, this must be done manually by copying/pasting or importing the notebooks between environments.
In addition to notebooks, migrate any cluster pools from the source workspace that should be reused in the target workspace.
Check the Imports tab to verify whether the import succeeded or failed
In the target project, open the Processing page and look for the latest import that is running or has completed. If the import process shows a failed icon, expand the row and click the Logs symbol to view the error message.
Validate the target project after migrating
Scan through the updated sources and outputs to verify the additions/updates/removals made it to the target project.
Reach out to DataForge Support if assistance is needed
Errors shown in the import process logs generally point to where the issue configuration is and what is blocking it. Use these error messages to find the issues and correct them. If uncertain or stuck, submit a support request to the DataForge support team.