This section reviews the logical component architecture and then lists the infrastructure services each supported cloud provider requires to run the platform.
DataForge leverages various cloud components in AWS, Azure, or GCP, depending on the platform selected. Regardless of the provider, each component behaves the same way. This section provides an overview of these components and how they interact with one another; subsequent sections detail the specific services used on each cloud provider platform.
UI (User Interface)
The UI is the workspace you see when you log into the DataForge platform and serves as the front-end view of DataForge. It is organized around the workspace's left-hand menu and consists of screens such as Sources, Connections, Agents, and Outputs.
API
The DataForge API is a lightweight communication layer between all components of the infrastructure. It strictly passes messages between components and does not execute any business logic.
Metadata
Metadata is the component that stores the vast majority of business logic and transformation rules. It encompasses the databases and functions (in PostgreSQL) used to execute the required business logic.
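As a concrete (though hypothetical) illustration, the sketch below reads business-logic parameters from a PostgreSQL function over a standard Python connection. The schema, function name, and connection details are assumptions for illustration only and are not part of the actual DataForge metadata model.

```python
# Illustrative sketch only: the schema, function name, and connection details
# are hypothetical and do not reflect the actual DataForge metadata model.
import psycopg2

def fetch_transformation_steps(source_id: int) -> list:
    """Ask the metadata database which transformation steps apply to a source."""
    conn = psycopg2.connect(
        host="metadata-db.example.internal",  # hypothetical host
        dbname="dataforge_meta",              # hypothetical database name
        user="readonly_user",
        password="...",
    )
    try:
        with conn.cursor() as cur:
            # Hypothetical stored function returning transformation steps.
            cur.execute("SELECT * FROM meta.get_transformation_steps(%s);", (source_id,))
            return cur.fetchall()
    finally:
        conn.close()
```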
Core
Core is the orchestration component. It manages job starts, hand-offs, restarts, and queues, and runs the workflow engine. Core works closely with Metadata to execute the appropriate business logic in the appropriate order, and it works with Job Clusters to determine when to start up and shut down Spark infrastructure.
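Conceptually, this orchestration follows the pattern sketched below: poll the metadata layer for ready jobs, launch Job Clusters, and release them when they finish. The interfaces (metadata, cluster_manager) are hypothetical and simplified; this is an illustration of the pattern, not DataForge's actual implementation.

```python
# Simplified, hypothetical orchestration loop; the metadata and cluster_manager
# interfaces are placeholders used only to illustrate the pattern.
import time

def orchestrate(metadata, cluster_manager, poll_seconds: int = 30):
    """Poll for ready jobs, launch Job Clusters, and release them on completion."""
    while True:
        for job in metadata.get_ready_jobs():          # jobs whose dependencies are met
            params = metadata.get_job_parameters(job)  # validated business logic as parameters
            cluster = cluster_manager.start_job_cluster(params)
            metadata.mark_running(job, cluster.id)

        for job in metadata.get_running_jobs():
            if cluster_manager.is_finished(job.cluster_id):
                cluster_manager.shut_down(job.cluster_id)
                metadata.mark_complete(job)

        time.sleep(poll_seconds)
```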
Agent
The DataForge Agent is an optional tool that moves data from client infrastructure, typically outside the DataForge network, into the DataForge Cloud infrastructure. The Agent is used only during the Ingest stage of the Logical Data Flow.
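To illustrate the ingest hand-off, the sketch below reads a locally extracted file from the client network and uploads it to cloud object storage (here AWS S3 via boto3). The bucket name and paths are placeholders, and the real Agent's transfer mechanism may differ.

```python
# Hypothetical illustration of an agent-style transfer: push a local extract
# from the client network into cloud object storage for ingestion.
# Bucket name and paths are placeholders, not DataForge defaults.
import boto3

def upload_extract(local_path: str, bucket: str, key: str) -> None:
    """Upload a locally extracted file to S3 so the platform can ingest it."""
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)

upload_extract("/data/exports/orders_2024-01-01.csv",
               "example-dataforge-landing",
               "ingest/orders/orders_2024-01-01.csv")
```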
Interactive Cluster (Private Enterprise only)
The primary purpose of the Interactive Cluster is to perform lightweight SQL operations against the Data Lake for specific UI features such as Data Viewer.
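For example, a Data Viewer-style preview amounts to a small, bounded query against a hub table. The sketch below uses the Databricks SQL connector for Python; the hostname, HTTP path, access token, and table name are placeholders, not values from an actual DataForge deployment.

```python
# Hypothetical example of a lightweight preview query, of the kind an
# Interactive Cluster would serve for a UI feature such as Data Viewer.
# Connection details and the table name are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/protocolv1/o/0/0123-456789-abcdefgh",          # placeholder
    access_token="dapi-...",                                       # placeholder
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM hub.orders LIMIT 100")  # placeholder table
        rows = cursor.fetchall()
```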
Job Cluster
When a job is ready for execution, Core starts a new Job Cluster with specific job instructions. Business logic that has been validated in the metadata layer is passed to the job as a set of parameters and variables and then executed. Job Cluster instructions focus only on data processing; Job Clusters rely on Core to indicate when to begin, run, or shut down as part of orchestration.
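The hand-off from Core to a Job Cluster can be pictured as a parameter payload. The structure below is purely illustrative; the field names and values are assumptions, not DataForge's actual job contract.

```python
# Purely illustrative payload: field names and values are hypothetical and
# do not reflect DataForge's actual job contract.
job_parameters = {
    "job_id": 42,
    "source_id": 7,
    "process_type": "parse",            # e.g. parse, enrich, refresh, output
    "input_path": "s3://example-lake/raw/orders/2024-01-01/",
    "output_table": "hub.orders",
    "transformations": [
        {"attribute": "order_total", "expression": "quantity * unit_price"},
    ],
}
```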
Data Lake
The data lake is the location where source data is stored and depends on the cloud provider: AWS (S3), Azure (Azure Data Lake Storage Gen2), or GCP (Google Cloud Storage). Source hub tables in Databricks are based on the data stored in the data lake.
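The provider determines only the storage service and URI scheme; a hypothetical layout might look like the following, where the bucket, container, and account names are placeholders.

```python
# Placeholder examples of data lake root locations per cloud provider.
# Bucket/container/account names are hypothetical.
DATA_LAKE_ROOTS = {
    "aws": "s3://example-dataforge-lake/",                           # Amazon S3
    "azure": "abfss://lake@exampledataforge.dfs.core.windows.net/",  # ADLS Gen2
    "gcp": "gs://example-dataforge-lake/",                           # Google Cloud Storage
}
```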
Infrastructure Application
DataForge supports the AWS, Microsoft Azure, and GCP cloud providers. The diagram below represents the Standard deployment of the major DataForge services. Private Enterprise deployments run in a customer-managed VPC with private keys for compliance-critical industries, and they can be modified to adhere to specific client security policies or to integrate more closely with outside services.
Terraform (Private Enterprise only)
DataForge utilizes Terraform to codify and manage all infrastructure services, components, and dependencies.
For a detailed list of components, refer to the infrastructure GitHub repository and use the Terraform CLI show command to generate a human-readable list of all services for your specific cloud vendor.
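If you prefer to script this step, a minimal sketch is shown below. It assumes Terraform is installed and the cloned repository's working directory has already been initialized (terraform init); the repository path and output filename are placeholders.

```python
# Minimal sketch: run `terraform show` from an initialized working directory
# and save the human-readable resource list. Paths are placeholders.
import subprocess

result = subprocess.run(
    ["terraform", "show", "-no-color"],
    cwd="/path/to/infrastructure-repo",  # placeholder path to the cloned repository
    capture_output=True,
    text=True,
    check=True,
)

with open("dataforge_services.txt", "w") as f:
    f.write(result.stdout)
```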
If you do not have access to GitHub, please submit a request to the DataForge support team.