Core Concepts

Understanding the fundamental concepts of Qarion ETL.

Flows

A flow is a declarative definition of a data transformation pipeline. It describes:

  • What data to process
  • How to transform it
  • What the output should look like

Flows are defined in TOML files and processed by flow plugins.

Flow Types

Qarion ETL supports multiple flow types, each implementing a specific pattern:

  • change_feed: Tracks changes in data over time
  • delta_publishing: Financial transaction processing with merge operations
  • sessionization: Groups events into sessions based on time windows

Flow Definition Structure

id = "my_flow"
name = "My Flow"
flow_type = "change_feed"
namespace = "raw"

[input]
columns = [...]
primary_key = ["id"]

[properties]
# Flow-specific properties

Datasets

A dataset defines the structure of data at a specific point in a pipeline. It includes:

  • Column definitions with types
  • Constraints (primary keys, required fields)
  • Metadata (descriptions, properties)

Dataset Structure

name = "orders"
namespace = "raw"

[columns]
[columns.id]
schema_type = "integer"
required = true
description = "Order identifier"

Transformations

A transformation is the process of converting data from one dataset to another. Transformations are:

  • Engine-agnostic: Described using transformation instructions
  • Executable: Can be run on various engines (SQLite, Pandas, etc.)
  • Code-generatable: Can be converted to SQL, DBT, or Airflow code

Transformation Instructions

Transformation instructions describe what to transform without specifying how:

instruction = TransformationInstruction(
    source_table="orders",
    target_table="orders_changes",
    rules=[
        TransformationRule(
            operation="select",
            columns=["id", "customer_id", "amount"],
        ),
        TransformationRule(
            operation="filter",
            condition="batch_id = :batch_id",
        ),
    ],
)
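To make the "describe what, not how" split concrete, here is a hypothetical sketch of how a dialect adapter might render rules like the ones above into SQL. The function name and the `(operation, argument)` rule encoding are assumptions for illustration; the actual adapter API may differ:

```python
# Hypothetical renderer: turns select/filter rules into one SQL statement.
def render_sql(source_table, target_table, rules):
    columns = "*"
    conditions = []
    for op, arg in rules:
        if op == "select":
            columns = ", ".join(arg)       # project a column list
        elif op == "filter":
            conditions.append(arg)         # collect WHERE predicates
    sql = f"INSERT INTO {target_table}\nSELECT {columns} FROM {source_table}"
    if conditions:
        sql += "\nWHERE " + " AND ".join(conditions)
    return sql

sql = render_sql(
    "orders", "orders_changes",
    [("select", ["id", "customer_id", "amount"]),
     ("filter", "batch_id = :batch_id")],
)
```

Because the instruction never mentions SQL, the same rules could instead be translated into Pandas operations by a different adapter.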

Flow DAGs

A FlowDAG (Directed Acyclic Graph) represents the execution plan for a flow. It consists of:

  • Nodes: Individual steps (ingestion or transformation)
  • Edges: Dependencies between steps

Node Types

  • Ingestion: Load data from files
  • Transformation: Transform data between datasets

DAG Example

[File] → [Ingestion Node] → [Landing Table] → [Transformation Node] → [Change Feed Table]
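Because a FlowDAG is acyclic, its nodes can always be ordered so that every step runs after its dependencies. A minimal sketch using the standard library's `graphlib`, with hypothetical node names standing in for real flow steps:

```python
from graphlib import TopologicalSorter

# Edges map each node to the set of nodes it depends on.
edges = {
    "ingest_orders": set(),                  # ingestion: file → landing table
    "build_change_feed": {"ingest_orders"},  # transformation: landing → change feed
}

# A valid execution plan lists each node after all of its dependencies.
execution_order = list(TopologicalSorter(edges).static_order())
```
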

Engines

An engine is the execution environment for transformations. Qarion ETL supports:

  • SQLiteEngine: SQLite database
  • PandasInMemoryEngine: In-memory Pandas DataFrames
  • PandasLocalStorageEngine: Pandas with local file storage
  • DuckDBEngine: DuckDB database

Engine Abstraction

Engines are abstracted through:

  • BaseEngine: Common interface for all engines
  • BaseDialectAdapter: Converts transformation instructions to engine-specific code
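The shape of this abstraction can be sketched as an abstract base class with one concrete engine behind it. The method names below (`execute`, `fetch_all`) are assumptions for illustration, not the actual Qarion ETL interface:

```python
from abc import ABC, abstractmethod
import sqlite3

class BaseEngine(ABC):
    """Common interface every engine implements (names hypothetical)."""

    @abstractmethod
    def execute(self, sql, params=None): ...

    @abstractmethod
    def fetch_all(self, sql, params=None): ...

class SQLiteEngine(BaseEngine):
    """SQLite-backed engine; other engines would wrap Pandas, DuckDB, etc."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)

    def execute(self, sql, params=None):
        self.conn.execute(sql, params or {})

    def fetch_all(self, sql, params=None):
        return self.conn.execute(sql, params or {}).fetchall()

engine = SQLiteEngine()
engine.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
engine.execute("INSERT INTO orders VALUES (1, 9.5)")
rows = engine.fetch_all("SELECT id, amount FROM orders")
```

Because callers only see `BaseEngine`, swapping SQLite for DuckDB or Pandas is a configuration change rather than a code change.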

Plugins

Qarion ETL uses a plugin architecture for extensibility:

Flow Plugins

Define flow-specific logic:

  • Dataset generation
  • DAG generation
  • Transformation instruction generation

Engine Plugins

Provide engine implementations:

  • Connection management
  • Query execution
  • Data retrieval

Code Generator Plugins

Generate code in various formats:

  • SQL files
  • DBT projects
  • Airflow DAGs

Repository Plugins

Provide storage backends:

  • Local file storage
  • Database storage
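All four plugin kinds follow the same pattern: implementations are registered under a kind and a name, then looked up at runtime. A minimal registry sketch (the class and method names are assumptions; Qarion ETL's real discovery mechanism may differ):

```python
class PluginRegistry:
    """Toy registry keyed by (kind, name), e.g. ("flow", "change_feed")."""

    def __init__(self):
        self._plugins = {}

    def register(self, kind, name, plugin):
        self._plugins[(kind, name)] = plugin

    def get(self, kind, name):
        return self._plugins[(kind, name)]

registry = PluginRegistry()
registry.register("flow", "change_feed", object())
registry.register("engine", "sqlite", object())

flow_plugin = registry.get("flow", "change_feed")
```
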

Code Generation

Qarion ETL can generate executable code from flows:

SQL Generation

Generates SQL files with parameterized queries:

INSERT INTO orders_changes
SELECT * FROM orders
WHERE batch_id = :batch_id
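A query like the one above can be executed directly with SQLite's named-parameter binding, where `:batch_id` is filled in at run time. The table contents here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, batch_id INTEGER)")
conn.execute("CREATE TABLE orders_changes (id INTEGER, batch_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 100), (2, 100), (3, 101)])

# The generated query, with :batch_id bound to the current batch.
conn.execute(
    "INSERT INTO orders_changes SELECT * FROM orders "
    "WHERE batch_id = :batch_id",
    {"batch_id": 100},
)

copied = conn.execute("SELECT COUNT(*) FROM orders_changes").fetchone()[0]
```

Parameterizing the batch ID lets the same generated file be reused for every batch run.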

DBT Generation

Generates complete DBT projects:

  • Model files (.sql)
  • Schema definitions (schema.yml)
  • Project configuration (dbt_project.yml)
  • Documentation (.md files)

Airflow Generation

Generates Airflow DAGs:

  • Task definitions
  • Dependencies
  • Execution functions

Schema Evolution

Qarion ETL supports schema evolution with different modes:

Strict Mode

No schema changes allowed. Raises errors on schema mismatches.

Forward Mode

Allows adding new columns. Existing columns cannot be removed or changed.

Normal Mode

Allows schema changes with validation.
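The three modes can be sketched as checks over column sets. This is a simplified illustration; real validation in Qarion ETL also covers types and constraints, and the function name is an assumption:

```python
def check_evolution(existing, incoming, mode):
    """Compare column sets under strict / forward / normal rules."""
    existing, incoming = set(existing), set(incoming)
    if mode == "strict" and incoming != existing:
        raise ValueError("schema changed in strict mode")
    if mode == "forward" and not existing <= incoming:
        raise ValueError("forward mode cannot drop or change existing columns")
    # "normal" mode: changes pass this basic check; per-column
    # validation would happen elsewhere.
    return sorted(incoming - existing)  # newly added columns

# Adding customer_id is allowed in forward mode.
added = check_evolution(["id", "amount"],
                        ["id", "amount", "customer_id"],
                        mode="forward")
```
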

Batch Processing

Qarion ETL processes data in batches:

  • Batch ID: Unique identifier for each batch
  • Previous Batch ID: Reference to the previous batch for incremental processing
  • Batch Filtering: Automatic filtering by batch ID
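The batch filtering described above amounts to selecting only rows tagged with the current batch ID. A minimal illustration with hypothetical row data:

```python
rows = [
    {"id": 1, "batch_id": "b1"},
    {"id": 2, "batch_id": "b2"},
    {"id": 3, "batch_id": "b2"},
]

def select_batch(rows, batch_id):
    # Automatic batch filtering: only rows from the current batch
    # are processed; earlier batches were handled in previous runs.
    return [r for r in rows if r["batch_id"] == batch_id]

current = select_batch(rows, "b2")
```

Keeping the previous batch ID alongside the current one is what makes incremental processing possible: each run picks up exactly where the last one stopped.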