Core Concepts

Understanding the fundamental concepts of Qarion ETL.

Flows

A flow is a declarative definition of a data transformation pipeline. It describes:

  • What data to process
  • How to transform it
  • What the output should look like

Flows are defined in TOML files and processed by flow plugins.

Flow Types

Qarion ETL supports multiple flow types, each implementing a specific pattern:

  • change_feed: Tracks changes in data over time
  • delta_publishing: Financial transaction processing with merge operations
  • sessionization: Groups events into sessions based on time windows

Flow Definition Structure

id = "my_flow"
name = "My Flow"
flow_type = "change_feed"
namespace = "raw"

[input]
columns = [...]
primary_key = ["id"]

[properties]
# Flow-specific properties

Datasets

A dataset defines the structure of data at a specific point in a pipeline. It includes:

  • Column definitions with types
  • Constraints (primary keys, required fields)
  • Metadata (descriptions, properties)

Dataset Structure

name = "orders"
namespace = "raw"

[columns]
[columns.id]
schema_type = "integer"
required = true
description = "Order identifier"

Transformations

A transformation is the process of converting data from one dataset to another. Transformations are:

  • Engine-agnostic: Described using transformation instructions
  • Executable: Can be run on various engines (SQLite, Pandas, etc.)
  • Code-generatable: Can be converted to SQL, DBT, or Airflow code

Transformation Instructions

Transformation instructions describe what to transform without specifying how:

instruction = TransformationInstruction(
    source_table="orders",
    target_table="orders_changes",
    rules=[
        TransformationRule(
            operation="select",
            columns=["id", "customer_id", "amount"],
        ),
        TransformationRule(
            operation="filter",
            condition="batch_id = :batch_id",
        ),
    ],
)
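To make the "describe what, not how" split concrete, here is a hypothetical sketch of how a dialect adapter might render rules like the ones above into SQL. The function name and the `(operation, argument)` rule encoding are assumptions for illustration; the actual adapter API may differ:

```python
# Hypothetical renderer: turns select/filter rules into one SQL statement.
def render_sql(source_table, target_table, rules):
    columns = "*"
    conditions = []
    for op, arg in rules:
        if op == "select":
            columns = ", ".join(arg)       # project a column list
        elif op == "filter":
            conditions.append(arg)         # collect WHERE predicates
    sql = f"INSERT INTO {target_table}\nSELECT {columns} FROM {source_table}"
    if conditions:
        sql += "\nWHERE " + " AND ".join(conditions)
    return sql

sql = render_sql(
    "orders", "orders_changes",
    [("select", ["id", "customer_id", "amount"]),
     ("filter", "batch_id = :batch_id")],
)
```

Because the instruction never mentions SQL, the same rules could instead be translated into Pandas operations by a different adapter.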

Flow DAGs

A FlowDAG (Directed Acyclic Graph) represents the execution plan for a flow. It consists of:

  • Nodes: Individual steps (ingestion or transformation)
  • Edges: Dependencies between steps

Node Types

  • Ingestion: Load data from files
  • Transformation: Transform data between datasets

DAG Example

[File] → [Ingestion Node] → [Landing Table] → [Transformation Node] → [Change Feed Table]
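Because a FlowDAG is acyclic, its nodes can always be ordered so that every step runs after its dependencies. A minimal sketch using the standard library's `graphlib`, with hypothetical node names standing in for real flow steps:

```python
from graphlib import TopologicalSorter

# Edges map each node to the set of nodes it depends on.
edges = {
    "ingest_orders": set(),                  # ingestion: file → landing table
    "build_change_feed": {"ingest_orders"},  # transformation: landing → change feed
}

# A valid execution plan lists each node after all of its dependencies.
execution_order = list(TopologicalSorter(edges).static_order())
```
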

Engines

An engine is the execution environment for transformations. Qarion ETL supports:

  • SQLiteEngine: SQLite database
  • PandasInMemoryEngine: In-memory Pandas DataFrames
  • PandasLocalStorageEngine: Pandas with local file storage
  • DuckDBEngine: DuckDB database

Engine Abstraction

Engines are abstracted through:

  • BaseEngine: Common interface for all engines
  • BaseDialectAdapter: Converts transformation instructions to engine-specific code
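The shape of this abstraction can be sketched as an abstract base class with one concrete engine behind it. The method names below (`execute`, `fetch_all`) are assumptions for illustration, not the actual Qarion ETL interface:

```python
from abc import ABC, abstractmethod
import sqlite3

class BaseEngine(ABC):
    """Common interface every engine implements (names hypothetical)."""

    @abstractmethod
    def execute(self, sql, params=None): ...

    @abstractmethod
    def fetch_all(self, sql, params=None): ...

class SQLiteEngine(BaseEngine):
    """SQLite-backed engine; other engines would wrap Pandas, DuckDB, etc."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)

    def execute(self, sql, params=None):
        self.conn.execute(sql, params or {})

    def fetch_all(self, sql, params=None):
        return self.conn.execute(sql, params or {}).fetchall()

engine = SQLiteEngine()
engine.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
engine.execute("INSERT INTO orders VALUES (1, 9.5)")
rows = engine.fetch_all("SELECT id, amount FROM orders")
```

Because callers only see `BaseEngine`, swapping SQLite for DuckDB or Pandas is a configuration change rather than a code change.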

Plugins

Qarion ETL uses a plugin architecture for extensibility:

Flow Plugins

Define flow-specific logic:

  • Dataset generation
  • DAG generation
  • Transformation instruction generation

Engine Plugins

Provide engine implementations:

  • Connection management
  • Query execution
  • Data retrieval

Code Generator Plugins

Generate code in various formats:

  • SQL files
  • DBT projects
  • Airflow DAGs

Repository Plugins

Provide storage backends:

  • Local file storage
  • Database storage
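All four plugin kinds follow the same pattern: implementations are registered under a kind and a name, then looked up at runtime. A minimal registry sketch (the class and method names are assumptions; Qarion ETL's real discovery mechanism may differ):

```python
class PluginRegistry:
    """Toy registry keyed by (kind, name), e.g. ("flow", "change_feed")."""

    def __init__(self):
        self._plugins = {}

    def register(self, kind, name, plugin):
        self._plugins[(kind, name)] = plugin

    def get(self, kind, name):
        return self._plugins[(kind, name)]

registry = PluginRegistry()
registry.register("flow", "change_feed", object())
registry.register("engine", "sqlite", object())

flow_plugin = registry.get("flow", "change_feed")
```
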

Code Generation

Qarion ETL can generate executable code from flows:

SQL Generation

Generates SQL files with parameterized queries:

INSERT INTO orders_changes
SELECT * FROM orders
WHERE batch_id = :batch_id
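A query like the one above can be executed directly with SQLite's named-parameter binding, where `:batch_id` is filled in at run time. The table contents here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, batch_id INTEGER)")
conn.execute("CREATE TABLE orders_changes (id INTEGER, batch_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 100), (2, 100), (3, 101)])

# The generated query, with :batch_id bound to the current batch.
conn.execute(
    "INSERT INTO orders_changes SELECT * FROM orders "
    "WHERE batch_id = :batch_id",
    {"batch_id": 100},
)

copied = conn.execute("SELECT COUNT(*) FROM orders_changes").fetchone()[0]
```

Parameterizing the batch ID lets the same generated file be reused for every batch run.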

DBT Generation

Generates complete DBT projects:

  • Model files (.sql)
  • Schema definitions (schema.yml)
  • Project configuration (dbt_project.yml)
  • Documentation (.md files)

Airflow Generation

Generates Airflow DAGs:

  • Task definitions
  • Dependencies
  • Execution functions

Schema Evolution

Qarion ETL supports schema evolution with different modes:

Strict Mode

No schema changes allowed. Raises errors on schema mismatches.

Forward Mode

Allows adding new columns. Existing columns cannot be removed or changed.

Normal Mode

Allows schema changes with validation.
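The three modes can be sketched as checks over column sets. This is a simplified illustration; real validation in Qarion ETL also covers types and constraints, and the function name is an assumption:

```python
def check_evolution(existing, incoming, mode):
    """Compare column sets under strict / forward / normal rules."""
    existing, incoming = set(existing), set(incoming)
    if mode == "strict" and incoming != existing:
        raise ValueError("schema changed in strict mode")
    if mode == "forward" and not existing <= incoming:
        raise ValueError("forward mode cannot drop or change existing columns")
    # "normal" mode: changes pass this basic check; per-column
    # validation would happen elsewhere.
    return sorted(incoming - existing)  # newly added columns

# Adding customer_id is allowed in forward mode.
added = check_evolution(["id", "amount"],
                        ["id", "amount", "customer_id"],
                        mode="forward")
```
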

Batch Processing

Qarion ETL processes data in batches:

  • Batch ID: Unique identifier for each batch
  • Previous Batch ID: Reference to the previous batch for incremental processing
  • Batch Filtering: Automatic filtering by batch ID
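The batch filtering described above amounts to selecting only rows tagged with the current batch ID. A minimal illustration with hypothetical row data:

```python
rows = [
    {"id": 1, "batch_id": "b1"},
    {"id": 2, "batch_id": "b2"},
    {"id": 3, "batch_id": "b2"},
]

def select_batch(rows, batch_id):
    # Automatic batch filtering: only rows from the current batch
    # are processed; earlier batches were handled in previous runs.
    return [r for r in rows if r["batch_id"] == batch_id]

current = select_batch(rows, "b2")
```

Keeping the previous batch ID alongside the current one is what makes incremental processing possible: each run picks up exactly where the last one stopped.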