Core Concepts
Understanding the fundamental concepts of Qarion ETL.
Flows
A flow is a declarative definition of a data transformation pipeline. It describes:
- What data to process
- How to transform it
- What the output should look like
Flows are defined in TOML files and processed by flow plugins.
Flow Types
Qarion ETL supports multiple flow types, each implementing a specific pattern:
- change_feed: Tracks changes in data over time
- delta_publishing: Processes financial transactions with merge operations
- sessionization: Groups events into sessions based on time windows
Flow Definition Structure
id = "my_flow"
name = "My Flow"
flow_type = "change_feed"
namespace = "raw"
[input]
columns = [...]
primary_key = ["id"]
[properties]
# Flow-specific properties
Datasets
A dataset defines the structure of data at a specific point in a pipeline. It includes:
- Column definitions with types
- Constraints (primary keys, required fields)
- Metadata (descriptions, properties)
Dataset Structure
name = "orders"
namespace = "raw"
[columns]
[columns.id]
schema_type = "integer"
required = true
description = "Order identifier"
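Constraints like `required` can be enforced by checking records against the column definitions. The sketch below mirrors the TOML layout above as a plain dict; the validator itself is hypothetical, not part of Qarion ETL's API.

```python
# Illustrative sketch: checking one record against dataset column definitions.
# The dict layout mirrors the TOML above; the validate() helper is hypothetical.
columns = {
    "id": {"schema_type": "integer", "required": True},
    "customer_id": {"schema_type": "integer", "required": False},
}

PYTHON_TYPES = {"integer": int, "string": str}

def validate(record: dict, columns: dict) -> list[str]:
    """Return a list of constraint violations for one record."""
    errors = []
    for name, spec in columns.items():
        value = record.get(name)
        if value is None:
            if spec.get("required"):
                errors.append(f"missing required column: {name}")
            continue
        if not isinstance(value, PYTHON_TYPES[spec["schema_type"]]):
            errors.append(f"{name}: expected {spec['schema_type']}")
    return errors

print(validate({"id": 1}, columns))           # []
print(validate({"customer_id": 7}, columns))  # ['missing required column: id']
```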
Transformations
A transformation is the process of converting data from one dataset to another. Transformations are:
- Engine-agnostic: Described using transformation instructions
- Executable: Can be run on various engines (SQLite, Pandas, etc.)
- Code-generatable: Can be converted to SQL, DBT, or Airflow code
Transformation Instructions
Transformation instructions describe what to transform without specifying how:
instruction = TransformationInstruction(
    source_table="orders",
    target_table="orders_changes",
    rules=[
        TransformationRule(
            operation="select",
            columns=["id", "customer_id", "amount"]
        ),
        TransformationRule(
            operation="filter",
            condition="batch_id = :batch_id"
        )
    ]
)
Flow DAGs
A FlowDAG (Directed Acyclic Graph) represents the execution plan for a flow. It consists of:
- Nodes: Individual steps (ingestion or transformation)
- Edges: Dependencies between steps
Node Types
- Ingestion: Load data from files
- Transformation: Transform data between datasets
DAG Example
[File] → [Ingestion Node] → [Landing Table]
                                  ↓
                    [Transformation Node] → [Change Feed Table]
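Because the edges form a DAG, an execution order can be derived by topological sorting. A sketch using the standard library's graphlib; the node names are illustrative:

```python
# Illustrative sketch: deriving an execution order for FlowDAG nodes via
# topological sort. graphlib is standard library; node names are assumptions.
from graphlib import TopologicalSorter

# Map each node to the set of nodes it depends on.
dag = {
    "ingest_orders": set(),                     # ingestion node: no dependencies
    "build_change_feed": {"ingest_orders"},     # transformation depends on ingestion
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['ingest_orders', 'build_change_feed']
```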
Engines
An engine is the execution environment for transformations. Qarion ETL supports:
- SQLiteEngine: SQLite database
- PandasInMemoryEngine: In-memory Pandas DataFrames
- PandasLocalStorageEngine: Pandas with local file storage
- DuckDBEngine: DuckDB database
Engine Abstraction
Engines are abstracted through:
- BaseEngine: Common interface for all engines
- BaseDialectAdapter: Converts transformation instructions to engine-specific code
Plugins
Qarion ETL uses a plugin architecture for extensibility:
Flow Plugins
Define flow-specific logic:
- Dataset generation
- DAG generation
- Transformation instruction generation
Engine Plugins
Provide engine implementations:
- Connection management
- Query execution
- Data retrieval
Code Generator Plugins
Generate code in various formats:
- SQL files
- DBT projects
- Airflow DAGs
Repository Plugins
Provide storage backends:
- Local file storage
- Database storage
Code Generation
Qarion ETL can generate executable code from flows:
SQL Generation
Generates SQL files with parameterized queries:
INSERT INTO orders_changes
SELECT * FROM orders
WHERE batch_id = :batch_id
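A generator producing SQL like the above could walk the rule list and emit one clause per rule. This is a simplified sketch of the idea, not Qarion ETL's dialect adapter; the rule dicts are illustrative.

```python
# Illustrative sketch: rendering select/filter rules into parameterized SQL
# like the example above. Not Qarion ETL's actual code generator.
def rules_to_sql(source: str, target: str, rules: list[dict]) -> str:
    columns, conditions = "*", []
    for rule in rules:
        if rule["operation"] == "select":
            columns = ", ".join(rule["columns"])
        elif rule["operation"] == "filter":
            conditions.append(rule["condition"])
    sql = f"INSERT INTO {target}\nSELECT {columns} FROM {source}"
    if conditions:
        sql += "\nWHERE " + " AND ".join(conditions)
    return sql

print(rules_to_sql(
    "orders", "orders_changes",
    [{"operation": "filter", "condition": "batch_id = :batch_id"}],
))
```

Note that the :batch_id placeholder is left in the generated text, so the query stays parameterized until execution time.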
DBT Generation
Generates complete DBT projects:
- Model files (.sql)
- Schema definitions (schema.yml)
- Project configuration (dbt_project.yml)
- Documentation (.md files)
Airflow Generation
Generates Airflow DAGs:
- Task definitions
- Dependencies
- Execution functions
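An Airflow generator can emit the DAG module as text, so Airflow itself is not needed at generation time. A rough sketch of that idea; the template, task wiring, and function name are assumptions, not Qarion ETL's generator.

```python
# Hypothetical sketch: emitting an Airflow DAG module as source text.
# The generated code is never executed here; the template is illustrative.
def generate_airflow_dag(dag_id: str, tasks: list[str]) -> str:
    lines = [
        "from airflow import DAG",
        "from airflow.operators.python import PythonOperator",
        "",
        f'with DAG(dag_id="{dag_id}") as dag:',
    ]
    for task in tasks:
        lines.append(
            f'    {task} = PythonOperator(task_id="{task}", '
            "python_callable=lambda: None)"
        )
    if len(tasks) > 1:
        # Chain the tasks in order: t1 >> t2 >> ...
        lines.append("    " + " >> ".join(tasks))
    return "\n".join(lines)

print(generate_airflow_dag("my_flow", ["ingest_orders", "build_change_feed"]))
```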
Schema Evolution
Qarion ETL supports schema evolution with different modes:
Strict Mode
No schema changes allowed. Raises errors on schema mismatches.
Forward Mode
Allows adding new columns. Existing columns cannot be removed or changed.
Normal Mode
Allows schema changes with validation.
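Forward mode's rule — new columns may be added, existing ones may not be removed or changed — can be expressed as a small compatibility check. A hypothetical sketch, comparing column-to-type mappings:

```python
# Illustrative sketch of a forward-mode compatibility check: every old column
# must survive with the same type; extra new columns are allowed. Hypothetical.
def is_forward_compatible(old: dict[str, str], new: dict[str, str]) -> bool:
    return all(col in new and new[col] == typ for col, typ in old.items())

old = {"id": "integer", "amount": "float"}
print(is_forward_compatible(old, {**old, "status": "string"}))  # True: column added
print(is_forward_compatible(old, {"id": "integer"}))            # False: column dropped
print(is_forward_compatible(old, {**old, "amount": "string"}))  # False: type changed
```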
Batch Processing
Qarion ETL processes data in batches:
- Batch ID: Unique identifier for each batch
- Previous Batch ID: Reference to the previous batch for incremental processing
- Batch Filtering: Automatic filtering by batch ID
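The combination of a batch_id column and the :batch_id parameter seen in the generated SQL means each run touches only its own batch. A minimal sketch with sqlite3; the table and column names are illustrative.

```python
# Illustrative sketch: filtering rows by batch ID with a parameterized query,
# mirroring the :batch_id placeholder in the generated SQL. Names are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, batch_id TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "b1"), (2, 20.0, "b1"), (3, 30.0, "b2")],
)

# Only rows tagged with the current batch are processed.
rows = conn.execute(
    "SELECT id FROM orders WHERE batch_id = :batch_id",
    {"batch_id": "b2"},
).fetchall()
print(rows)  # [(3,)]
```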