
Engines and Storage

Understanding Qarion ETL's engine and storage layer configuration.

Overview

Qarion ETL uses a layered architecture for data processing and storage:

  1. Engines: Execution environments for running transformations (SQLite, Pandas, DuckDB, PostgreSQL, Polars, PySpark, SparkSQL)
  2. Storage Backends: File storage for input data (local filesystem, S3, FTP/SFTP, PostgreSQL, Kafka, Azure Service Bus)
  3. Repository Storage: Metadata storage for definitions and history (local files or database tables)
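
A minimal sketch showing how the three layers appear in one configuration file (the engine name, paths, and storage values are illustrative; each key is described in the sections below):

# qarion-etl.toml (illustrative sketch)

# Repository storage - where flow/dataset definitions and migration history live
dataset_storage = "local"
flow_storage = "local"
schema_storage = "local"

# Engine - where transformations execute
[engine]
name = "duckdb"
[engine.config]
path = "data/qarion-etl.duckdb"

# Storage backends are not configured here; they are selected from the paths
# used in flow inputs, e.g. path = "data/orders" or path = "s3://my-bucket/data/"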

Engines

Engines are the execution environments where transformations run. They provide the database or data processing capabilities.

Qarion ETL supports two types of engines:

  1. Processing Engine ([engine]): Used for data transformations, SQL execution, and data processing
  2. Metadata Engine ([metadata_engine]): Used for storing metadata (flows, datasets, migrations) in database storage

Processing Engine Configuration

The processing engine is configured in the [engine] section of config.toml:

[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"

This engine is used for:

  • Executing SQL transformations
  • Running data processing tasks
  • Storing and querying dataset data
  • All transformation operations

Metadata Engine Configuration

The metadata engine is optional and configured in the [metadata_engine] section:

[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"

If no [metadata_engine] is specified, the processing engine is also used for metadata storage (backward compatible).
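
For example, a configuration with only [engine] defined can still enable database metadata storage; the metadata tables are then created in the processing engine's database (a sketch with an illustrative path):

dataset_storage = "database"
flow_storage = "database"

# Only the processing engine is defined - no [metadata_engine]
[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"   # metadata tables live here as well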

The metadata engine is used for:

  • Storing flow definitions (when flow_storage = "database")
  • Storing dataset definitions (when dataset_storage = "database")
  • Storing migration history (when schema_storage = "database")
  • All metadata-related database operations

Separating Processing and Metadata Engines

You can use different engines for processing and metadata storage. This is useful when:

  • Processing on Spark/Pandas: Run transformations on Spark or Pandas while storing metadata in SQLite/PostgreSQL
  • Different Performance Requirements: Use a fast in-memory engine for processing and a persistent database for metadata
  • Resource Isolation: Separate processing workloads from metadata management

Example: Processing on Pandas, Metadata in SQLite

# Use database storage for metadata (top-level keys come before any table header)
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"

# Processing engine - for data transformations
[engine]
name = "pandas_memory"
[engine.config]
# No configuration required

# Metadata engine - for storing metadata
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"

Example: Processing on PySpark, Metadata in PostgreSQL

# Processing engine - for data transformations
[engine]
name = "pyspark"
[engine.config]
master = "local[*]"
app_name = "qarion-etl"

# Metadata engine - for storing metadata
[metadata_engine]
name = "postgresql"
[metadata_engine.config]
connection_string = "postgresql://user:pass@localhost/metadata"

When to Use Separate Engines:

  • ✅ Processing large datasets with Spark/Pandas while keeping metadata in a lightweight database
  • ✅ Using in-memory engines for processing but needing persistent metadata storage
  • ✅ Isolating processing workloads from metadata operations
  • ✅ Different engines optimized for different purposes

When to Use the Same Engine:

  • ✅ Simple setups where one engine is sufficient
  • ✅ Development and testing environments
  • ✅ When processing engine supports both use cases well

Available Engines

SQLite Engine

SQLite is a file-based SQL database engine, ideal for local development and small to medium datasets.

Configuration:

[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"

Features:

  • File-based (single database file)
  • No server required
  • ACID-compliant transactions
  • Good for development and testing
  • Limited concurrency

Use When:

  • Developing locally
  • Working with small to medium datasets
  • Need a simple, file-based solution
  • Testing transformations

Pandas In-Memory Engine

Pandas-based engine that stores data in memory. Fast but data is lost when the process ends.

Configuration:

[engine]
name = "pandas_memory"
[engine.config]
# No configuration required

Features:

  • In-memory storage (very fast)
  • No persistence (data lost on exit)
  • Good for testing and development
  • Supports DataFrame operations

Use When:

  • Testing transformations quickly
  • Development and prototyping
  • Data doesn't need to persist
  • Working with small datasets

Pandas Local Storage Engine

Pandas-based engine with local file persistence using Parquet format.

Configuration:

[engine]
name = "pandas_local"
[engine.config]
storage_dir = "data/pandas"

Features:

  • Persists data to Parquet files
  • Fast read/write operations
  • Good for analytical workloads
  • Supports DataFrame operations

Use When:

  • Need persistence with Pandas
  • Working with analytical data
  • Prefer Parquet format
  • Local file-based storage

DuckDB Engine

DuckDB is an in-process analytical database, optimized for analytical queries.

Configuration:

[engine]
name = "duckdb"
[engine.config]
path = "data/qarion-etl.duckdb"

Features:

  • In-process analytical database
  • Optimized for analytical queries
  • Supports SQL and Parquet
  • Fast columnar operations

Use When:

  • Analytical workloads
  • Need fast columnar operations
  • Working with large datasets
  • Complex analytical queries

PostgreSQL Engine

PostgreSQL is a production-grade relational database, ideal for production environments and multi-user scenarios.

Configuration:

[engine]
name = "postgresql"
[engine.config]
host = "localhost"
port = 5432
database = "mydb"
user = "myuser"
password = "${credential:db_password}" # Using credential store

Or using connection string:

[engine]
name = "postgresql"
[engine.config]
connection_string = "postgresql://user:password@localhost:5432/mydb"

Features:

  • Production-grade relational database
  • ACID-compliant transactions
  • Multi-user support
  • Excellent for production environments
  • Supports complex SQL operations
  • Requires psycopg2 or psycopg2-binary package

Installation:

pip install psycopg2-binary

Use When:

  • Production environments
  • Multi-user scenarios
  • Need robust transaction support
  • Centralized database management
  • Large-scale data processing
  • Team collaboration

Configuration Options:

  • host: Database host (default: localhost)
  • port: Database port (default: 5432)
  • database: Database name (required if not using connection_string)
  • user: Database user (optional, can use credential store)
  • password: Database password (optional, recommended to use credential store)
  • connection_string: Full connection string (alternative to individual parameters)

Example with Credential Store:

# qarion-etl.toml
[engine]
name = "postgresql"
[engine.config]
host = "db.example.com"
port = 5432
database = "production_db"
user = "${credential:db_user}"
password = "${credential:db_password}"

PySpark Engine

PySpark engine uses Apache Spark's DataFrame API for distributed data processing.

Configuration:

[engine]
name = "pyspark"
[engine.config]
app_name = "Qarion ETL"
master = "local[*]"
enable_hive_support = false
warehouse_dir = "spark-warehouse"

Or for Spark cluster:

[engine]
name = "pyspark"
[engine.config]
app_name = "Qarion ETL"
master = "spark://master:7077"
config."spark.executor.memory" = "4g"
config."spark.executor.cores" = "2"

Features:

  • Distributed data processing
  • Spark DataFrame API
  • Supports large-scale data processing
  • Can run locally or on Spark cluster
  • Supports Hive (optional)
  • Requires pyspark package

Installation:

pip install pyspark

Configuration Options:

  • app_name: Application name for Spark (default: "Qarion ETL")
  • master: Spark master URL (default: "local[*]")
    • "local[*]": Run locally using all available cores
    • "local[4]": Run locally using 4 cores
    • "spark://host:port": Connect to Spark cluster
    • "yarn": Run on YARN cluster
  • enable_hive_support: Enable Hive support (default: false)
  • warehouse_dir: Spark warehouse directory (optional)
  • config: Additional Spark configuration options (dict)

Use When:

  • Processing large datasets that don't fit in memory
  • Need distributed processing capabilities
  • Working with big data workloads
  • Running on Spark clusters
  • Need Spark DataFrame operations

Example: Local Development

[engine]
name = "pyspark"
[engine.config]
app_name = "MyProject"
master = "local[*]"

Example: Spark Cluster

[engine]
name = "pyspark"
[engine.config]
app_name = "ProductionPipeline"
master = "spark://spark-master:7077"
config."spark.executor.memory" = "8g"
config."spark.executor.cores" = "4"
config."spark.sql.shuffle.partitions" = "200"

Polars Engine

Polars is a fast DataFrame library written in Rust with Python bindings, optimized for analytical workloads.

Configuration:

[engine]
name = "polars"
[engine.config]
storage_dir = "data/polars" # Optional: for persistence

Or in-memory only:

[engine]
name = "polars"
[engine.config]
# No storage_dir set: in-memory only

Features:

  • Very fast DataFrame operations (Rust-based)
  • Lazy evaluation for query optimization
  • Columnar data processing
  • Optional persistence to Parquet files
  • Supports standard SQL queries
  • Memory efficient
  • Requires polars package

Installation:

pip install polars

Configuration Options:

  • storage_dir: Optional directory for persisting DataFrames as Parquet files. If not specified, operates in-memory only.

Use When:

  • Need very fast DataFrame operations
  • Working with large datasets
  • Analytical workloads
  • Want lazy evaluation benefits
  • Need memory-efficient processing
  • Prefer Rust-based performance

Example: In-Memory

[engine]
name = "polars"
[engine.config]
# In-memory only, no persistence

Example: With Persistence

[engine]
name = "polars"
[engine.config]
storage_dir = "data/polars"

Polars vs Pandas:

  • Polars is significantly faster for most operations
  • Polars uses lazy evaluation (can optimize queries)
  • Polars is more memory efficient
  • Pandas has a larger ecosystem and more features
  • Polars is better for analytical workloads
  • Pandas is better for data manipulation and exploration

SparkSQL Engine

SparkSQL engine is optimized for SQL-based data processing using Spark SQL.

Configuration:

[engine]
name = "sparksql"
[engine.config]
app_name = "Qarion ETL-SQL"
master = "local[*]"
enable_hive_support = true
warehouse_dir = "spark-warehouse"

Features:

  • Optimized for SQL operations
  • Spark SQL execution engine
  • Hive support enabled by default
  • SQL-first approach
  • Same distributed capabilities as PySpark
  • Requires pyspark package

Installation:

pip install pyspark

Configuration Options:

  • app_name: Application name for Spark (default: "Qarion ETL-SQL")
  • master: Spark master URL (default: "local[*]")
  • enable_hive_support: Enable Hive support (default: true)
  • warehouse_dir: Spark warehouse directory (optional)
  • config: Additional Spark configuration options (dict)

Use When:

  • Primarily using SQL for transformations
  • Need Hive SQL compatibility
  • SQL-focused workflows
  • Want optimized SQL execution
  • Working with Hive tables

Example: SQL-Focused Workflow

[engine]
name = "sparksql"
[engine.config]
app_name = "SQLPipeline"
master = "local[*]"
enable_hive_support = true

PySpark vs SparkSQL:

Feature      | PySpark                      | SparkSQL
Primary API  | DataFrame API                | SQL
Hive Support | Optional (default: false)    | Enabled (default: true)
Best For     | DataFrame operations         | SQL operations
Use Case     | Programmatic data processing | SQL-based transformations

Choosing an Engine

Engine        | Best For                            | Persistence           | Performance | Complexity
SQLite        | Development, small datasets         | ✅ File-based         | Good        | Low
Pandas Memory | Testing, prototyping                | ❌ In-memory only     | Very Fast   | Low
Pandas Local  | Analytical workloads                | ✅ Parquet files      | Fast        | Low
DuckDB        | Analytical queries, large data      | ✅ File-based         | Very Fast   | Medium
PostgreSQL    | Production, multi-user              | ✅ Server-based       | Excellent   | Medium
Polars        | Fast analytical workloads           | ✅ Optional (Parquet) | Excellent   | Low
PySpark       | Large-scale distributed processing  | ✅ Cluster-based      | Excellent   | High
SparkSQL      | SQL-based big data processing       | ✅ Cluster-based      | Excellent   | High

Performance Comparison:

  • Fastest (single machine): Polars (Rust-based, optimized for analytics)
  • Very fast: DuckDB, Pandas Memory (in-memory operations)
  • Fast: Pandas Local (with Parquet persistence)
  • Best at scale: PostgreSQL, PySpark, SparkSQL (server-based or distributed workloads)

Storage Backends

Storage backends handle file storage for input data. They abstract over the local filesystem and remote sources such as S3, FTP/SFTP, PostgreSQL, Kafka, and Azure Service Bus.

Storage Backend Configuration

Storage backends are automatically detected from file paths. No explicit configuration is required in most cases.
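
For example, the path scheme determines which backend is used (a sketch; the bucket and host names are illustrative):

[properties.input_ingestion]
# Local filesystem (no scheme)
path = "data/orders"
# S3:   path = "s3://my-bucket/data/"
# SFTP: path = "sftp://sftp.example.com/data/"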

Local Storage

Local filesystem storage backend. Used automatically for local file paths.

Features:

  • File and directory operations
  • Pattern matching (glob patterns)
  • Recursive directory scanning
  • No additional configuration needed

Example:

[properties.input_ingestion]
path = "data/orders"
pattern = "orders_*.csv"
recursive = true

S3 Storage

AWS S3 storage backend for remote file access.

Features:

  • S3 path resolution (s3://bucket/path)
  • File listing with patterns
  • File download
  • Credential management

Configuration:

[properties.input_ingestion]
path = "s3://my-bucket/data/"
pattern = "orders_*.csv"
recursive = true
[properties.input_ingestion.credentials]
aws_access_key_id = "your-access-key"
aws_secret_access_key = "your-secret-key"
region_name = "us-east-1"

S3 Path Format:

s3://bucket-name/path/to/files/

Credentials:

  • Can be provided in flow configuration (not recommended for production)
  • Can use credential store references: credentials = "${credential:my_aws_creds}" (recommended)
  • Can use AWS environment variables
  • Can use IAM roles (when running on AWS)

See Credential Management for secure credential management.
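
For example, an S3 input can reference a stored credential instead of embedding keys inline (a sketch; my_aws_creds is a placeholder credential name):

[properties.input_ingestion]
path = "s3://my-bucket/data/"
pattern = "orders_*.csv"
credentials = "${credential:my_aws_creds}"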

FTP Storage

FTP (File Transfer Protocol) storage backend for remote file access.

Features:

  • FTP path resolution (ftp://host/path)
  • File listing with patterns
  • File download
  • Standard FTP protocol support

Configuration:

[properties.input_ingestion]
path = "ftp://ftp.example.com/data/"
pattern = "orders_*.csv"
recursive = true
[properties.input_ingestion.credentials]
host = "ftp.example.com"
port = 21
username = "ftpuser"
password = "${credential:ftp_password}"

FTP Path Format:

ftp://hostname/path/to/files/

Installation: FTP support uses Python's built-in ftplib (standard library, no installation required).

Configuration Options:

  • host: FTP server hostname (required)
  • port: FTP server port (default: 21)
  • username: FTP username (optional, defaults to anonymous)
  • password: FTP password (optional)
  • timeout: Connection timeout in seconds (default: 30)
  • passive: Use passive mode (default: true)

Example:

[properties.input_ingestion]
path = "ftp://ftp.example.com/data/"
pattern = "*.csv"
[properties.input_ingestion.credentials]
host = "ftp.example.com"
username = "user"
password = "${credential:ftp_password}"

SFTP Storage

SFTP (SSH File Transfer Protocol) storage backend for secure remote file access.

Features:

  • SFTP path resolution (sftp://host/path)
  • Secure file transfer over SSH
  • File listing with patterns
  • File download
  • Key-based or password authentication

Configuration:

[properties.input_ingestion]
path = "sftp://sftp.example.com/data/"
pattern = "orders_*.csv"
recursive = true
[properties.input_ingestion.credentials]
host = "sftp.example.com"
port = 22
username = "sftpuser"
password = "${credential:sftp_password}"

Or with key-based authentication:

[properties.input_ingestion]
path = "sftp://sftp.example.com/data/"
[properties.input_ingestion.credentials]
host = "sftp.example.com"
username = "sftpuser"
key_filename = "/path/to/private_key"

SFTP Path Format:

sftp://hostname/path/to/files/

Installation:

pip install paramiko

Configuration Options:

  • host: SFTP server hostname (required)
  • port: SFTP server port (default: 22)
  • username: SFTP username (required)
  • password: SFTP password (optional, if using key authentication)
  • key_filename: Path to private key file (optional)
  • key_data: Private key data as string (optional)
  • timeout: Connection timeout in seconds (default: 30)

Example:

[properties.input_ingestion]
path = "sftp://sftp.example.com/data/"
pattern = "*.csv"
[properties.input_ingestion.credentials]
host = "sftp.example.com"
username = "user"
key_filename = "~/.ssh/id_rsa"

PostgreSQL Storage

PostgreSQL connector for file/blob storage using PostgreSQL database.

Features:

  • PostgreSQL path resolution (postgresql://path/to/file)
  • File storage in PostgreSQL (using bytea or large objects)
  • File listing with patterns
  • File download
  • Database-backed file storage

Configuration:

[properties.input_ingestion]
path = "postgresql://data/files/"
pattern = "*.csv"
[properties.input_ingestion.credentials]
connection_string = "postgresql://user:password@host:5432/database"
storage_table = "_file_storage"
storage_schema = "public"

Or using individual parameters:

[properties.input_ingestion.credentials]
host = "db.example.com"
port = 5432
database = "mydb"
user = "myuser"
password = "${credential:db_password}"
storage_table = "_file_storage"
storage_schema = "public"

PostgreSQL Path Format:

postgresql://path/to/file.csv
postgres://path/to/file.csv

Installation:

pip install psycopg2-binary

Configuration Options:

  • connection_string: Full PostgreSQL connection string (alternative to individual parameters)
  • host: Database host (required if not using connection_string)
  • port: Database port (default: 5432)
  • database: Database name (required if not using connection_string)
  • user: Database user
  • password: Database password
  • storage_table: Table name for file storage (default: "_file_storage")
  • storage_schema: Schema name for storage table (default: "public")

Example:

[properties.input_ingestion]
path = "postgresql://data/files/"
pattern = "*.csv"
[properties.input_ingestion.credentials]
host = "db.example.com"
database = "mydb"
user = "${credential:db_user}"
password = "${credential:db_password}"

Kafka Storage

Apache Kafka connector for message/file operations using Kafka topics.

Features:

  • Kafka path resolution (kafka://topic-name)
  • Message listing (treats messages as files)
  • Message reading
  • Topic-based file operations

Configuration:

[properties.input_ingestion]
path = "kafka://my-topic"
[properties.input_ingestion.credentials]
bootstrap_servers = "localhost:9092"
auto_offset_reset = "earliest"

Kafka Path Format:

kafka://topic-name
kafka://topic-name/partition/offset

Installation:

pip install kafka-python

Configuration Options:

  • bootstrap_servers: Kafka broker addresses (required, can be string or list)
  • auto_offset_reset: Offset reset policy (default: "earliest")
  • enable_auto_commit: Enable auto-commit (default: true)
  • consumer_timeout_ms: Consumer timeout in milliseconds (default: 1000)
  • max_list_messages: Maximum messages to list (default: 100)
  • security_protocol: Security protocol (e.g., "SASL_SSL")
  • sasl_mechanism: SASL mechanism (e.g., "PLAIN")
  • sasl_plain_username: SASL username
  • sasl_plain_password: SASL password

Example:

[properties.input_ingestion]
path = "kafka://orders-topic"
[properties.input_ingestion.credentials]
bootstrap_servers = ["kafka1:9092", "kafka2:9092"]
auto_offset_reset = "earliest"

Example with SASL:

[properties.input_ingestion.credentials]
bootstrap_servers = "kafka.example.com:9092"
security_protocol = "SASL_SSL"
sasl_mechanism = "PLAIN"
sasl_plain_username = "${credential:kafka_user}"
sasl_plain_password = "${credential:kafka_password}"

Azure Service Bus Storage

Azure Service Bus connector for message/file operations using queues and topics.

Features:

  • Azure Service Bus path resolution (azureservicebus://queue-name or asb://topic-name/subscription)
  • Message listing (treats messages as files)
  • Message reading
  • Queue and topic support

Configuration:

[properties.input_ingestion]
path = "azureservicebus://my-queue"
[properties.input_ingestion.credentials]
connection_string = "${credential:azure_service_bus_connection_string}"

Or using managed identity:

[properties.input_ingestion.credentials]
fully_qualified_namespace = "my-namespace.servicebus.windows.net"

Azure Service Bus Path Format:

azureservicebus://queue-name
asb://queue-name
azureservicebus://topic-name/subscription-name
asb://topic-name/subscription-name

Installation:

pip install azure-servicebus azure-identity

Configuration Options:

  • connection_string: Azure Service Bus connection string (required if not using fully_qualified_namespace)
  • fully_qualified_namespace: Fully qualified namespace (required if not using connection_string)
  • max_list_messages: Maximum messages to list (default: 100)

Example with Connection String:

[properties.input_ingestion]
path = "azureservicebus://orders-queue"
[properties.input_ingestion.credentials]
connection_string = "${credential:azure_service_bus_connection_string}"

Example with Managed Identity:

[properties.input_ingestion]
path = "asb://orders-topic/subscription1"
[properties.input_ingestion.credentials]
fully_qualified_namespace = "my-namespace.servicebus.windows.net"

Note: When using fully_qualified_namespace, the connector authenticates with Azure's DefaultAzureCredential (from the azure-identity package), which supports:

  • Managed Identity (when running on Azure)
  • Environment variables
  • Azure CLI credentials
  • Visual Studio Code credentials

Repository Storage

Repository storage controls where metadata (dataset definitions, flow definitions, migration history) is stored.

Storage Types

Local Storage

Stores metadata in local files (TOML/JSON files).

Configuration:

# qarion-etl.toml
dataset_storage = "local"
flow_storage = "local"
schema_storage = "local"

# Directory configuration
dataset_dir = "datasets"
flow_dir = "flows"
migration_dir = "migrations"

Complete Example:

# qarion-etl.toml
# Local storage configuration (top-level keys come before any table header)
dataset_storage = "local"
flow_storage = "local"
schema_storage = "local"
dataset_dir = "datasets"
flow_dir = "flows"
migration_dir = "migrations"

[app]
app = "Qarion ETL"
type = "project"
project_name = "my_project"

[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"

Database Storage

Stores metadata in database tables.

Configuration:

# qarion-etl.toml
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"

# Optional: namespace for metadata tables
metadata_namespace = "xt"

# Optional: separate metadata engine
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"

Complete Example:

# qarion-etl.toml
# Use database storage for all metadata (top-level keys come before any table header)
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"
metadata_namespace = "xt"

# Processing engine - in-memory for transformations
[engine]
name = "pandas_memory"

# Metadata engine - persistent storage for metadata
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"

When to Use Database Storage:

  • Team environments where metadata needs to be shared
  • Production environments requiring centralized metadata
  • When using separate metadata engine
  • Need for metadata querying and reporting

Schema History Storage

Schema history tracks the evolution of dataset schemas over time.

Local Schema History

Schema history is read from migration JSON files.

Configuration:

[schema_storage]
type = "local"
config = { migration_dir = "migrations" }

Features:

  • Migration files are source of truth
  • No database connection required
  • Version controlled
  • File-based

Use When:

  • Version-controlled projects
  • File-based workflows
  • Offline work
  • Development

Database Schema History

Schema history is stored in database tables.

Configuration:

[schema_storage]
type = "database"
config = { connection_string = "sqlite:///metadata.db", namespace = "xt" }

Features:

  • Centralized schema history
  • Database-backed
  • Multi-user support
  • Requires database connection

Use When:

  • Production environments
  • Multi-user scenarios
  • Centralized management

Complete Configuration Examples

Example 1: Simple Setup (Same Engine for Everything)

[app]
name = "my_project"
type = "data_pipeline"

# Processing Engine (also used for metadata if database storage is enabled)
[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"

# Repository Storage (Local)
[dataset_storage]
type = "local"
config = { dataset_dir = "datasets" }

[flow_storage]
type = "local"
config = { flow_dir = "flows" }

[schema_storage]
type = "local"
config = { migration_dir = "migrations" }

Example 2: Separate Processing and Metadata Engines

# Repository Storage (Database) - top-level keys come before any table header
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"

[app]
name = "my_project"
type = "data_pipeline"

# Processing Engine - for data transformations
[engine]
name = "pandas_memory"
[engine.config]
# No configuration required

# Metadata Engine - for storing metadata in database
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"

Example 3: Production Setup with Database Metadata

# Repository Storage (Database) - top-level keys come before any table header
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"

[app]
name = "production_pipeline"
type = "data_pipeline"

# Processing Engine - for large-scale data processing
[engine]
name = "duckdb"
[engine.config]
path = "data/processing.duckdb"

# Metadata Engine - for centralized metadata management
[metadata_engine]
name = "postgresql"
[metadata_engine.config]
connection_string = "postgresql://user:pass@db-server/metadata"

Best Practices

Engine Selection

  1. Development: Use SQLite for simplicity
  2. Testing: Use Pandas Memory or Polars for speed
  3. Production: Choose based on workload:
    • Analytical: Polars (fastest), DuckDB, or Pandas Local
    • Transactional: PostgreSQL (recommended) or SQLite
    • Large scale: PostgreSQL, PySpark, or SparkSQL
    • Multi-user/Team: PostgreSQL
    • Fast analytics: Polars (recommended for single-machine workloads)

Processing vs Metadata Engine Separation

Use the same engine when:

  • Simple setups where one engine is sufficient
  • Development and testing environments
  • Processing engine supports both use cases well (e.g., SQLite for both)

Use separate engines when:

  • Processing large datasets with Spark/Pandas while keeping metadata in a lightweight database
  • Using in-memory engines for processing but needing persistent metadata storage
  • Isolating processing workloads from metadata operations
  • Different engines optimized for different purposes

Example Scenarios:

  1. Spark Processing + PostgreSQL Metadata

    • Processing: Spark (for large-scale data processing)
    • Metadata: PostgreSQL (for centralized metadata management)
  2. Pandas Memory + SQLite Metadata

    • Processing: Pandas Memory (for fast in-memory operations)
    • Metadata: SQLite (for persistent metadata storage)
  3. DuckDB Processing + Same DuckDB Metadata

    • Processing: DuckDB (for analytical queries)
    • Metadata: DuckDB (same engine type, different database file; see the sketch below)
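
A sketch of the third scenario, assuming the DuckDB path option shown earlier also applies to [metadata_engine] (file names are illustrative):

dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"

# Processing engine - analytical queries
[engine]
name = "duckdb"
[engine.config]
path = "data/processing.duckdb"

# Metadata engine - same engine type, separate database file
[metadata_engine]
name = "duckdb"
[metadata_engine.config]
path = "data/metadata.duckdb"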

Storage Backend Selection

  1. Local Development: Use local storage
  2. Production: Use S3 for remote files
  3. Credentials: Store securely using credential stores (see Credential Management) or environment variables (see Configuration Guide)

Repository Storage Selection

  1. Development: Use local storage (files in git)
  2. Production: Consider database storage for centralized management
  3. Schema History: Match your workflow (local for file-based, database for centralized)

Environment Variables

Use environment variables for sensitive values and environment-specific configuration:

[engine.config]
path = "${DB_PATH:-data/qarion-etl.db}"

See Configuration Guide for details.