
Engines and Storage

Understanding Qarion ETL's engine and storage layer configuration.

Overview

Qarion ETL uses a layered architecture for data processing and storage:

  1. Engines: Execution environments for running transformations (SQLite, Pandas, DuckDB, PostgreSQL, Polars, PySpark, SparkSQL)
  2. Storage Backends: File storage for input data (local filesystem, S3, FTP/SFTP, PostgreSQL, Kafka, Azure Service Bus)
  3. Repository Storage: Metadata storage for definitions and history (local files or database tables)
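
A minimal sketch showing how the three layers appear in one configuration file (the engine name, paths, and storage values are illustrative; each key is described in the sections below):

# qarion-etl.toml (illustrative sketch)

# Repository storage - where flow/dataset definitions and migration history live
dataset_storage = "local"
flow_storage = "local"
schema_storage = "local"

# Engine - where transformations execute
[engine]
name = "duckdb"
[engine.config]
path = "data/qarion-etl.duckdb"

# Storage backends are not configured here; they are selected from the paths
# used in flow inputs, e.g. path = "data/orders" or path = "s3://my-bucket/data/"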

Engines

Engines are the execution environments where transformations run. They provide the database or data processing capabilities.

Qarion ETL supports two types of engines:

  1. Processing Engine ([engine]): Used for data transformations, SQL execution, and data processing
  2. Metadata Engine ([metadata_engine]): Used for storing metadata (flows, datasets, migrations) in database storage

Processing Engine Configuration

The processing engine is configured in the [engine] section of config.toml:

[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"

This engine is used for:

  • Executing SQL transformations
  • Running data processing tasks
  • Storing and querying dataset data
  • All transformation operations

Metadata Engine Configuration

The metadata engine is optional and configured in the [metadata_engine] section:

[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"

If no [metadata_engine] is specified, the processing engine is also used for metadata storage (backward compatible).
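
For example, a configuration with only [engine] defined can still enable database metadata storage; the metadata tables are then created in the processing engine's database (a sketch with an illustrative path):

dataset_storage = "database"
flow_storage = "database"

# Only the processing engine is defined - no [metadata_engine]
[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"   # metadata tables live here as well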

The metadata engine is used for:

  • Storing flow definitions (when flow_storage = "database")
  • Storing dataset definitions (when dataset_storage = "database")
  • Storing migration history (when schema_storage = "database")
  • All metadata-related database operations

Separating Processing and Metadata Engines

You can use different engines for processing and metadata storage. This is useful when:

  • Processing on Spark/Pandas: Run transformations on Spark or Pandas while storing metadata in SQLite/PostgreSQL
  • Different Performance Requirements: Use a fast in-memory engine for processing and a persistent database for metadata
  • Resource Isolation: Separate processing workloads from metadata management

Example: Processing on Pandas, Metadata in SQLite

# Use database storage for metadata (top-level keys come before any table header)
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"

# Processing engine - for data transformations
[engine]
name = "pandas_memory"
[engine.config]
# No configuration required

# Metadata engine - for storing metadata
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"

Example: Processing on PySpark, Metadata in PostgreSQL

# Processing engine - for data transformations
[engine]
name = "pyspark"
[engine.config]
master = "local[*]"
app_name = "qarion-etl"

# Metadata engine - for storing metadata
[metadata_engine]
name = "postgresql"
[metadata_engine.config]
connection_string = "postgresql://user:pass@localhost/metadata"

When to Use Separate Engines:

  • ✅ Processing large datasets with Spark/Pandas while keeping metadata in a lightweight database
  • ✅ Using in-memory engines for processing but needing persistent metadata storage
  • ✅ Isolating processing workloads from metadata operations
  • ✅ Different engines optimized for different purposes

When to Use the Same Engine:

  • ✅ Simple setups where one engine is sufficient
  • ✅ Development and testing environments
  • ✅ When processing engine supports both use cases well

Available Engines

SQLite Engine

SQLite is a file-based SQL database engine, ideal for local development and small to medium datasets.

Configuration:

[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"

Features:

  • File-based (single database file)
  • No server required
  • ACID-compliant transactions
  • Good for development and testing
  • Limited concurrency

Use When:

  • Developing locally
  • Working with small to medium datasets
  • Need a simple, file-based solution
  • Testing transformations

Pandas In-Memory Engine

Pandas-based engine that stores data in memory. Fast but data is lost when the process ends.

Configuration:

[engine]
name = "pandas_memory"
[engine.config]
# No configuration required

Features:

  • In-memory storage (very fast)
  • No persistence (data lost on exit)
  • Good for testing and development
  • Supports DataFrame operations

Use When:

  • Testing transformations quickly
  • Development and prototyping
  • Data doesn't need to persist
  • Working with small datasets

Pandas Local Storage Engine

Pandas-based engine with local file persistence using Parquet format.

Configuration:

[engine]
name = "pandas_local"
[engine.config]
storage_dir = "data/pandas"

Features:

  • Persists data to Parquet files
  • Fast read/write operations
  • Good for analytical workloads
  • Supports DataFrame operations

Use When:

  • Need persistence with Pandas
  • Working with analytical data
  • Prefer Parquet format
  • Local file-based storage

DuckDB Engine

DuckDB is an in-process analytical database, optimized for analytical queries.

Configuration:

[engine]
name = "duckdb"
[engine.config]
path = "data/qarion-etl.duckdb"

Features:

  • In-process analytical database
  • Optimized for analytical queries
  • Supports SQL and Parquet
  • Fast columnar operations

Use When:

  • Analytical workloads
  • Need fast columnar operations
  • Working with large datasets
  • Complex analytical queries

PostgreSQL Engine

PostgreSQL is a production-grade relational database, ideal for production environments and multi-user scenarios.

Configuration:

[engine]
name = "postgresql"
[engine.config]
host = "localhost"
port = 5432
database = "mydb"
user = "myuser"
password = "${credential:db_password}" # Using credential store

Or using connection string:

[engine]
name = "postgresql"
[engine.config]
connection_string = "postgresql://user:password@localhost:5432/mydb"

Features:

  • Production-grade relational database
  • ACID-compliant transactions
  • Multi-user support
  • Excellent for production environments
  • Supports complex SQL operations
  • Requires psycopg2 or psycopg2-binary package

Installation:

pip install psycopg2-binary

Use When:

  • Production environments
  • Multi-user scenarios
  • Need robust transaction support
  • Centralized database management
  • Large-scale data processing
  • Team collaboration

Configuration Options:

  • host: Database host (default: localhost)
  • port: Database port (default: 5432)
  • database: Database name (required if not using connection_string)
  • user: Database user (optional, can use credential store)
  • password: Database password (optional, recommended to use credential store)
  • connection_string: Full connection string (alternative to individual parameters)

Example with Credential Store:

# qarion-etl.toml
[engine]
name = "postgresql"
[engine.config]
host = "db.example.com"
port = 5432
database = "production_db"
user = "${credential:db_user}"
password = "${credential:db_password}"

PySpark Engine

PySpark engine uses Apache Spark's DataFrame API for distributed data processing.

Configuration:

[engine]
name = "pyspark"
[engine.config]
app_name = "Qarion ETL"
master = "local[*]"
enable_hive_support = false
warehouse_dir = "spark-warehouse"

Or for Spark cluster:

[engine]
name = "pyspark"
[engine.config]
app_name = "Qarion ETL"
master = "spark://master:7077"
config."spark.executor.memory" = "4g"
config."spark.executor.cores" = "2"

Features:

  • Distributed data processing
  • Spark DataFrame API
  • Supports large-scale data processing
  • Can run locally or on Spark cluster
  • Supports Hive (optional)
  • Requires pyspark package

Installation:

pip install pyspark

Configuration Options:

  • app_name: Application name for Spark (default: "Qarion ETL")
  • master: Spark master URL (default: "local[*]")
    • "local[*]": Run locally using all available cores
    • "local[4]": Run locally using 4 cores
    • "spark://host:port": Connect to Spark cluster
    • "yarn": Run on YARN cluster
  • enable_hive_support: Enable Hive support (default: false)
  • warehouse_dir: Spark warehouse directory (optional)
  • config: Additional Spark configuration options (dict)

Use When:

  • Processing large datasets that don't fit in memory
  • Need distributed processing capabilities
  • Working with big data workloads
  • Running on Spark clusters
  • Need Spark DataFrame operations

Example: Local Development

[engine]
name = "pyspark"
[engine.config]
app_name = "MyProject"
master = "local[*]"

Example: Spark Cluster

[engine]
name = "pyspark"
[engine.config]
app_name = "ProductionPipeline"
master = "spark://spark-master:7077"
config."spark.executor.memory" = "8g"
config."spark.executor.cores" = "4"
config."spark.sql.shuffle.partitions" = "200"

Polars Engine

Polars is a fast DataFrame library written in Rust with Python bindings, optimized for analytical workloads.

Configuration:

[engine]
name = "polars"
[engine.config]
storage_dir = "data/polars" # Optional: for persistence

Or in-memory only:

[engine]
name = "polars"
[engine.config]
# No storage_dir set: in-memory only

Features:

  • Very fast DataFrame operations (Rust-based)
  • Lazy evaluation for query optimization
  • Columnar data processing
  • Optional persistence to Parquet files
  • Supports standard SQL queries
  • Memory efficient
  • Requires polars package

Installation:

pip install polars

Configuration Options:

  • storage_dir: Optional directory for persisting DataFrames as Parquet files. If not specified, operates in-memory only.

Use When:

  • Need very fast DataFrame operations
  • Working with large datasets
  • Analytical workloads
  • Want lazy evaluation benefits
  • Need memory-efficient processing
  • Prefer Rust-based performance

Example: In-Memory

[engine]
name = "polars"
[engine.config]
# In-memory only, no persistence

Example: With Persistence

[engine]
name = "polars"
[engine.config]
storage_dir = "data/polars"

Polars vs Pandas:

  • Polars is significantly faster for most operations
  • Polars uses lazy evaluation (can optimize queries)
  • Polars is more memory efficient
  • Pandas has a larger ecosystem and more features
  • Polars is better for analytical workloads
  • Pandas is better for data manipulation and exploration

SparkSQL Engine

SparkSQL engine is optimized for SQL-based data processing using Spark SQL.

Configuration:

[engine]
name = "sparksql"
[engine.config]
app_name = "Qarion ETL-SQL"
master = "local[*]"
enable_hive_support = true
warehouse_dir = "spark-warehouse"

Features:

  • Optimized for SQL operations
  • Spark SQL execution engine
  • Hive support enabled by default
  • SQL-first approach
  • Same distributed capabilities as PySpark
  • Requires pyspark package

Installation:

pip install pyspark

Configuration Options:

  • app_name: Application name for Spark (default: "Qarion ETL-SQL")
  • master: Spark master URL (default: "local[*]")
  • enable_hive_support: Enable Hive support (default: true)
  • warehouse_dir: Spark warehouse directory (optional)
  • config: Additional Spark configuration options (dict)

Use When:

  • Primarily using SQL for transformations
  • Need Hive SQL compatibility
  • SQL-focused workflows
  • Want optimized SQL execution
  • Working with Hive tables

Example: SQL-Focused Workflow

[engine]
name = "sparksql"
[engine.config]
app_name = "SQLPipeline"
master = "local[*]"
enable_hive_support = true

PySpark vs SparkSQL:

Feature      | PySpark                      | SparkSQL
Primary API  | DataFrame API                | SQL
Hive Support | Optional (default: false)    | Enabled (default: true)
Best For     | DataFrame operations         | SQL operations
Use Case     | Programmatic data processing | SQL-based transformations

Choosing an Engine

Engine        | Best For                            | Persistence           | Performance | Complexity
SQLite        | Development, small datasets         | ✅ File-based         | Good        | Low
Pandas Memory | Testing, prototyping                | ❌ In-memory only     | Very Fast   | Low
Pandas Local  | Analytical workloads                | ✅ Parquet files      | Fast        | Low
DuckDB        | Analytical queries, large data      | ✅ File-based         | Very Fast   | Medium
PostgreSQL    | Production, multi-user              | ✅ Server-based       | Excellent   | Medium
Polars        | Fast analytical workloads           | ✅ Optional (Parquet) | Excellent   | Low
PySpark       | Large-scale distributed processing  | ✅ Cluster-based      | Excellent   | High
SparkSQL      | SQL-based big data processing       | ✅ Cluster-based      | Excellent   | High

Performance Comparison:

  • Fastest (single machine): Polars (Rust-based, optimized for analytics)
  • Very fast: DuckDB, Pandas Memory (in-memory operations)
  • Fast: Pandas Local (with Parquet persistence)
  • Best at scale: PostgreSQL, PySpark, SparkSQL (server-based or distributed workloads)

Storage Backends

Storage backends handle file storage for input data. They abstract over the local filesystem and remote sources such as S3, FTP/SFTP, PostgreSQL, Kafka, and Azure Service Bus.

Storage Backend Configuration

Storage backends are automatically detected from file paths. No explicit configuration is required in most cases.
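
For example, the path scheme determines which backend is used (a sketch; the bucket and host names are illustrative):

[properties.input_ingestion]
# Local filesystem (no scheme)
path = "data/orders"
# S3:   path = "s3://my-bucket/data/"
# SFTP: path = "sftp://sftp.example.com/data/"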

Local Storage

Local filesystem storage backend. Used automatically for local file paths.

Features:

  • File and directory operations
  • Pattern matching (glob patterns)
  • Recursive directory scanning
  • No additional configuration needed

Example:

[properties.input_ingestion]
path = "data/orders"
pattern = "orders_*.csv"
recursive = true

S3 Storage

AWS S3 storage backend for remote file access.

Features:

  • S3 path resolution (s3://bucket/path)
  • File listing with patterns
  • File download
  • Credential management

Configuration:

[properties.input_ingestion]
path = "s3://my-bucket/data/"
pattern = "orders_*.csv"
recursive = true
[properties.input_ingestion.credentials]
aws_access_key_id = "your-access-key"
aws_secret_access_key = "your-secret-key"
region_name = "us-east-1"

S3 Path Format:

s3://bucket-name/path/to/files/

Credentials:

  • Can be provided in flow configuration (not recommended for production)
  • Can use credential store references: credentials = "${credential:my_aws_creds}" (recommended)
  • Can use AWS environment variables
  • Can use IAM roles (when running on AWS)

See Credential Management for secure credential management.
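
For example, an S3 input can reference a stored credential instead of embedding keys inline (a sketch; my_aws_creds is a placeholder credential name):

[properties.input_ingestion]
path = "s3://my-bucket/data/"
pattern = "orders_*.csv"
credentials = "${credential:my_aws_creds}"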

FTP Storage

FTP (File Transfer Protocol) storage backend for remote file access.

Features:

  • FTP path resolution (ftp://host/path)
  • File listing with patterns
  • File download
  • Standard FTP protocol support

Configuration:

[properties.input_ingestion]
path = "ftp://ftp.example.com/data/"
pattern = "orders_*.csv"
recursive = true
[properties.input_ingestion.credentials]
host = "ftp.example.com"
port = 21
username = "ftpuser"
password = "${credential:ftp_password}"

FTP Path Format:

ftp://hostname/path/to/files/

Installation: FTP support uses Python's built-in ftplib (standard library, no installation required).

Configuration Options:

  • host: FTP server hostname (required)
  • port: FTP server port (default: 21)
  • username: FTP username (optional, defaults to anonymous)
  • password: FTP password (optional)
  • timeout: Connection timeout in seconds (default: 30)
  • passive: Use passive mode (default: true)

Example:

[properties.input_ingestion]
path = "ftp://ftp.example.com/data/"
pattern = "*.csv"
[properties.input_ingestion.credentials]
host = "ftp.example.com"
username = "user"
password = "${credential:ftp_password}"

SFTP Storage

SFTP (SSH File Transfer Protocol) storage backend for secure remote file access.

Features:

  • SFTP path resolution (sftp://host/path)
  • Secure file transfer over SSH
  • File listing with patterns
  • File download
  • Key-based or password authentication

Configuration:

[properties.input_ingestion]
path = "sftp://sftp.example.com/data/"
pattern = "orders_*.csv"
recursive = true
[properties.input_ingestion.credentials]
host = "sftp.example.com"
port = 22
username = "sftpuser"
password = "${credential:sftp_password}"

Or with key-based authentication:

[properties.input_ingestion]
path = "sftp://sftp.example.com/data/"
[properties.input_ingestion.credentials]
host = "sftp.example.com"
username = "sftpuser"
key_filename = "/path/to/private_key"

SFTP Path Format:

sftp://hostname/path/to/files/

Installation:

pip install paramiko

Configuration Options:

  • host: SFTP server hostname (required)
  • port: SFTP server port (default: 22)
  • username: SFTP username (required)
  • password: SFTP password (optional, if using key authentication)
  • key_filename: Path to private key file (optional)
  • key_data: Private key data as string (optional)
  • timeout: Connection timeout in seconds (default: 30)

Example:

[properties.input_ingestion]
path = "sftp://sftp.example.com/data/"
pattern = "*.csv"
[properties.input_ingestion.credentials]
host = "sftp.example.com"
username = "user"
key_filename = "~/.ssh/id_rsa"

PostgreSQL Storage

PostgreSQL connector for file/blob storage using PostgreSQL database.

Features:

  • PostgreSQL path resolution (postgresql://path/to/file)
  • File storage in PostgreSQL (using bytea or large objects)
  • File listing with patterns
  • File download
  • Database-backed file storage

Configuration:

[properties.input_ingestion]
path = "postgresql://data/files/"
pattern = "*.csv"
[properties.input_ingestion.credentials]
connection_string = "postgresql://user:password@host:5432/database"
storage_table = "_file_storage"
storage_schema = "public"

Or using individual parameters:

[properties.input_ingestion.credentials]
host = "db.example.com"
port = 5432
database = "mydb"
user = "myuser"
password = "${credential:db_password}"
storage_table = "_file_storage"
storage_schema = "public"

PostgreSQL Path Format:

postgresql://path/to/file.csv
postgres://path/to/file.csv

Installation:

pip install psycopg2-binary

Configuration Options:

  • connection_string: Full PostgreSQL connection string (alternative to individual parameters)
  • host: Database host (required if not using connection_string)
  • port: Database port (default: 5432)
  • database: Database name (required if not using connection_string)
  • user: Database user
  • password: Database password
  • storage_table: Table name for file storage (default: "_file_storage")
  • storage_schema: Schema name for storage table (default: "public")

Example:

[properties.input_ingestion]
path = "postgresql://data/files/"
pattern = "*.csv"
[properties.input_ingestion.credentials]
host = "db.example.com"
database = "mydb"
user = "${credential:db_user}"
password = "${credential:db_password}"

Kafka Storage

Apache Kafka connector for message/file operations using Kafka topics.

Features:

  • Kafka path resolution (kafka://topic-name)
  • Message listing (treats messages as files)
  • Message reading
  • Topic-based file operations

Configuration:

[properties.input_ingestion]
path = "kafka://my-topic"
[properties.input_ingestion.credentials]
bootstrap_servers = "localhost:9092"
auto_offset_reset = "earliest"

Kafka Path Format:

kafka://topic-name
kafka://topic-name/partition/offset

Installation:

pip install kafka-python

Configuration Options:

  • bootstrap_servers: Kafka broker addresses (required, can be string or list)
  • auto_offset_reset: Offset reset policy (default: "earliest")
  • enable_auto_commit: Enable auto-commit (default: true)
  • consumer_timeout_ms: Consumer timeout in milliseconds (default: 1000)
  • max_list_messages: Maximum messages to list (default: 100)
  • security_protocol: Security protocol (e.g., "SASL_SSL")
  • sasl_mechanism: SASL mechanism (e.g., "PLAIN")
  • sasl_plain_username: SASL username
  • sasl_plain_password: SASL password

Example:

[properties.input_ingestion]
path = "kafka://orders-topic"
[properties.input_ingestion.credentials]
bootstrap_servers = ["kafka1:9092", "kafka2:9092"]
auto_offset_reset = "earliest"

Example with SASL:

[properties.input_ingestion.credentials]
bootstrap_servers = "kafka.example.com:9092"
security_protocol = "SASL_SSL"
sasl_mechanism = "PLAIN"
sasl_plain_username = "${credential:kafka_user}"
sasl_plain_password = "${credential:kafka_password}"

Azure Service Bus Storage

Azure Service Bus connector for message/file operations using queues and topics.

Features:

  • Azure Service Bus path resolution (azureservicebus://queue-name or asb://topic-name/subscription)
  • Message listing (treats messages as files)
  • Message reading
  • Queue and topic support

Configuration:

[properties.input_ingestion]
path = "azureservicebus://my-queue"
[properties.input_ingestion.credentials]
connection_string = "${credential:azure_service_bus_connection_string}"

Or using managed identity:

[properties.input_ingestion.credentials]
fully_qualified_namespace = "my-namespace.servicebus.windows.net"

Azure Service Bus Path Format:

azureservicebus://queue-name
asb://queue-name
azureservicebus://topic-name/subscription-name
asb://topic-name/subscription-name

Installation:

pip install azure-servicebus azure-identity

Configuration Options:

  • connection_string: Azure Service Bus connection string (required if not using fully_qualified_namespace)
  • fully_qualified_namespace: Fully qualified namespace (required if not using connection_string)
  • max_list_messages: Maximum messages to list (default: 100)

Example with Connection String:

[properties.input_ingestion]
path = "azureservicebus://orders-queue"
[properties.input_ingestion.credentials]
connection_string = "${credential:azure_service_bus_connection_string}"

Example with Managed Identity:

[properties.input_ingestion]
path = "asb://orders-topic/subscription1"
[properties.input_ingestion.credentials]
fully_qualified_namespace = "my-namespace.servicebus.windows.net"

Note: When using fully_qualified_namespace, the connector authenticates with Azure's DefaultAzureCredential (from the azure-identity package), which supports:

  • Managed Identity (when running on Azure)
  • Environment variables
  • Azure CLI credentials
  • Visual Studio Code credentials

Repository Storage

Repository storage controls where metadata (dataset definitions, flow definitions, migration history) is stored.

Storage Types

Local Storage

Stores metadata in local files (TOML/JSON files).

Configuration:

# qarion-etl.toml
dataset_storage = "local"
flow_storage = "local"
schema_storage = "local"

# Directory configuration
dataset_dir = "datasets"
flow_dir = "flows"
migration_dir = "migrations"

Complete Example:

# qarion-etl.toml
# Local storage configuration (top-level keys come before any table header)
dataset_storage = "local"
flow_storage = "local"
schema_storage = "local"
dataset_dir = "datasets"
flow_dir = "flows"
migration_dir = "migrations"

[app]
app = "Qarion ETL"
type = "project"
project_name = "my_project"

[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"

Database Storage

Stores metadata in database tables.

Configuration:

# qarion-etl.toml
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"

# Optional: namespace for metadata tables
metadata_namespace = "xt"

# Optional: separate metadata engine
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"

Complete Example:

# qarion-etl.toml
# Use database storage for all metadata (top-level keys come before any table header)
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"
metadata_namespace = "xt"

# Processing engine - in-memory for transformations
[engine]
name = "pandas_memory"

# Metadata engine - persistent storage for metadata
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"

When to Use Database Storage:

  • Team environments where metadata needs to be shared
  • Production environments requiring centralized metadata
  • When using separate metadata engine
  • Need for metadata querying and reporting

Schema History Storage

Schema history tracks the evolution of dataset schemas over time.

Local Schema History

Schema history is read from migration JSON files.

Configuration:

[schema_storage]
type = "local"
config = { migration_dir = "migrations" }

Features:

  • Migration files are source of truth
  • No database connection required
  • Version controlled
  • File-based

Use When:

  • Version-controlled projects
  • File-based workflows
  • Offline work
  • Development

Database Schema History

Schema history is stored in database tables.

Configuration:

[schema_storage]
type = "database"
config = { connection_string = "sqlite:///metadata.db", namespace = "xt" }

Features:

  • Centralized schema history
  • Database-backed
  • Multi-user support
  • Requires database connection

Use When:

  • Production environments
  • Multi-user scenarios
  • Centralized management

Complete Configuration Examples

Example 1: Simple Setup (Same Engine for Everything)

[app]
name = "my_project"
type = "data_pipeline"

# Processing Engine (also used for metadata if database storage is enabled)
[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"

# Repository Storage (Local)
[dataset_storage]
type = "local"
config = { dataset_dir = "datasets" }

[flow_storage]
type = "local"
config = { flow_dir = "flows" }

[schema_storage]
type = "local"
config = { migration_dir = "migrations" }

Example 2: Separate Processing and Metadata Engines

# Repository Storage (Database) - top-level keys come before any table header
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"

[app]
name = "my_project"
type = "data_pipeline"

# Processing Engine - for data transformations
[engine]
name = "pandas_memory"
[engine.config]
# No configuration required

# Metadata Engine - for storing metadata in database
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"

Example 3: Production Setup with Database Metadata

# Repository Storage (Database) - top-level keys come before any table header
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"

[app]
name = "production_pipeline"
type = "data_pipeline"

# Processing Engine - for large-scale data processing
[engine]
name = "duckdb"
[engine.config]
path = "data/processing.duckdb"

# Metadata Engine - for centralized metadata management
[metadata_engine]
name = "postgresql"
[metadata_engine.config]
connection_string = "postgresql://user:pass@db-server/metadata"

Best Practices

Engine Selection

  1. Development: Use SQLite for simplicity
  2. Testing: Use Pandas Memory or Polars for speed
  3. Production: Choose based on workload:
    • Analytical: Polars (fastest), DuckDB, or Pandas Local
    • Transactional: PostgreSQL (recommended) or SQLite
    • Large scale: PostgreSQL, PySpark, or SparkSQL
    • Multi-user/Team: PostgreSQL
    • Fast analytics: Polars (recommended for single-machine workloads)

Processing vs Metadata Engine Separation

Use the same engine when:

  • Simple setups where one engine is sufficient
  • Development and testing environments
  • Processing engine supports both use cases well (e.g., SQLite for both)

Use separate engines when:

  • Processing large datasets with Spark/Pandas while keeping metadata in a lightweight database
  • Using in-memory engines for processing but needing persistent metadata storage
  • Isolating processing workloads from metadata operations
  • Different engines optimized for different purposes

Example Scenarios:

  1. Spark Processing + PostgreSQL Metadata

    • Processing: Spark (for large-scale data processing)
    • Metadata: PostgreSQL (for centralized metadata management)
  2. Pandas Memory + SQLite Metadata

    • Processing: Pandas Memory (for fast in-memory operations)
    • Metadata: SQLite (for persistent metadata storage)
  3. DuckDB Processing + Same DuckDB Metadata

    • Processing: DuckDB (for analytical queries)
    • Metadata: DuckDB (same engine type, different database file; see the sketch below)
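
A sketch of the third scenario, assuming the DuckDB path option shown earlier also applies to [metadata_engine] (file names are illustrative):

dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"

# Processing engine - analytical queries
[engine]
name = "duckdb"
[engine.config]
path = "data/processing.duckdb"

# Metadata engine - same engine type, separate database file
[metadata_engine]
name = "duckdb"
[metadata_engine.config]
path = "data/metadata.duckdb"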

Storage Backend Selection

  1. Local Development: Use local storage
  2. Production: Use S3 for remote files
  3. Credentials: Store securely using credential stores (see Credential Management) or environment variables (see Configuration Guide)

Repository Storage Selection

  1. Development: Use local storage (files in git)
  2. Production: Consider database storage for centralized management
  3. Schema History: Match your workflow (local for file-based, database for centralized)

Environment Variables

Use environment variables for sensitive values and environment-specific configuration:

[engine.config]
path = "${DB_PATH:-data/qarion-etl.db}"

See Configuration Guide for details.