Engines and Storage
Understanding Qarion ETL's engine and storage layer configuration.
Overview
Qarion ETL uses a layered architecture for data processing and storage:
- Engines: Execution environments for running transformations (e.g., SQLite, DuckDB, Pandas, Polars, PostgreSQL, Spark)
- Storage Backends: File storage for input data (local filesystem, S3, FTP/SFTP, and others)
- Repository Storage: Metadata storage for definitions and history (local files, database)
Engines
Engines are the execution environments where transformations run. They provide the database or data processing capabilities.
Qarion ETL supports two types of engines:
- Processing Engine ([engine]): Used for data transformations, SQL execution, and data processing
- Metadata Engine ([metadata_engine]): Used for storing metadata (flows, datasets, migrations) in database storage
Processing Engine Configuration
The processing engine is configured in the [engine] section of config.toml:
[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"
This engine is used for:
- Executing SQL transformations
- Running data processing tasks
- Storing and querying dataset data
- All transformation operations
Metadata Engine Configuration
The metadata engine is optional and configured in the [metadata_engine] section:
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"
If not specified, the processing engine is used for metadata storage (backward compatible).
The metadata engine is used for:
- Storing flow definitions (when flow_storage = "database")
- Storing dataset definitions (when dataset_storage = "database")
- Storing migration history (when schema_storage = "database")
- All metadata-related database operations
Separating Processing and Metadata Engines
You can use different engines for processing and metadata storage. This is useful when:
- Processing on Spark/Pandas: Run transformations on Spark or Pandas while storing metadata in SQLite/PostgreSQL
- Different Performance Requirements: Use a fast in-memory engine for processing and a persistent database for metadata
- Resource Isolation: Separate processing workloads from metadata management
Example: Processing on Pandas, Metadata in SQLite
# Use database storage for metadata (top-level keys must precede table headers)
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"
# Processing engine - for data transformations
[engine]
name = "pandas_memory"
[engine.config]
# No configuration required
# Metadata engine - for storing metadata
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"
Example: Processing on PySpark, Metadata in PostgreSQL
# Processing engine - for data transformations
[engine]
name = "pyspark"
[engine.config]
master = "local[*]"
app_name = "qarion-etl"
# Metadata engine - for storing metadata
[metadata_engine]
name = "postgresql"
[metadata_engine.config]
connection_string = "postgresql://user:pass@localhost/metadata"
When to Use Separate Engines:
- ✅ Processing large datasets with Spark/Pandas while keeping metadata in a lightweight database
- ✅ Using in-memory engines for processing but needing persistent metadata storage
- ✅ Isolating processing workloads from metadata operations
- ✅ Different engines optimized for different purposes
When to Use the Same Engine:
- ✅ Simple setups where one engine is sufficient
- ✅ Development and testing environments
- ✅ When the processing engine supports both use cases well
Available Engines
SQLite Engine
SQLite is a file-based SQL database engine, ideal for local development and small to medium datasets.
Configuration:
[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"
Features:
- File-based (single database file)
- No server required
- ACID-compliant transactions
- Good for development and testing
- Limited concurrency
Use When:
- Developing locally
- Working with small to medium datasets
- Need a simple, file-based solution
- Testing transformations
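To make the trade-offs concrete, here is a minimal Python sketch of the kind of work this engine delegates to SQLite: a single-file (or in-memory) database queried with plain SQL, no server process involved. The table and values are illustrative, not part of Qarion ETL.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; a real engine config would
# point at a file path such as "data/qarion-etl.db".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(9.5,), (12.0,)])

# SQL transformations run directly against the database file -- no server.
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 21.5
```

Because everything lives in one file, concurrency is limited, which is why SQLite suits development more than multi-user production.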
Pandas In-Memory Engine
Pandas-based engine that stores data in memory. Fast but data is lost when the process ends.
Configuration:
[engine]
name = "pandas_memory"
[engine.config]
# No configuration required
Features:
- In-memory storage (very fast)
- No persistence (data lost on exit)
- Good for testing and development
- Supports DataFrame operations
Use When:
- Testing transformations quickly
- Development and prototyping
- Data doesn't need to persist
- Working with small datasets
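As a rough illustration of what an in-memory Pandas engine amounts to, the sketch below keeps datasets in a plain dict of DataFrames; everything disappears when the process exits. The names are hypothetical, not Qarion ETL internals.

```python
import pandas as pd

# Datasets live in ordinary process memory -- fast, but not persisted.
tables = {}
tables["orders"] = pd.DataFrame({"region": ["eu", "eu", "us"],
                                 "amount": [10, 20, 5]})

# A "transformation" is just a DataFrame operation producing a new table.
tables["totals"] = tables["orders"].groupby("region", as_index=False)["amount"].sum()
print(tables["totals"])
```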
Pandas Local Storage Engine
Pandas-based engine with local file persistence using Parquet format.
Configuration:
[engine]
name = "pandas_local"
[engine.config]
storage_dir = "data/pandas"
Features:
- Persists data to Parquet files
- Fast read/write operations
- Good for analytical workloads
- Supports DataFrame operations
Use When:
- Need persistence with Pandas
- Working with analytical data
- Prefer Parquet format
- Local file-based storage
DuckDB Engine
DuckDB is an in-process analytical database, optimized for analytical queries.
Configuration:
[engine]
name = "duckdb"
[engine.config]
path = "data/qarion-etl.duckdb"
Features:
- In-process analytical database
- Optimized for analytical queries
- Supports SQL and Parquet
- Fast columnar operations
Use When:
- Analytical workloads
- Need fast columnar operations
- Working with large datasets
- Complex analytical queries
PostgreSQL Engine
PostgreSQL is a production-grade relational database, ideal for production environments and multi-user scenarios.
Configuration:
[engine]
name = "postgresql"
[engine.config]
host = "localhost"
port = 5432
database = "mydb"
user = "myuser"
password = "${credential:db_password}" # Using credential store
Or using connection string:
[engine]
name = "postgresql"
[engine.config]
connection_string = "postgresql://user:password@localhost:5432/mydb"
Features:
- Production-grade relational database
- ACID-compliant transactions
- Multi-user support
- Excellent for production environments
- Supports complex SQL operations
- Requires the psycopg2 or psycopg2-binary package
Installation:
pip install psycopg2-binary
Use When:
- Production environments
- Multi-user scenarios
- Need robust transaction support
- Centralized database management
- Large-scale data processing
- Team collaboration
Configuration Options:
- host: Database host (default: localhost)
- port: Database port (default: 5432)
- database: Database name (required if not using connection_string)
- user: Database user (optional, can use credential store)
- password: Database password (optional; using the credential store is recommended)
- connection_string: Full connection string (alternative to individual parameters)
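The individual parameters combine into a standard PostgreSQL connection URL. The helper below is a hypothetical sketch of that mapping, including the documented defaults; the real engine may assemble its connection differently.

```python
def build_dsn(host="localhost", port=5432, database=None, user=None, password=None):
    """Assemble a libpq-style URL from the individual options.

    Hypothetical helper for illustration only."""
    auth = ""
    if user:
        auth = user + (f":{password}" if password else "") + "@"
    return f"postgresql://{auth}{host}:{port}/{database}"

print(build_dsn(database="mydb", user="myuser", password="secret"))
# postgresql://myuser:secret@localhost:5432/mydb
```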
Example with Credential Store:
# qarion-etl.toml
[engine]
name = "postgresql"
[engine.config]
host = "db.example.com"
port = 5432
database = "production_db"
user = "${credential:db_user}"
password = "${credential:db_password}"
PySpark Engine
PySpark engine uses Apache Spark's DataFrame API for distributed data processing.
Configuration:
[engine]
name = "pyspark"
[engine.config]
app_name = "Qarion ETL"
master = "local[*]"
enable_hive_support = false
warehouse_dir = "spark-warehouse"
Or for Spark cluster:
[engine]
name = "pyspark"
[engine.config]
app_name = "Qarion ETL"
master = "spark://master:7077"
config = { "spark.executor.memory" = "4g", "spark.executor.cores" = "2" }
Features:
- Distributed data processing
- Spark DataFrame API
- Supports large-scale data processing
- Can run locally or on Spark cluster
- Supports Hive (optional)
- Requires the pyspark package
Installation:
pip install pyspark
Configuration Options:
- app_name: Application name for Spark (default: "Qarion ETL")
- master: Spark master URL (default: "local[*]")
  - "local[*]": run locally using all available cores
  - "local[4]": run locally using 4 cores
  - "spark://host:port": connect to a Spark cluster
  - "yarn": run on a YARN cluster
- enable_hive_support: Enable Hive support (default: false)
- warehouse_dir: Spark warehouse directory (optional)
- config: Additional Spark configuration options (dict)
Use When:
- Processing large datasets that don't fit in memory
- Need distributed processing capabilities
- Working with big data workloads
- Running on Spark clusters
- Need Spark DataFrame operations
Example: Local Development
[engine]
name = "pyspark"
[engine.config]
app_name = "MyProject"
master = "local[*]"
Example: Spark Cluster
[engine]
name = "pyspark"
[engine.config]
app_name = "ProductionPipeline"
master = "spark://spark-master:7077"
config = { "spark.executor.memory" = "8g", "spark.executor.cores" = "4", "spark.sql.shuffle.partitions" = "200" }
Polars Engine
Polars is a fast DataFrame library written in Rust with Python bindings, optimized for analytical workloads.
Configuration:
[engine]
name = "polars"
[engine.config]
storage_dir = "data/polars" # Optional: for persistence
Or in-memory only:
[engine]
name = "polars"
[engine.config]
# No storage_dir = in-memory only
Features:
- Very fast DataFrame operations (Rust-based)
- Lazy evaluation for query optimization
- Columnar data processing
- Optional persistence to Parquet files
- Supports standard SQL queries
- Memory efficient
- Requires the polars package
Installation:
pip install polars
Configuration Options:
storage_dir: Optional directory for persisting DataFrames as Parquet files. If not specified, operates in-memory only.
Use When:
- Need very fast DataFrame operations
- Working with large datasets
- Analytical workloads
- Want lazy evaluation benefits
- Need memory-efficient processing
- Prefer Rust-based performance
Example: In-Memory
[engine]
name = "polars"
[engine.config]
# In-memory only, no persistence
Example: With Persistence
[engine]
name = "polars"
[engine.config]
storage_dir = "data/polars"
Polars vs Pandas:
- Polars is significantly faster for most operations
- Polars uses lazy evaluation (can optimize queries)
- Polars is more memory efficient
- Pandas has a larger ecosystem and more features
- Polars is better for analytical workloads
- Pandas is better for data manipulation and exploration
SparkSQL Engine
SparkSQL engine is optimized for SQL-based data processing using Spark SQL.
Configuration:
[engine]
name = "sparksql"
[engine.config]
app_name = "Qarion ETL-SQL"
master = "local[*]"
enable_hive_support = true
warehouse_dir = "spark-warehouse"
Features:
- Optimized for SQL operations
- Spark SQL execution engine
- Hive support enabled by default
- SQL-first approach
- Same distributed capabilities as PySpark
- Requires the pyspark package
Installation:
pip install pyspark
Configuration Options:
- app_name: Application name for Spark (default: "Qarion ETL-SQL")
- master: Spark master URL (default: "local[*]")
- enable_hive_support: Enable Hive support (default: true)
- warehouse_dir: Spark warehouse directory (optional)
- config: Additional Spark configuration options (dict)
Use When:
- Primarily using SQL for transformations
- Need Hive SQL compatibility
- SQL-focused workflows
- Want optimized SQL execution
- Working with Hive tables
Example: SQL-Focused Workflow
[engine]
name = "sparksql"
[engine.config]
app_name = "SQLPipeline"
master = "local[*]"
enable_hive_support = true
PySpark vs SparkSQL:
| Feature | PySpark | SparkSQL |
|---|---|---|
| Primary API | DataFrame API | SQL |
| Hive Support | Optional (default: false) | Enabled (default: true) |
| Best For | DataFrame operations | SQL operations |
| Use Case | Programmatic data processing | SQL-based transformations |
Choosing an Engine
| Engine | Best For | Persistence | Performance | Complexity |
|---|---|---|---|---|
| SQLite | Development, small datasets | ✅ File-based | Good | Low |
| Pandas Memory | Testing, prototyping | ❌ In-memory only | Very Fast | Low |
| Pandas Local | Analytical workloads | ✅ Parquet files | Fast | Low |
| DuckDB | Analytical queries, large data | ✅ File-based | Very Fast | Medium |
| PostgreSQL | Production, multi-user | ✅ Server-based | Excellent | Medium |
| Polars | Fast analytical workloads | ✅ Optional (Parquet) | Excellent | Low |
| PySpark | Large-scale distributed processing | ✅ Cluster-based | Excellent | High |
| SparkSQL | SQL-based big data processing | ✅ Cluster-based | Excellent | High |
Performance Comparison:
- Fastest (single machine): Polars (Rust-based, optimized for analytics)
- Very fast: DuckDB, Pandas Memory (in-memory operations)
- Fast: Pandas Local (with Parquet persistence)
- Best at scale: PostgreSQL, PySpark, SparkSQL (distributed and large-scale workloads)
Storage Backends
Storage backends handle file storage for input data. They abstract over the local filesystem and remote systems such as S3, FTP/SFTP, PostgreSQL, Kafka, and Azure Service Bus.
Storage Backend Configuration
Storage backends are automatically detected from file paths. No explicit configuration is required in most cases.
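Conceptually, detection presumably keys off the path's URL scheme. A hedged sketch of that dispatch (the mapping table is illustrative, not Qarion ETL's actual registry):

```python
from urllib.parse import urlparse

# Illustrative scheme-to-backend table; bare paths fall back to local files.
SCHEMES = {
    "s3": "s3",
    "ftp": "ftp",
    "sftp": "sftp",
    "postgresql": "postgresql",
    "postgres": "postgresql",
    "kafka": "kafka",
    "azureservicebus": "azure_service_bus",
    "asb": "azure_service_bus",
}

def detect_backend(path: str) -> str:
    scheme = urlparse(path).scheme
    return SCHEMES.get(scheme, "local")

print(detect_backend("data/orders"))           # local
print(detect_backend("s3://my-bucket/data/"))  # s3
```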
Local Storage
Local filesystem storage backend. Used automatically for local file paths.
Features:
- File and directory operations
- Pattern matching (glob patterns)
- Recursive directory scanning
- No additional configuration needed
Example:
[properties.input_ingestion]
path = "data/orders"
pattern = "orders_*.csv"
recursive = true
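The pattern and recursive options behave much like pathlib globbing. A self-contained sketch of that behavior (the file names are invented for illustration):

```python
import tempfile
from pathlib import Path

# Set up a throwaway directory tree to glob over.
root = Path(tempfile.mkdtemp())
(root / "2024").mkdir()
(root / "orders_jan.csv").touch()
(root / "2024" / "orders_feb.csv").touch()
(root / "notes.txt").touch()

def list_files(path: Path, pattern: str, recursive: bool):
    # recursive=True scans subdirectories, mirroring the option above.
    glob = path.rglob if recursive else path.glob
    return sorted(p.name for p in glob(pattern))

print(list_files(root, "orders_*.csv", recursive=False))  # ['orders_jan.csv']
print(list_files(root, "orders_*.csv", recursive=True))   # ['orders_feb.csv', 'orders_jan.csv']
```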
S3 Storage
AWS S3 storage backend for remote file access.
Features:
- S3 path resolution (s3://bucket/path)
- File listing with patterns
- File download
- Credential management
Configuration:
[properties.input_ingestion]
path = "s3://my-bucket/data/"
pattern = "orders_*.csv"
recursive = true
[properties.input_ingestion.credentials]
aws_access_key_id = "your-access-key"
aws_secret_access_key = "your-secret-key"
region_name = "us-east-1"
S3 Path Format:
s3://bucket-name/path/to/files/
Credentials:
- Can be provided in flow configuration (not recommended for production)
- Can use credential store references: credentials = "${credential:my_aws_creds}" (recommended)
- Can use AWS environment variables
- Can use IAM roles (when running on AWS)
See Credential Management for secure credential management.
FTP Storage
FTP (File Transfer Protocol) storage backend for remote file access.
Features:
- FTP path resolution (ftp://host/path)
- File listing with patterns
- File download
- Standard FTP protocol support
Configuration:
[properties.input_ingestion]
path = "ftp://ftp.example.com/data/"
pattern = "orders_*.csv"
recursive = true
[properties.input_ingestion.credentials]
host = "ftp.example.com"
port = 21
username = "ftpuser"
password = "${credential:ftp_password}"
FTP Path Format:
ftp://hostname/path/to/files/
Installation:
FTP support uses Python's built-in ftplib (standard library, no installation required).
Configuration Options:
- host: FTP server hostname (required)
- port: FTP server port (default: 21)
- username: FTP username (optional; defaults to anonymous)
- password: FTP password (optional)
- timeout: Connection timeout in seconds (default: 30)
- passive: Use passive mode (default: true)
Example:
[properties.input_ingestion]
path = "ftp://ftp.example.com/data/"
pattern = "*.csv"
[properties.input_ingestion.credentials]
host = "ftp.example.com"
username = "user"
password = "${credential:ftp_password}"
SFTP Storage
SFTP (SSH File Transfer Protocol) storage backend for secure remote file access.
Features:
- SFTP path resolution (sftp://host/path)
- Secure file transfer over SSH
- File listing with patterns
- File download
- Key-based or password authentication
Configuration:
[properties.input_ingestion]
path = "sftp://sftp.example.com/data/"
pattern = "orders_*.csv"
recursive = true
[properties.input_ingestion.credentials]
host = "sftp.example.com"
port = 22
username = "sftpuser"
password = "${credential:sftp_password}"
Or with key-based authentication:
[properties.input_ingestion]
path = "sftp://sftp.example.com/data/"
[properties.input_ingestion.credentials]
host = "sftp.example.com"
username = "sftpuser"
key_filename = "/path/to/private_key"
SFTP Path Format:
sftp://hostname/path/to/files/
Installation:
pip install paramiko
Configuration Options:
- host: SFTP server hostname (required)
- port: SFTP server port (default: 22)
- username: SFTP username (required)
- password: SFTP password (optional, if using key authentication)
- key_filename: Path to private key file (optional)
- key_data: Private key data as string (optional)
- timeout: Connection timeout in seconds (default: 30)
Example:
[properties.input_ingestion]
path = "sftp://sftp.example.com/data/"
pattern = "*.csv"
[properties.input_ingestion.credentials]
host = "sftp.example.com"
username = "user"
key_filename = "~/.ssh/id_rsa"
PostgreSQL Storage
PostgreSQL connector for file/blob storage using PostgreSQL database.
Features:
- PostgreSQL path resolution (postgresql://path/to/file)
- File storage in PostgreSQL (using bytea or large objects)
- File listing with patterns
- File download
- Database-backed file storage
Configuration:
[properties.input_ingestion]
path = "postgresql://data/files/"
pattern = "*.csv"
[properties.input_ingestion.credentials]
connection_string = "postgresql://user:password@host:5432/database"
storage_table = "_file_storage"
storage_schema = "public"
Or using individual parameters:
[properties.input_ingestion.credentials]
host = "db.example.com"
port = 5432
database = "mydb"
user = "myuser"
password = "${credential:db_password}"
storage_table = "_file_storage"
storage_schema = "public"
PostgreSQL Path Format:
postgresql://path/to/file.csv
postgres://path/to/file.csv
Installation:
pip install psycopg2-binary
Configuration Options:
- connection_string: Full PostgreSQL connection string (alternative to individual parameters)
- host: Database host (required if not using connection_string)
- port: Database port (default: 5432)
- database: Database name (required if not using connection_string)
- user: Database user
- password: Database password
- storage_table: Table name for file storage (default: "_file_storage")
- storage_schema: Schema name for storage table (default: "public")
Example:
[properties.input_ingestion]
path = "postgresql://data/files/"
pattern = "*.csv"
[properties.input_ingestion.credentials]
host = "db.example.com"
database = "mydb"
user = "${credential:db_user}"
password = "${credential:db_password}"
Kafka Storage
Apache Kafka connector for message/file operations using Kafka topics.
Features:
- Kafka path resolution (kafka://topic-name)
- Message listing (treats messages as files)
- Message reading
- Topic-based file operations
Configuration:
[properties.input_ingestion]
path = "kafka://my-topic"
[properties.input_ingestion.credentials]
bootstrap_servers = "localhost:9092"
auto_offset_reset = "earliest"
Kafka Path Format:
kafka://topic-name
kafka://topic-name/partition/offset
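The two path forms can be unpacked as shown in this hypothetical parser (Qarion ETL's own parsing may differ):

```python
def parse_kafka_path(path: str):
    """Split kafka://topic[/partition/offset] into its parts.

    Hypothetical parser mirroring the documented path formats."""
    assert path.startswith("kafka://"), "not a kafka path"
    parts = path[len("kafka://"):].split("/")
    topic = parts[0]
    partition = int(parts[1]) if len(parts) > 1 else None
    offset = int(parts[2]) if len(parts) > 2 else None
    return topic, partition, offset

print(parse_kafka_path("kafka://orders-topic"))       # ('orders-topic', None, None)
print(parse_kafka_path("kafka://orders-topic/0/42"))  # ('orders-topic', 0, 42)
```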
Installation:
pip install kafka-python
Configuration Options:
- bootstrap_servers: Kafka broker addresses (required; string or list)
- auto_offset_reset: Offset reset policy (default: "earliest")
- enable_auto_commit: Enable auto-commit (default: true)
- consumer_timeout_ms: Consumer timeout in milliseconds (default: 1000)
- max_list_messages: Maximum messages to list (default: 100)
- security_protocol: Security protocol (e.g., "SASL_SSL")
- sasl_mechanism: SASL mechanism (e.g., "PLAIN")
- sasl_plain_username: SASL username
- sasl_plain_password: SASL password
Example:
[properties.input_ingestion]
path = "kafka://orders-topic"
[properties.input_ingestion.credentials]
bootstrap_servers = ["kafka1:9092", "kafka2:9092"]
auto_offset_reset = "earliest"
Example with SASL:
[properties.input_ingestion.credentials]
bootstrap_servers = "kafka.example.com:9092"
security_protocol = "SASL_SSL"
sasl_mechanism = "PLAIN"
sasl_plain_username = "${credential:kafka_user}"
sasl_plain_password = "${credential:kafka_password}"
Azure Service Bus Storage
Azure Service Bus connector for message/file operations using queues and topics.
Features:
- Azure Service Bus path resolution (azureservicebus://queue-name or asb://topic-name/subscription)
- Message listing (treats messages as files)
- Message reading
- Queue and topic support
Configuration:
[properties.input_ingestion]
path = "azureservicebus://my-queue"
[properties.input_ingestion.credentials]
connection_string = "${credential:azure_service_bus_connection_string}"
Or using managed identity:
[properties.input_ingestion.credentials]
fully_qualified_namespace = "my-namespace.servicebus.windows.net"
Azure Service Bus Path Format:
azureservicebus://queue-name
asb://queue-name
azureservicebus://topic-name/subscription-name
asb://topic-name/subscription-name
Installation:
pip install azure-servicebus azure-identity
Configuration Options:
- connection_string: Azure Service Bus connection string (required if not using fully_qualified_namespace)
- fully_qualified_namespace: Fully qualified namespace (required if not using connection_string)
- max_list_messages: Maximum messages to list (default: 100)
Example with Connection String:
[properties.input_ingestion]
path = "azureservicebus://orders-queue"
[properties.input_ingestion.credentials]
connection_string = "${credential:azure_service_bus_connection_string}"
Example with Managed Identity:
[properties.input_ingestion]
path = "asb://orders-topic/subscription1"
[properties.input_ingestion.credentials]
fully_qualified_namespace = "my-namespace.servicebus.windows.net"
Note: When using fully_qualified_namespace, the connector uses Azure's DefaultAzureCredential (from the azure-identity package), which supports:
- Managed Identity (when running on Azure)
- Environment variables
- Azure CLI credentials
- Visual Studio Code credentials
Repository Storage
Repository storage controls where metadata (dataset definitions, flow definitions, migration history) is stored.
Storage Types
Local Storage
Stores metadata in local files (TOML/JSON files).
Configuration:
# qarion-etl.toml
dataset_storage = "local"
flow_storage = "local"
schema_storage = "local"
# Directory configuration
dataset_dir = "datasets"
flow_dir = "flows"
migration_dir = "migrations"
Complete Example:
# qarion-etl.toml
# Local storage configuration (top-level keys must precede table headers)
dataset_storage = "local"
flow_storage = "local"
schema_storage = "local"
dataset_dir = "datasets"
flow_dir = "flows"
migration_dir = "migrations"
[app]
app = "Qarion ETL"
type = "project"
project_name = "my_project"
[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"
Database Storage
Stores metadata in database tables.
Configuration:
# qarion-etl.toml
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"
# Optional: namespace for metadata tables
metadata_namespace = "xt"
# Optional: separate metadata engine
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"
Complete Example:
# qarion-etl.toml
# Use database storage for all metadata
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"
metadata_namespace = "xt"
# Processing engine - in-memory for transformations
[engine]
name = "pandas_memory"
# Metadata engine - persistent storage for metadata
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"
When to Use Database Storage:
- Team environments where metadata needs to be shared
- Production environments requiring centralized metadata
- When using separate metadata engine
- Need for metadata querying and reporting
Schema History Storage
Schema history tracks the evolution of dataset schemas over time.
Local Schema History
Schema history is read from migration JSON files.
Configuration:
[schema_storage]
type = "local"
config = { migration_dir = "migrations" }
Features:
- Migration files are source of truth
- No database connection required
- Version controlled
- File-based
Use When:
- Version-controlled projects
- File-based workflows
- Offline work
- Development
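To illustrate the idea, the sketch below replays JSON migration files in filename order to reconstruct a schema. The file layout and field names here are hypothetical; Qarion ETL's actual migration format may differ.

```python
import json
import tempfile
from pathlib import Path

# Each migration is a JSON file; sorted filenames define the history order.
migration_dir = Path(tempfile.mkdtemp())
(migration_dir / "0001_create_orders.json").write_text(
    json.dumps({"dataset": "orders", "add_columns": ["id", "amount"]}))
(migration_dir / "0002_add_region.json").write_text(
    json.dumps({"dataset": "orders", "add_columns": ["region"]}))

# Replay migrations to derive the current column list.
columns = []
for path in sorted(migration_dir.glob("*.json")):
    columns.extend(json.loads(path.read_text())["add_columns"])

print(columns)  # ['id', 'amount', 'region']
```

Because the files themselves are the source of truth, this workflow needs no database connection and version-controls cleanly.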
Database Schema History
Schema history is stored in database tables.
Configuration:
[schema_storage]
type = "database"
[schema_storage.config]
connection_string = "sqlite:///metadata.db"
namespace = "xt"
Features:
- Centralized schema history
- Database-backed
- Multi-user support
- Requires database connection
Use When:
- Production environments
- Multi-user scenarios
- Centralized management
Complete Configuration Examples
Example 1: Simple Setup (Same Engine for Everything)
[app]
name = "my_project"
type = "data_pipeline"
# Processing Engine (also used for metadata if database storage is enabled)
[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"
# Repository Storage (Local)
[dataset_storage]
type = "local"
config = { dataset_dir = "datasets" }
[flow_storage]
type = "local"
config = { flow_dir = "flows" }
[schema_storage]
type = "local"
config = { migration_dir = "migrations" }
Example 2: Separate Processing and Metadata Engines
# Repository Storage (Database) - top-level keys must precede table headers
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"
[app]
name = "my_project"
type = "data_pipeline"
# Processing Engine - for data transformations
[engine]
name = "pandas_memory"
[engine.config]
# No configuration required
# Metadata Engine - for storing metadata in database
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"
Example 3: Production Setup with Database Metadata
# Repository Storage (Database) - top-level keys must precede table headers
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"
[app]
name = "production_pipeline"
type = "data_pipeline"
# Processing Engine - for large-scale data processing
[engine]
name = "duckdb"
[engine.config]
path = "data/processing.duckdb"
# Metadata Engine - for centralized metadata management
[metadata_engine]
name = "postgresql"
[metadata_engine.config]
connection_string = "postgresql://user:pass@db-server/metadata"
Best Practices
Engine Selection
- Development: Use SQLite for simplicity
- Testing: Use Pandas Memory or Polars for speed
- Production: Choose based on workload:
- Analytical: Polars (fastest), DuckDB, or Pandas Local
- Transactional: PostgreSQL (recommended) or SQLite
- Large scale: PostgreSQL, PySpark, or SparkSQL
- Multi-user/Team: PostgreSQL
- Fast analytics: Polars (recommended for single-machine workloads)
Processing vs Metadata Engine Separation
Use the same engine when:
- Simple setups where one engine is sufficient
- Development and testing environments
- Processing engine supports both use cases well (e.g., SQLite for both)
Use separate engines when:
- Processing large datasets with Spark/Pandas while keeping metadata in a lightweight database
- Using in-memory engines for processing but needing persistent metadata storage
- Isolating processing workloads from metadata operations
- Different engines optimized for different purposes
Example Scenarios:
- Spark Processing + PostgreSQL Metadata
  - Processing: Spark (for large-scale data processing)
  - Metadata: PostgreSQL (for centralized metadata management)
- Pandas Memory + SQLite Metadata
  - Processing: Pandas Memory (for fast in-memory operations)
  - Metadata: SQLite (for persistent metadata storage)
- DuckDB Processing + DuckDB Metadata
  - Processing: DuckDB (for analytical queries)
  - Metadata: DuckDB (same engine, different database file)
Storage Backend Selection
- Local Development: Use local storage
- Production: Use S3 for remote files
- Credentials: Store securely using credential stores (see Credential Management) or environment variables (see Configuration Guide)
Repository Storage Selection
- Development: Use local storage (files in git)
- Production: Consider database storage for centralized management
- Schema History: Match your workflow (local for file-based, database for centralized)
Environment Variables
Use environment variables for sensitive values and environment-specific configuration:
[engine.config]
path = "${DB_PATH:-data/qarion-etl.db}"
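The ${VAR:-default} form follows shell parameter-expansion conventions: use the environment variable if set, otherwise fall back to the default. A small sketch of that behavior (Qarion ETL's own resolver may differ in edge cases):

```python
import os
import re

def expand(value: str, env=os.environ) -> str:
    """Expand shell-style "${VAR:-default}" references in a string."""
    def repl(m):
        name, default = m.group(1), m.group(2) or ""
        return env.get(name, default)
    return re.sub(r"\$\{(\w+)(?::-([^}]*))?\}", repl, value)

print(expand("${DB_PATH:-data/qarion-etl.db}", env={}))
# data/qarion-etl.db
print(expand("${DB_PATH:-data/qarion-etl.db}", env={"DB_PATH": "/tmp/x.db"}))
# /tmp/x.db
```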
See Configuration Guide for details.