Configuration Guide

How to configure Qarion ETL for your needs.

Configuration File

Qarion ETL reads project configuration from a file named qarion-etl.toml by default. The file is created when you initialize a project.

Basic Configuration

Minimal Configuration:

# qarion-etl.toml
# Top-level keys must appear before any table header in TOML.
dataset_dir = "datasets"
migration_dir = "migrations"
flow_dir = "flows"
quality_dir = "data_quality"
schema_storage = "local"
dataset_storage = "local"
flow_storage = "local"
quality_storage = "local"

[app]
app = "Qarion ETL"
type = "project"
project_name = "my_project"

[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"

# Quality Store Configuration
[quality_store]
enabled = true
auto_calculate_metrics = true
results_table_name = "_quality_results"
metrics_table_name = "_quality_metrics"

Complete Configuration Example

Full Configuration with All Options:

# qarion-etl.toml
# Storage Configuration (top-level keys must appear before any table header in TOML)
dataset_storage = "local"
flow_storage = "local"
schema_storage = "local"
quality_storage = "local"
dataset_dir = "datasets"
flow_dir = "flows"
migration_dir = "migrations"
quality_dir = "data_quality"
metadata_namespace = "xt"
default_namespace = "public"

[app]
app = "Qarion ETL"
type = "project"
project_name = "my_project"
version = "1.0.0"

# Processing Engine
[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"

# Optional: Separate Metadata Engine
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"

# Quality Store Configuration
[quality_store]
enabled = true
auto_calculate_metrics = true
results_table_name = "_quality_results"
metrics_table_name = "_quality_metrics"

# Credential Store Configuration
[credential_store]
type = "local_keystore"
[credential_store.config]
keystore_path = "~/.qarion_etl/credentials.keystore"

# Credential Definitions
[[credentials]]
id = "aws_prod_creds"
name = "AWS Production Credentials"
credential_type = "aws"
description = "AWS credentials for production S3 access"

[[credentials]]
id = "db_prod_creds"
name = "Database Production Credentials"
credential_type = "database"
description = "PostgreSQL credentials for production"

Storage Configuration

Qarion ETL has multiple storage layers:

  1. Storage Backends: For input file storage (local filesystem, S3)
  2. Repository Storage: For metadata storage (datasets, flows, migrations)

See Engines and Storage for detailed information.

Repository Storage

Local Storage

Store definitions in local files (default):

dataset_dir = "datasets"
flow_dir = "flows"
migration_dir = "migrations"
dataset_storage = "local"
flow_storage = "local"
schema_storage = "local"

Database Storage

Store definitions in database:

dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"

Note: Database storage requires a configured engine and database service.

Storage Backends

Storage backends are automatically detected from file paths. For S3:

Using Inline Credentials (Not Recommended):

[properties.input_ingestion]
path = "s3://my-bucket/data/"
pattern = "orders_*.csv"
credentials = { aws_access_key_id = "your-access-key", aws_secret_access_key = "your-secret-key", region_name = "us-east-1" }

Using Credential Store (Recommended):

[properties.input_ingestion]
path = "s3://my-bucket/data/"
pattern = "orders_*.csv"
credentials = "${credential:my_aws_creds}"
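
Under the hood, path-based backend detection can be as simple as inspecting the URL scheme. A minimal sketch of the idea (`detect_backend` is a hypothetical helper for illustration, not the product's API):

```python
from urllib.parse import urlparse

def detect_backend(path: str) -> str:
    """Pick a storage backend from a path's URL scheme (illustrative sketch)."""
    scheme = urlparse(path).scheme
    if scheme == "s3":
        return "s3"
    # Anything without a recognized scheme falls back to local filesystem storage.
    return "local"

print(detect_backend("s3://my-bucket/data/"))  # s3
print(detect_backend("datasets/orders.csv"))   # local
```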

See Credential Management for detailed information on managing credentials securely.

Engine Configuration

Engines are the execution environments where transformations run. Qarion ETL supports two types of engines:

  1. Processing Engine ([engine]): Required. Used for data transformations and processing.
  2. Metadata Engine ([metadata_engine]): Optional. Used for storing metadata in database storage. Defaults to processing engine if not specified.

See Engines and Storage for detailed information.

Processing Engine

The processing engine is configured in the [engine] section:

SQLite Engine

[engine]
name = "sqlite"
[engine.config]
path = "data/qarion-etl.db"

Pandas In-Memory Engine

[engine]
name = "pandas_memory"
[engine.config]
# No configuration required

Pandas Local Storage Engine

[engine]
name = "pandas_local"
[engine.config]
storage_dir = "data/pandas"

DuckDB Engine

[engine]
name = "duckdb"
[engine.config]
path = "data/qarion-etl.duckdb"

PySpark Engine

[engine]
name = "pyspark"
[engine.config]
app_name = "Qarion ETL"
master = "local[*]"
enable_hive_support = false

SparkSQL Engine

[engine]
name = "sparksql"
[engine.config]
app_name = "Qarion ETL-SQL"
master = "local[*]"
enable_hive_support = true

Polars Engine

[engine]
name = "polars"
[engine.config]
storage_dir = "data/polars" # Optional: for persistence

Or in-memory only:

[engine]
name = "polars"
[engine.config]
# Omitting storage_dir keeps all data in memory only

Metadata Engine

The metadata engine is optional and configured in the [metadata_engine] section. If not specified, the processing engine is used for metadata storage.

Example: Separate Metadata Engine

# Use database storage (top-level keys must come before any table header)
dataset_storage = "database"
flow_storage = "database"

# Processing engine - for data transformations
[engine]
name = "pandas_memory"

# Metadata engine - for storing metadata in database
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"

When to use a separate metadata engine:

  • Using database storage for metadata
  • Want to separate processing workloads from metadata management
  • Using different engines optimized for different purposes (e.g., Spark for processing, PostgreSQL for metadata)

Schema Storage

Local Schema Storage

schema_storage = "local"
migration_dir = "migrations"

Database Schema Storage

schema_storage = "database"
metadata_namespace = "xt"

Note: Database schema storage requires a configured engine and database service.

Namespace Configuration

metadata_namespace = "xt"
default_namespace = "public"

The metadata_namespace is used as a prefix for metadata tables (e.g., xt_runs, xt_schemas). The default_namespace is the default namespace for datasets and flows.
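
For instance, with the defaults above, metadata table names resolve like this (`metadata_table` is a hypothetical helper shown only to illustrate the prefixing):

```python
def metadata_table(name: str, metadata_namespace: str = "xt") -> str:
    """Prefix a metadata table name with the configured namespace (illustrative)."""
    return f"{metadata_namespace}_{name}"

print(metadata_table("runs"))     # xt_runs
print(metadata_table("schemas"))  # xt_schemas
```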

Credential Management

Qarion ETL provides a credential store system for managing credentials securely. This is the recommended approach for storing sensitive data like passwords, API keys, and access tokens.

Key Benefits:

  • Define credentials once and reuse across configurations
  • Store credentials securely in databases, local keystores, or cloud key management services
  • Reference credentials without exposing sensitive data in configuration files
  • Support for multiple credential store backends

Quick Example:

[credential_store]
type = "local_keystore"

[[credentials]]
id = "my_aws_creds"
name = "AWS Production Credentials"
credential_type = "aws"

Then reference in configuration:

[properties.input_ingestion]
path = "s3://my-bucket/data/"
credentials = "${credential:my_aws_creds}"

See Credential Management Guide for complete documentation.

Quality Store Configuration

The quality store configuration controls how quality check results and metrics are stored and tracked.

Configuration Options

[quality_store]
enabled = true # Enable/disable automatic storage (default: true)
auto_calculate_metrics = true # Automatically calculate metrics (default: true)
results_table_name = "_quality_results" # Table name for results (default: "_quality_results")
metrics_table_name = "_quality_metrics" # Table name for metrics (default: "_quality_metrics")

Options:

  • enabled (boolean, default: true): Whether to enable automatic storage of quality check results and metrics. When disabled, results are not persisted to the database.
  • auto_calculate_metrics (boolean, default: true): Whether to automatically calculate and store aggregated metrics when storing results. Metrics include pass rate, failure count, average execution time, and total records checked.
  • results_table_name (string, default: "_quality_results"): Name of the table in the metadata engine where quality check execution results are stored.
  • metrics_table_name (string, default: "_quality_metrics"): Name of the table in the metadata engine where aggregated quality metrics are stored.
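
To make the metrics concrete, here is a rough sketch of how such aggregates could be computed from per-check results. The field names (`passed`, `execution_time`, `records_checked`) are assumptions for illustration, not the stored schema:

```python
from statistics import mean

def aggregate_metrics(results: list[dict]) -> dict:
    """Compute pass rate, failure count, average execution time, and total records (sketch)."""
    passed = sum(1 for r in results if r["passed"])
    return {
        "pass_rate": passed / len(results),
        "failure_count": len(results) - passed,
        "avg_execution_time": mean(r["execution_time"] for r in results),
        "total_records_checked": sum(r["records_checked"] for r in results),
    }

results = [
    {"passed": True, "execution_time": 0.2, "records_checked": 1000},
    {"passed": False, "execution_time": 0.4, "records_checked": 1000},
]
print(aggregate_metrics(results))
```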

Example Configuration

# Enable quality store with custom table names
[quality_store]
enabled = true
auto_calculate_metrics = true
results_table_name = "quality_check_results"
metrics_table_name = "quality_metrics"

Disabling Quality Store

To disable quality results storage:

[quality_store]
enabled = false

When disabled, quality checks will still execute, but results will not be persisted to the database. This is useful if you only need real-time validation without historical tracking.

Integration

The quality store is automatically integrated with:

  • Quality check flows
  • Quality check tasks in standard flows
  • Quality check nodes
  • Automatic quality checks after transformations

Results are automatically stored when quality checks execute, using the configured table names and settings.

For more information, see the Data Quality Guide.

Environment Variables

Qarion ETL supports using environment variables directly in configuration files. This is useful for:

  • Keeping sensitive values (passwords, API keys) out of version control
  • Using different values across environments (dev, staging, production)
  • Sharing configuration across multiple projects

Note: For production environments, consider using the Credential Store instead of environment variables for better security and management.

Environment Variable Substitution

You can use environment variables in your qarion-etl.toml file using two syntaxes:

Standard Syntax with Defaults

[engine]
name = "sqlite"
[engine.config]
path = "${DB_PATH:-data/qarion-etl.db}"

This will:

  • Use the value of DB_PATH environment variable if set
  • Fall back to data/qarion-etl.db if DB_PATH is not set

Simple Syntax

[engine]
name = "sqlite"
[engine.config]
path = "$DB_PATH"

This will:

  • Use the value of DB_PATH environment variable if set
  • Use an empty string if DB_PATH is not set (with a warning)
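
Both substitution syntaxes can be sketched with a small expansion pass. This is an illustration of the behavior described above, not Qarion ETL's actual resolver:

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}; bare $VAR is handled by a second pattern.
_BRACED = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")
_SIMPLE = re.compile(r"\$(\w+)")

def substitute(value: str, env=os.environ) -> str:
    """Expand ${VAR:-default} and $VAR references in a config string (sketch)."""
    def braced(m):
        name, default = m.group(1), m.group(2)
        return env.get(name, default if default is not None else "")
    value = _BRACED.sub(braced, value)
    # Bare $VAR falls back to an empty string when unset.
    return _SIMPLE.sub(lambda m: env.get(m.group(1), ""), value)

env = {"DB_PATH": "/srv/etl.db"}
print(substitute("${DB_PATH:-data/qarion-etl.db}", env))  # /srv/etl.db
print(substitute("${OTHER:-fallback}", env))              # fallback
print(substitute("$MISSING", env))                        # (empty string)
```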

Examples

Database Credentials

[engine]
name = "postgres"
[engine.config]
host = "${DB_HOST:-localhost}"
port = "${DB_PORT:-5432}"
database = "$DB_NAME"
user = "$DB_USER"
password = "$DB_PASSWORD"

Set environment variables:

export DB_HOST=production-db.example.com
export DB_PORT=5432
export DB_NAME=mydb
export DB_USER=myuser
export DB_PASSWORD=secretpassword

S3 Credentials

Using Environment Variables:

[properties.input_ingestion]
path = "s3://${S3_BUCKET}/data/"
credentials = { aws_access_key_id = "$AWS_ACCESS_KEY_ID", aws_secret_access_key = "$AWS_SECRET_ACCESS_KEY", region_name = "${AWS_REGION:-us-east-1}" }

Using Credential Store (Recommended):

[properties.input_ingestion]
path = "s3://my-bucket/data/"
credentials = "${credential:my_aws_creds}"

See Credential Management for setting up credential stores.

File Paths

dataset_dir = "${DATASET_DIR:-datasets}"
migration_dir = "${MIGRATION_DIR:-migrations}"
flow_dir = "${FLOW_DIR:-flows}"

Configuration File Path

You can also specify the configuration file path using an environment variable:

export XTRANSACT_CONFIG_PATH=/path/to/custom/config.toml

This takes precedence over the default qarion-etl.toml file, but can be overridden by command-line arguments.

Production Configuration Example

Production-Ready Configuration:

# qarion-etl.toml
# Storage Configuration (top-level keys must appear before any table header in TOML)
dataset_storage = "database"
flow_storage = "database"
schema_storage = "database"
dataset_dir = "datasets"
flow_dir = "flows"
migration_dir = "migrations"
metadata_namespace = "xt"
default_namespace = "public"

# Fernet key for credential encryption (auto-generated)
fernet_key = "gAAAAABh..." # Never commit this to version control

[app]
app = "Qarion ETL"
type = "project"
project_name = "production_pipeline"
version = "1.0.0"

# Processing Engine - for data transformations
[engine]
name = "duckdb"
[engine.config]
path = "data/processing.duckdb"

# Metadata Engine - for storing metadata
[metadata_engine]
name = "sqlite"
[metadata_engine.config]
path = "data/metadata.db"

# Credential Store
[credential_store]
type = "local_keystore"
[credential_store.config]
keystore_path = "~/.qarion_etl/credentials.keystore"

# Credential Definitions
[[credentials]]
id = "aws_prod_creds"
name = "AWS Production Credentials"
credential_type = "aws"
description = "AWS credentials for production S3 access"

[[credentials]]
id = "db_prod_creds"
name = "Database Production Credentials"
credential_type = "database"
description = "PostgreSQL credentials for production database"

Validation

Configuration is validated on load. Invalid configuration will raise errors with details about what's wrong.