Qarion ETL Architecture

Overview

Qarion ETL is built on a plugin-based, engine-agnostic architecture that separates concerns into distinct layers:

  1. Configuration Layer: Flow and dataset definitions
  2. Orchestration Layer: DAG generation and execution planning
  3. Transformation Layer: Engine-agnostic transformation instructions
  4. Execution Layer: Engine-specific executors
  5. Storage Layer: Database and file system engines

Core Principles

1. Plugin Architecture

All flow-specific logic is encapsulated in plugins. This allows:

  • Easy addition of new flow types
  • Isolation of flow-specific concerns
  • Clear extension points

2. Engine Agnosticism

Transformations are described using engine-agnostic instructions that are translated to engine-specific code by executors.
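A minimal sketch of this split, with simplified stand-in classes (the real `TransformationInstruction` and executors carry more fields than shown here): the instruction describes *what* to do, and each executor decides *how* for its engine.

```python
from dataclasses import dataclass, field

# Simplified, illustrative shapes -- not the actual Qarion ETL classes.

@dataclass
class TransformationInstruction:
    source: str                                 # input dataset name
    target: str                                 # output dataset name
    rules: list = field(default_factory=list)   # e.g. [("rename", "a", "b")]

class SQLExecutor:
    def execute(self, instr: TransformationInstruction) -> str:
        # Translate the instruction into a SQL statement.
        cols = ", ".join(f"{old} AS {new}" for op, old, new in instr.rules)
        return f"CREATE TABLE {instr.target} AS SELECT {cols} FROM {instr.source}"

class DataFrameExecutor:
    def execute(self, instr: TransformationInstruction, frames: dict) -> dict:
        # Translate the same instruction into a column-wise operation
        # (dict-of-lists here to keep the sketch dependency-free).
        frame = frames[instr.source]
        mapping = {old: new for op, old, new in instr.rules if op == "rename"}
        return {mapping.get(col, col): values for col, values in frame.items()}

instr = TransformationInstruction("landing", "clean", [("rename", "a", "b")])
sql = SQLExecutor().execute(instr)
frame = DataFrameExecutor().execute(instr, {"landing": {"a": [1, 2]}})
```

The same instruction object drives both executors; neither the instruction nor its producer knows which engine will run it.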

3. Repository Pattern

Data access is abstracted through repositories, supporting multiple storage backends.

4. Separation of Concerns

  • What (transformations) vs How (execution)
  • Configuration vs Implementation
  • Flow Logic vs Engine Logic

Component Architecture

Flow System

Flow Definition (TOML)
        ↓
FlowPlugin (flow-specific logic)
        ↓
FlowDAG (execution plan)
        ↓
Code Generation / Execution

Key Components:

  • FlowPlugin: Abstract interface for flow types
  • FlowDAG: Execution graph with nodes and edges
  • DAGNode: Represents a single step (ingestion or transformation)

Transformation System

TransformationInstruction (engine-agnostic)
        ↓
TransformationExecutor (engine-specific)
        ↓
Engine (SQLite, Pandas, etc.)

Key Components:

  • TransformationInstruction: Describes what to transform
  • TransformationRule: Individual transformation rules
  • TransformationExecutor: Executes instructions on specific engines

Engine System

BaseEngine (abstract)
├── SQLiteEngine
├── PandasInMemoryEngine
└── PandasLocalStorageEngine

BaseDialectAdapter (abstract)
├── SQLiteDialectAdapter
└── PandasDialectAdapter

Key Components:

  • BaseEngine: Abstract engine interface
  • BaseDialectAdapter: Converts commands to engine-specific syntax
  • EngineLoader: Factory for creating engines from configuration
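To make the dialect-adapter role concrete, here is a toy sketch. The `limit_clause` method is an assumption for illustration, not the real `BaseDialectAdapter` API; the point is that callers never branch on engine type.

```python
from abc import ABC, abstractmethod

# Illustrative only -- the method names below are assumptions.

class BaseDialectAdapter(ABC):
    @abstractmethod
    def limit_clause(self, n: int) -> str: ...

class SQLiteDialectAdapter(BaseDialectAdapter):
    def limit_clause(self, n: int) -> str:
        return f"LIMIT {n}"

class PandasDialectAdapter(BaseDialectAdapter):
    def limit_clause(self, n: int) -> str:
        # Pandas has no SQL dialect; its "syntax" is a method call.
        return f".head({n})"

def preview_query(adapter: BaseDialectAdapter, table: str) -> str:
    # The caller is engine-agnostic; the adapter supplies the syntax.
    return f"SELECT * FROM {table} {adapter.limit_clause(10)}"
```

Swapping `SQLiteDialectAdapter` for `PandasDialectAdapter` changes the emitted syntax without touching `preview_query`.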

File Loading System

FileLoaderFactory
└── Loader Plugins (loaders/plugins/)
    ├── CSVLoaderPlugin (contains CSVFileLoader)
    ├── JSONLoaderPlugin (contains JSONFileLoader)
    └── ParquetLoaderPlugin (contains ParquetFileLoader)

Key Components:

  • BaseFileLoader: Abstract loader interface
  • LoaderPlugin: Plugin interface for loaders
  • FileLoaderFactory: Factory for creating loaders (uses plugin system)
  • Format-specific loaders implemented as plugins (loader + plugin in same file)
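A minimal sketch of the loader registration pattern, assuming simplified signatures (the real factory resolves loaders through `LoaderPlugin` classes rather than a bare extension map):

```python
from abc import ABC, abstractmethod

# Hypothetical, simplified version of the plugin-backed factory.

class BaseFileLoader(ABC):
    @abstractmethod
    def load(self, path: str): ...

class FileLoaderFactory:
    _loaders: dict = {}

    @classmethod
    def register(cls, extension: str, loader_cls: type) -> None:
        cls._loaders[extension] = loader_cls

    @classmethod
    def create(cls, path: str) -> BaseFileLoader:
        # Pick a loader by file extension; unknown extensions raise KeyError.
        ext = path.rsplit(".", 1)[-1]
        return cls._loaders[ext]()

class CSVFileLoader(BaseFileLoader):
    def load(self, path: str):
        return f"rows from {path}"     # stand-in for real CSV parsing

FileLoaderFactory.register("csv", CSVFileLoader)
loader = FileLoaderFactory.create("data/input.csv")
```

Adding a new format is one `register` call; no factory code changes.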

File Export System

FileExporterFactory
└── Exporter Plugins (exporters/plugins/)
    ├── CSVExporterPlugin (contains CSVFileExporter)
    ├── JSONExporterPlugin (contains JSONFileExporter)
    └── ParquetExporterPlugin (contains ParquetFileExporter)

Key Components:

  • BaseFileExporter: Abstract exporter interface
  • ExporterPlugin: Plugin interface for exporters
  • FileExporterFactory: Factory for creating exporters (uses plugin system)
  • ExportExecutor: Orchestrates export operations
  • Format-specific exporters implemented as plugins (exporter + plugin in same file)

Credential Store System

CredentialStoreFactory
└── Credential Store Implementations
    ├── DatabaseCredentialStore (stores in database table)
    ├── LocalKeystoreCredentialStore (stores in encrypted file)
    └── AWSSSMCredentialStore (uses AWS SSM Parameter Store for secure credential storage)

Key Components:

  • CredentialStore: Abstract base class for credential stores
  • CredentialDefinition: Metadata about credentials
  • CredentialStoreFactory: Factory for creating credential stores
  • CredentialStoreRegistry: Registry for credential store types
  • Store implementations with encryption support
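The store interface can be sketched as follows. The method names match the extension-point list later on this page; the in-memory implementation is a toy stand-in for the real backends (and for tests).

```python
from abc import ABC, abstractmethod

# Illustrative subset of the CredentialStore interface.

class CredentialStore(ABC):
    @abstractmethod
    def store_credential(self, name: str, value: str) -> None: ...
    @abstractmethod
    def get_credential(self, name: str) -> str: ...
    @abstractmethod
    def credential_exists(self, name: str) -> bool: ...

class InMemoryCredentialStore(CredentialStore):
    """Toy backend: no encryption, process-local only."""
    def __init__(self):
        self._secrets = {}
    def store_credential(self, name, value):
        self._secrets[name] = value
    def get_credential(self, name):
        return self._secrets[name]
    def credential_exists(self, name):
        return name in self._secrets
```

Consumers depend only on `CredentialStore`, so swapping the database, keystore, or SSM backend never touches calling code.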

Data Flow

Typical Execution Flow

  1. Flow Definition → Loaded from TOML
  2. Dataset Generation → Flow plugin generates datasets
  3. DAG Generation → Flow plugin generates execution DAG
  4. Code Generation → Generate SQL/Python/DBT/Airflow code
  5. Execution → Execute transformations using appropriate executor

Transformation Execution

  1. Flow Plugin → Generates TransformationInstruction
  2. TransformationExecutor → Translates instruction to engine-specific code
  3. Engine → Executes the code
  4. Result → Returns execution statistics

Extension Points

Adding a New Flow Type

  1. Create plugin class inheriting FlowPlugin
  2. Implement required methods:
    • generate_datasets()
    • generate_dag()
    • generate_transformation_instructions()
  3. Register plugin with register_flow_plugin()

Adding a New Engine

  1. Create engine class inheriting BaseEngine
  2. Implement required methods:
    • connect()
    • execute_query()
    • get_dataframe()
  3. Create dialect adapter inheriting BaseDialectAdapter
  4. Register with EngineLoader
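A runnable sketch of the three required engine methods, using an in-memory SQLite database. `BaseEngine` here is a stand-in for the real abstract class, and `get_dataframe` returns plain rows rather than a DataFrame to stay dependency-free.

```python
import sqlite3

# Simplified stand-in for the real BaseEngine interface.

class BaseEngine:
    def connect(self): raise NotImplementedError
    def execute_query(self, sql): raise NotImplementedError
    def get_dataframe(self, table): raise NotImplementedError

class DemoSQLiteEngine(BaseEngine):
    def connect(self):
        self._conn = sqlite3.connect(":memory:")
    def execute_query(self, sql):
        cur = self._conn.execute(sql)
        self._conn.commit()
        return cur
    def get_dataframe(self, table):
        # Real engines return a DataFrame; rows suffice for the sketch.
        return self.execute_query(f"SELECT * FROM {table}").fetchall()

engine = DemoSQLiteEngine()
engine.connect()
engine.execute_query("CREATE TABLE t (x INTEGER)")
engine.execute_query("INSERT INTO t VALUES (1), (2)")
```

The dialect adapter and `EngineLoader` registration (steps 3 and 4) then plug this class into the rest of the system.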

Adding a New Executor

  1. Create executor class inheriting BaseTransformationExecutor
  2. Implement execute_transformation_instruction()
  3. Update TransformationService to use new executor

Adding a New File Loader

  1. Create loader class inheriting BaseFileLoader
  2. Implement load() method
  3. Register with FileLoaderFactory

Adding a New File Exporter

  1. Create exporter class inheriting BaseFileExporter
  2. Implement export() method
  3. Create exporter plugin class inheriting ExporterPlugin
  4. Register with ExporterPluginRegistry
  5. See Export System for detailed guide

Adding a New Credential Store

  1. Create store class inheriting CredentialStore
  2. Implement required methods:
    • store_credential()
    • get_credential()
    • delete_credential()
    • list_credentials()
    • credential_exists()
  3. Register with CredentialStoreRegistry
  4. See Credential Store Architecture for detailed guide

Design Patterns

Qarion ETL uses several design patterns to achieve flexibility, maintainability, and extensibility. Each pattern addresses specific architectural challenges.

Strategy Pattern

Purpose: Encapsulate different transformation approaches and execution strategies.

Implementation:

  • Transformation Strategies: Different transformation approaches (landing→change_feed, change_feed→delta_transaction)
  • Execution Strategies: Different execution engines (SQL, Pandas, PySpark)

Why This Pattern?

  • Allows switching between different algorithms at runtime
  • Makes it easy to add new transformation types without modifying existing code
  • Separates the algorithm (strategy) from the context (transformation service)

Example: LandingToChangeFeedStrategy and ChangeFeedToDeltaTransactionStrategy can be swapped without changing the transformation service.
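A toy rendering of that swap, using the strategy names above but with string outputs standing in for the real instruction-building logic:

```python
# Toy strategies; the real ones build TransformationInstructions.

class LandingToChangeFeedStrategy:
    def build(self, table: str) -> str:
        return f"diff {table}_landing against history -> {table}_change_feed"

class ChangeFeedToDeltaTransactionStrategy:
    def build(self, table: str) -> str:
        return f"fold {table}_change_feed into {table}_delta_transaction"

class TransformationService:
    def __init__(self, strategy):
        self.strategy = strategy          # swappable at runtime, no branching
    def run(self, table: str) -> str:
        return self.strategy.build(table)

svc = TransformationService(LandingToChangeFeedStrategy())
first = svc.run("orders")
svc.strategy = ChangeFeedToDeltaTransactionStrategy()
second = svc.run("orders")
```

The service never inspects which strategy it holds; each stage of the pipeline just hands it a different one.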

Factory Pattern

Purpose: Centralize object creation logic and hide implementation details.

Implementation:

  • EngineLoader: Creates engines from configuration
  • FileLoaderFactory: Creates loaders based on file type
  • Repository Factories: Creates repositories based on storage type

Why This Pattern?

  • Decouples object creation from usage
  • Makes it easy to swap implementations (local vs database storage)
  • Centralizes creation logic, making it easier to maintain
  • Supports configuration-driven creation

Example: create_dataset_repository(storage_type='local') vs create_dataset_repository(storage_type='database') returns different implementations without the caller knowing the details.
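A sketch of what sits behind that call, with hypothetical repository classes (the real ones implement the full repository interface):

```python
# Hypothetical implementations behind create_dataset_repository().

class LocalDatasetRepository:
    backend = "local"        # file-based storage

class DatabaseDatasetRepository:
    backend = "database"     # table-based storage

def create_dataset_repository(storage_type: str):
    # All creation logic lives here; callers only name the storage type.
    repos = {
        "local": LocalDatasetRepository,
        "database": DatabaseDatasetRepository,
    }
    return repos[storage_type]()

repo = create_dataset_repository(storage_type="local")
```

Because the mapping is data, adding a backend means one new entry, not edits at every call site.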

Repository Pattern

Purpose: Abstract data access and support multiple storage backends.

Implementation:

  • HistoryRepository: Schema history storage
  • DatasetRepository: Dataset definition storage
  • FlowRepository: Flow definition storage

Why This Pattern?

  • Separates business logic from data access
  • Makes it easy to switch storage backends (file-based vs database)
  • Enables testing with in-memory implementations
  • Provides a consistent interface regardless of storage implementation

Example: Business logic uses repository.get_flow_by_id() without knowing if it reads from files or database.
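The in-memory variant mentioned above might look like this; `get_flow_by_id` matches the example, while `save_flow` and the caller are illustrative:

```python
# Toy in-memory repository -- useful as a test double.

class InMemoryFlowRepository:
    def __init__(self):
        self._flows = {}
    def save_flow(self, flow_id: str, definition: dict) -> None:
        self._flows[flow_id] = definition
    def get_flow_by_id(self, flow_id: str):
        return self._flows.get(flow_id)

def describe_flow(repository, flow_id: str) -> str:
    # Business logic: identical against file, database, or memory backends.
    flow = repository.get_flow_by_id(flow_id)
    return f"{flow_id}: {flow['type']}" if flow else f"{flow_id}: not found"

repo = InMemoryFlowRepository()
repo.save_flow("daily_orders", {"type": "incremental"})
```

Tests exercise `describe_flow` against the in-memory repository; production wires in a file or database one.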

Plugin Pattern

Purpose: Enable extensibility without modifying core code.

Implementation:

  • FlowPlugin: Extensible flow type system
  • FileLoader: Extensible file format support
  • EnginePlugin: Extensible engine support
  • CodeGeneratorPlugin: Extensible code generation

Why This Pattern?

  • Open/Closed Principle: Open for extension, closed for modification
  • Separation of Concerns: Each plugin encapsulates its own logic
  • Testability: Plugins can be tested independently
  • Maintainability: Changes to one plugin don't affect others

Example: Adding a new flow type (e.g., sessionization) doesn't require modifying core flow orchestration code.

Template Method Pattern

Purpose: Define the skeleton of an algorithm, letting subclasses fill in details.

Implementation:

  • BaseCodeGenerator: Defines code generation workflow, subclasses implement format-specific logic
  • BaseTransformationExecutor: Defines execution workflow, subclasses implement engine-specific logic

Why This Pattern?

  • Ensures consistent workflow across implementations
  • Reduces code duplication
  • Makes it easy to add new formats/engines

Example: All code generators follow the same workflow (iterate nodes, generate code, write files), but SQL, DBT, and Airflow generators implement format-specific details.
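The skeleton can be sketched as follows; the workflow (iterate nodes, generate code, collect files) mirrors the description above, but file writing is replaced by an in-memory dict and the hook names are assumptions:

```python
# Template method sketch: fixed workflow, format-specific hook.

class BaseCodeGenerator:
    extension = "txt"
    def generate(self, dag_nodes):
        # The template method: same loop for every subclass.
        files = {}
        for node in dag_nodes:
            files[f"{node}.{self.extension}"] = self.render_node(node)
        return files
    def render_node(self, node):          # hook for subclasses
        raise NotImplementedError

class SQLCodeGenerator(BaseCodeGenerator):
    extension = "sql"
    def render_node(self, node):
        return f"-- step: {node}"

class DBTCodeGenerator(BaseCodeGenerator):
    extension = "sql"
    def render_node(self, node):
        return f"{{{{ config(alias='{node}') }}}}"

files = SQLCodeGenerator().generate(["ingest", "transform"])
```

Subclasses override only `render_node`; the iteration and file naming stay in the base class, so every generator behaves consistently.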

Observer Pattern

Purpose: Notify multiple objects about state changes.

Implementation:

  • Plugin registry notifies about plugin registration
  • Event system for flow execution (future enhancement)

Why This Pattern?

  • Decouples event producers from consumers
  • Enables multiple listeners for the same event
  • Supports event-driven architectures

Builder Pattern

Purpose: Construct complex objects step by step.

Implementation:

  • FlowDAG: Built incrementally with nodes and edges
  • TransformationInstruction: Built with rules and operations

Why This Pattern?

  • Makes complex object construction more readable
  • Allows step-by-step construction
  • Supports validation at each step

Example: Building a FlowDAG by adding nodes and edges incrementally, with validation at each step.
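A minimal builder with per-step validation, assuming a simplified `FlowDAG` (the real class tracks richer node metadata than bare names):

```python
# Toy FlowDAG builder with validation at each step.

class FlowDAG:
    def __init__(self):
        self.nodes = []
        self.edges = []
    def add_node(self, name: str) -> "FlowDAG":
        if name in self.nodes:
            raise ValueError(f"duplicate node: {name}")
        self.nodes.append(name)
        return self                      # enables step-by-step chaining
    def add_edge(self, src: str, dst: str) -> "FlowDAG":
        if src not in self.nodes or dst not in self.nodes:
            raise ValueError("edge references unknown node")
        self.edges.append((src, dst))
        return self

dag = (FlowDAG()
       .add_node("ingest")
       .add_node("transform")
       .add_edge("ingest", "transform"))
```

Invalid steps fail immediately at construction time, rather than surfacing later during execution.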

Pattern Selection Rationale

Why Not Other Patterns?

Singleton Pattern: Not used because:

  • Makes testing difficult (shared state)
  • Violates dependency injection principles
  • Hard to mock in tests

Adapter Pattern: Used implicitly in the dialect adapters, which convert SQL to engine-specific syntax; this serves as an abstraction layer rather than a formally applied pattern.

Decorator Pattern: Not needed because:

  • Plugin pattern provides better extensibility
  • Composition is preferred over decoration

Pattern Interactions

Patterns work together to create a cohesive architecture:

  1. Factory + Repository: Factories create repositories, repositories use plugin pattern
  2. Plugin + Strategy: Plugins implement strategies, strategies use executors
  3. Template Method + Plugin: Base classes define workflow, plugins implement details
  4. Builder + Factory: Builders construct complex objects, factories create builders

Benefits of This Architecture

  1. Extensibility: Easy to add new flow types, engines, loaders
  2. Testability: Each component can be tested in isolation
  3. Maintainability: Clear separation of concerns
  4. Flexibility: Can swap implementations without code changes
  5. Scalability: Patterns support growth without architectural changes

Scalability Considerations

Current Strengths

  1. Plugin Architecture: Easy to add new flow types
  2. Engine Abstraction: Support for multiple execution engines
  3. Repository Pattern: Flexible storage backends
  4. Modular Structure: Clear separation of concerns

Future Enhancements

  1. Code Generator Plugins: Support for DBT, Airflow code generation
  2. Additional Engines: PySpark, DuckDB, etc.
  3. Distributed Execution: Support for distributed processing
  4. Caching Layer: Performance optimization

Testing Strategy

Unit Tests

  • Individual components tested in isolation
  • Mock dependencies where appropriate
  • Test edge cases and error handling

Integration Tests

  • End-to-end workflow tests
  • Real engine implementations
  • File system operations

Test Coverage

  • Core transformation logic: 90%+
  • Flow plugins: 85%+
  • Executors: 85%+
  • Engines: 80%+

Performance Considerations

Optimization Opportunities

  1. Batch Processing: Process multiple records together
  2. Lazy Evaluation: Defer computation until needed
  3. Caching: Cache frequently accessed data
  4. Parallel Execution: Execute independent steps in parallel

Current Limitations

  1. Single-threaded Execution: No parallel processing yet
  2. In-memory Processing: Limited by available memory
  3. No Query Optimization: Relies on engine's optimization

Security Considerations

  1. Input Validation: All inputs validated
  2. SQL Injection Prevention: Parameterized queries
  3. File System Access: Validated file paths
  4. Configuration Validation: Schema validation for configs

Error Handling

Error Types

  1. Configuration Errors: Invalid configuration
  2. Validation Errors: Invalid data or schema
  3. Execution Errors: Runtime errors during execution
  4. File System Errors: File access issues

Error Recovery

  1. Transaction Rollback: Automatic rollback on errors
  2. Error Logging: Comprehensive error logging
  3. Graceful Degradation: Continue processing when possible

Future Roadmap

  1. DBT Code Generation: Generate DBT models
  2. Airflow Code Generation: Generate Airflow DAGs
  3. Additional Engines: PySpark, DuckDB support
  4. Distributed Execution: Support for distributed processing
  5. Real-time Processing: Stream processing support
  6. Monitoring & Observability: Built-in monitoring