Qarion ETL Architecture

Overview

Qarion ETL is built on a plugin-based, engine-agnostic architecture that separates concerns into distinct layers:

  1. Configuration Layer: Flow and dataset definitions
  2. Orchestration Layer: DAG generation and execution planning
  3. Transformation Layer: Engine-agnostic transformation instructions
  4. Execution Layer: Engine-specific executors
  5. Storage Layer: Database and file system engines

Core Principles

1. Plugin Architecture

All flow-specific logic is encapsulated in plugins. This allows:

  • Easy addition of new flow types
  • Isolation of flow-specific concerns
  • Clear extension points

2. Engine Agnosticism

Transformations are described using engine-agnostic instructions that are translated to engine-specific code by executors.
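A minimal sketch of this split, with simplified stand-in classes (the real `TransformationInstruction` and executors carry more fields than shown here): the instruction describes *what* to do, and each executor decides *how* for its engine.

```python
from dataclasses import dataclass, field

# Simplified, illustrative shapes -- not the actual Qarion ETL classes.

@dataclass
class TransformationInstruction:
    source: str                                 # input dataset name
    target: str                                 # output dataset name
    rules: list = field(default_factory=list)   # e.g. [("rename", "a", "b")]

class SQLExecutor:
    def execute(self, instr: TransformationInstruction) -> str:
        # Translate the instruction into a SQL statement.
        cols = ", ".join(f"{old} AS {new}" for op, old, new in instr.rules)
        return f"CREATE TABLE {instr.target} AS SELECT {cols} FROM {instr.source}"

class DataFrameExecutor:
    def execute(self, instr: TransformationInstruction, frames: dict) -> dict:
        # Translate the same instruction into a column-wise operation
        # (dict-of-lists here to keep the sketch dependency-free).
        frame = frames[instr.source]
        mapping = {old: new for op, old, new in instr.rules if op == "rename"}
        return {mapping.get(col, col): values for col, values in frame.items()}

instr = TransformationInstruction("landing", "clean", [("rename", "a", "b")])
sql = SQLExecutor().execute(instr)
frame = DataFrameExecutor().execute(instr, {"landing": {"a": [1, 2]}})
```

The same instruction object drives both executors; neither the instruction nor its producer knows which engine will run it.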

3. Repository Pattern

Data access is abstracted through repositories, supporting multiple storage backends.

4. Separation of Concerns

  • What (transformations) vs How (execution)
  • Configuration vs Implementation
  • Flow Logic vs Engine Logic

Component Architecture

Flow System

Flow Definition (TOML)
        ↓
FlowPlugin (flow-specific logic)
        ↓
FlowDAG (execution plan)
        ↓
Code Generation / Execution

Key Components:

  • FlowPlugin: Abstract interface for flow types
  • FlowDAG: Execution graph with nodes and edges
  • DAGNode: Represents a single step (ingestion or transformation)

Transformation System

TransformationInstruction (engine-agnostic)
        ↓
TransformationExecutor (engine-specific)
        ↓
Engine (SQLite, Pandas, etc.)

Key Components:

  • TransformationInstruction: Describes what to transform
  • TransformationRule: Individual transformation rules
  • TransformationExecutor: Executes instructions on specific engines

Engine System

BaseEngine (abstract)
├── SQLiteEngine
├── PandasInMemoryEngine
└── PandasLocalStorageEngine

BaseDialectAdapter (abstract)
├── SQLiteDialectAdapter
└── PandasDialectAdapter

Key Components:

  • BaseEngine: Abstract engine interface
  • BaseDialectAdapter: Converts commands to engine-specific syntax
  • EngineLoader: Factory for creating engines from configuration
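To make the dialect-adapter role concrete, here is a toy sketch. The `limit_clause` method is an assumption for illustration, not the real `BaseDialectAdapter` API; the point is that callers never branch on engine type.

```python
from abc import ABC, abstractmethod

# Illustrative only -- the method names below are assumptions.

class BaseDialectAdapter(ABC):
    @abstractmethod
    def limit_clause(self, n: int) -> str: ...

class SQLiteDialectAdapter(BaseDialectAdapter):
    def limit_clause(self, n: int) -> str:
        return f"LIMIT {n}"

class PandasDialectAdapter(BaseDialectAdapter):
    def limit_clause(self, n: int) -> str:
        # Pandas has no SQL dialect; its "syntax" is a method call.
        return f".head({n})"

def preview_query(adapter: BaseDialectAdapter, table: str) -> str:
    # The caller is engine-agnostic; the adapter supplies the syntax.
    return f"SELECT * FROM {table} {adapter.limit_clause(10)}"
```

Swapping `SQLiteDialectAdapter` for `PandasDialectAdapter` changes the emitted syntax without touching `preview_query`.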

File Loading System

FileLoaderFactory
└── Loader Plugins (loaders/plugins/)
    ├── CSVLoaderPlugin (contains CSVFileLoader)
    ├── JSONLoaderPlugin (contains JSONFileLoader)
    └── ParquetLoaderPlugin (contains ParquetFileLoader)

Key Components:

  • BaseFileLoader: Abstract loader interface
  • LoaderPlugin: Plugin interface for loaders
  • FileLoaderFactory: Factory for creating loaders (uses plugin system)
  • Format-specific loaders implemented as plugins (loader + plugin in same file)
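A minimal sketch of the loader registration pattern, assuming simplified signatures (the real factory resolves loaders through `LoaderPlugin` classes rather than a bare extension map):

```python
from abc import ABC, abstractmethod

# Hypothetical, simplified version of the plugin-backed factory.

class BaseFileLoader(ABC):
    @abstractmethod
    def load(self, path: str): ...

class FileLoaderFactory:
    _loaders: dict = {}

    @classmethod
    def register(cls, extension: str, loader_cls: type) -> None:
        cls._loaders[extension] = loader_cls

    @classmethod
    def create(cls, path: str) -> BaseFileLoader:
        # Pick a loader by file extension; unknown extensions raise KeyError.
        ext = path.rsplit(".", 1)[-1]
        return cls._loaders[ext]()

class CSVFileLoader(BaseFileLoader):
    def load(self, path: str):
        return f"rows from {path}"     # stand-in for real CSV parsing

FileLoaderFactory.register("csv", CSVFileLoader)
loader = FileLoaderFactory.create("data/input.csv")
```

Adding a new format is one `register` call; no factory code changes.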

File Export System

FileExporterFactory
└── Exporter Plugins (exporters/plugins/)
    ├── CSVExporterPlugin (contains CSVFileExporter)
    ├── JSONExporterPlugin (contains JSONFileExporter)
    └── ParquetExporterPlugin (contains ParquetFileExporter)

Key Components:

  • BaseFileExporter: Abstract exporter interface
  • ExporterPlugin: Plugin interface for exporters
  • FileExporterFactory: Factory for creating exporters (uses plugin system)
  • ExportExecutor: Orchestrates export operations
  • Format-specific exporters implemented as plugins (exporter + plugin in same file)

Credential Store System

CredentialStoreFactory
└── Credential Store Implementations
    ├── DatabaseCredentialStore (stores in database table)
    ├── LocalKeystoreCredentialStore (stores in encrypted file)
    └── AWSSSMCredentialStore (uses AWS SSM Parameter Store for secure credential storage)

Key Components:

  • CredentialStore: Abstract base class for credential stores
  • CredentialDefinition: Metadata about credentials
  • CredentialStoreFactory: Factory for creating credential stores
  • CredentialStoreRegistry: Registry for credential store types
  • Store implementations with encryption support
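The store interface can be sketched as follows. The method names match the extension-point list later on this page; the in-memory implementation is a toy stand-in for the real backends (and for tests).

```python
from abc import ABC, abstractmethod

# Illustrative subset of the CredentialStore interface.

class CredentialStore(ABC):
    @abstractmethod
    def store_credential(self, name: str, value: str) -> None: ...
    @abstractmethod
    def get_credential(self, name: str) -> str: ...
    @abstractmethod
    def credential_exists(self, name: str) -> bool: ...

class InMemoryCredentialStore(CredentialStore):
    """Toy backend: no encryption, process-local only."""
    def __init__(self):
        self._secrets = {}
    def store_credential(self, name, value):
        self._secrets[name] = value
    def get_credential(self, name):
        return self._secrets[name]
    def credential_exists(self, name):
        return name in self._secrets
```

Consumers depend only on `CredentialStore`, so swapping the database, keystore, or SSM backend never touches calling code.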

Data Flow

Typical Execution Flow

  1. Flow Definition → Loaded from TOML
  2. Dataset Generation → Flow plugin generates datasets
  3. DAG Generation → Flow plugin generates execution DAG
  4. Code Generation → Generate SQL/Python/DBT/Airflow code
  5. Execution → Execute transformations using appropriate executor

Transformation Execution

  1. Flow Plugin → Generates TransformationInstruction
  2. TransformationExecutor → Translates instruction to engine-specific code
  3. Engine → Executes the code
  4. Result → Returns execution statistics

Extension Points

Adding a New Flow Type

  1. Create plugin class inheriting FlowPlugin
  2. Implement required methods:
    • generate_datasets()
    • generate_dag()
    • generate_transformation_instructions()
  3. Register plugin with register_flow_plugin()

Adding a New Engine

  1. Create engine class inheriting BaseEngine
  2. Implement required methods:
    • connect()
    • execute_query()
    • get_dataframe()
  3. Create dialect adapter inheriting BaseDialectAdapter
  4. Register with EngineLoader
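A runnable sketch of the three required engine methods, using an in-memory SQLite database. `BaseEngine` here is a stand-in for the real abstract class, and `get_dataframe` returns plain rows rather than a DataFrame to stay dependency-free.

```python
import sqlite3

# Simplified stand-in for the real BaseEngine interface.

class BaseEngine:
    def connect(self): raise NotImplementedError
    def execute_query(self, sql): raise NotImplementedError
    def get_dataframe(self, table): raise NotImplementedError

class DemoSQLiteEngine(BaseEngine):
    def connect(self):
        self._conn = sqlite3.connect(":memory:")
    def execute_query(self, sql):
        cur = self._conn.execute(sql)
        self._conn.commit()
        return cur
    def get_dataframe(self, table):
        # Real engines return a DataFrame; rows suffice for the sketch.
        return self.execute_query(f"SELECT * FROM {table}").fetchall()

engine = DemoSQLiteEngine()
engine.connect()
engine.execute_query("CREATE TABLE t (x INTEGER)")
engine.execute_query("INSERT INTO t VALUES (1), (2)")
```

The dialect adapter and `EngineLoader` registration (steps 3 and 4) then plug this class into the rest of the system.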

Adding a New Executor

  1. Create executor class inheriting BaseTransformationExecutor
  2. Implement execute_transformation_instruction()
  3. Update TransformationService to use new executor

Adding a New File Loader

  1. Create loader class inheriting BaseFileLoader
  2. Implement load() method
  3. Register with FileLoaderFactory

Adding a New File Exporter

  1. Create exporter class inheriting BaseFileExporter
  2. Implement export() method
  3. Create exporter plugin class inheriting ExporterPlugin
  4. Register with ExporterPluginRegistry
  5. See Export System for detailed guide

Adding a New Credential Store

  1. Create store class inheriting CredentialStore
  2. Implement required methods:
    • store_credential()
    • get_credential()
    • delete_credential()
    • list_credentials()
    • credential_exists()
  3. Register with CredentialStoreRegistry
  4. See Credential Store Architecture for detailed guide

Design Patterns

Qarion ETL uses several design patterns to achieve flexibility, maintainability, and extensibility. Each pattern addresses specific architectural challenges.

Strategy Pattern

Purpose: Encapsulate different transformation approaches and execution strategies.

Implementation:

  • Transformation Strategies: Different transformation approaches (landing→change_feed, change_feed→delta_transaction)
  • Execution Strategies: Different execution engines (SQL, Pandas, PySpark)

Why This Pattern?

  • Allows switching between different algorithms at runtime
  • Makes it easy to add new transformation types without modifying existing code
  • Separates the algorithm (strategy) from the context (transformation service)

Example: LandingToChangeFeedStrategy and ChangeFeedToDeltaTransactionStrategy can be swapped without changing the transformation service.
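A toy rendering of that swap, using the strategy names above but with string outputs standing in for the real instruction-building logic:

```python
# Toy strategies; the real ones build TransformationInstructions.

class LandingToChangeFeedStrategy:
    def build(self, table: str) -> str:
        return f"diff {table}_landing against history -> {table}_change_feed"

class ChangeFeedToDeltaTransactionStrategy:
    def build(self, table: str) -> str:
        return f"fold {table}_change_feed into {table}_delta_transaction"

class TransformationService:
    def __init__(self, strategy):
        self.strategy = strategy          # swappable at runtime, no branching
    def run(self, table: str) -> str:
        return self.strategy.build(table)

svc = TransformationService(LandingToChangeFeedStrategy())
first = svc.run("orders")
svc.strategy = ChangeFeedToDeltaTransactionStrategy()
second = svc.run("orders")
```

The service never inspects which strategy it holds; each stage of the pipeline just hands it a different one.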

Factory Pattern

Purpose: Centralize object creation logic and hide implementation details.

Implementation:

  • EngineLoader: Creates engines from configuration
  • FileLoaderFactory: Creates loaders based on file type
  • Repository Factories: Creates repositories based on storage type

Why This Pattern?

  • Decouples object creation from usage
  • Makes it easy to swap implementations (local vs database storage)
  • Centralizes creation logic, making it easier to maintain
  • Supports configuration-driven creation

Example: create_dataset_repository(storage_type='local') vs create_dataset_repository(storage_type='database') returns different implementations without the caller knowing the details.
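A sketch of what sits behind that call, with hypothetical repository classes (the real ones implement the full repository interface):

```python
# Hypothetical implementations behind create_dataset_repository().

class LocalDatasetRepository:
    backend = "local"        # file-based storage

class DatabaseDatasetRepository:
    backend = "database"     # table-based storage

def create_dataset_repository(storage_type: str):
    # All creation logic lives here; callers only name the storage type.
    repos = {
        "local": LocalDatasetRepository,
        "database": DatabaseDatasetRepository,
    }
    return repos[storage_type]()

repo = create_dataset_repository(storage_type="local")
```

Because the mapping is data, adding a backend means one new entry, not edits at every call site.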

Repository Pattern

Purpose: Abstract data access and support multiple storage backends.

Implementation:

  • HistoryRepository: Schema history storage
  • DatasetRepository: Dataset definition storage
  • FlowRepository: Flow definition storage

Why This Pattern?

  • Separates business logic from data access
  • Makes it easy to switch storage backends (file-based vs database)
  • Enables testing with in-memory implementations
  • Provides a consistent interface regardless of storage implementation

Example: Business logic uses repository.get_flow_by_id() without knowing if it reads from files or database.
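The in-memory variant mentioned above might look like this; `get_flow_by_id` matches the example, while `save_flow` and the caller are illustrative:

```python
# Toy in-memory repository -- useful as a test double.

class InMemoryFlowRepository:
    def __init__(self):
        self._flows = {}
    def save_flow(self, flow_id: str, definition: dict) -> None:
        self._flows[flow_id] = definition
    def get_flow_by_id(self, flow_id: str):
        return self._flows.get(flow_id)

def describe_flow(repository, flow_id: str) -> str:
    # Business logic: identical against file, database, or memory backends.
    flow = repository.get_flow_by_id(flow_id)
    return f"{flow_id}: {flow['type']}" if flow else f"{flow_id}: not found"

repo = InMemoryFlowRepository()
repo.save_flow("daily_orders", {"type": "incremental"})
```

Tests exercise `describe_flow` against the in-memory repository; production wires in a file or database one.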

Plugin Pattern

Purpose: Enable extensibility without modifying core code.

Implementation:

  • FlowPlugin: Extensible flow type system
  • FileLoader: Extensible file format support
  • EnginePlugin: Extensible engine support
  • CodeGeneratorPlugin: Extensible code generation

Why This Pattern?

  • Open/Closed Principle: Open for extension, closed for modification
  • Separation of Concerns: Each plugin encapsulates its own logic
  • Testability: Plugins can be tested independently
  • Maintainability: Changes to one plugin don't affect others

Example: Adding a new flow type (e.g., sessionization) doesn't require modifying core flow orchestration code.

Template Method Pattern

Purpose: Define the skeleton of an algorithm, letting subclasses fill in details.

Implementation:

  • BaseCodeGenerator: Defines code generation workflow, subclasses implement format-specific logic
  • BaseTransformationExecutor: Defines execution workflow, subclasses implement engine-specific logic

Why This Pattern?

  • Ensures consistent workflow across implementations
  • Reduces code duplication
  • Makes it easy to add new formats/engines

Example: All code generators follow the same workflow (iterate nodes, generate code, write files), but SQL, DBT, and Airflow generators implement format-specific details.
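The skeleton can be sketched as follows; the workflow (iterate nodes, generate code, collect files) mirrors the description above, but file writing is replaced by an in-memory dict and the hook names are assumptions:

```python
# Template method sketch: fixed workflow, format-specific hook.

class BaseCodeGenerator:
    extension = "txt"
    def generate(self, dag_nodes):
        # The template method: same loop for every subclass.
        files = {}
        for node in dag_nodes:
            files[f"{node}.{self.extension}"] = self.render_node(node)
        return files
    def render_node(self, node):          # hook for subclasses
        raise NotImplementedError

class SQLCodeGenerator(BaseCodeGenerator):
    extension = "sql"
    def render_node(self, node):
        return f"-- step: {node}"

class DBTCodeGenerator(BaseCodeGenerator):
    extension = "sql"
    def render_node(self, node):
        return f"{{{{ config(alias='{node}') }}}}"

files = SQLCodeGenerator().generate(["ingest", "transform"])
```

Subclasses override only `render_node`; the iteration and file naming stay in the base class, so every generator behaves consistently.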

Observer Pattern

Purpose: Notify multiple objects about state changes.

Implementation:

  • Plugin registry notifies about plugin registration
  • Event system for flow execution (future enhancement)

Why This Pattern?

  • Decouples event producers from consumers
  • Enables multiple listeners for the same event
  • Supports event-driven architectures

Builder Pattern

Purpose: Construct complex objects step by step.

Implementation:

  • FlowDAG: Built incrementally with nodes and edges
  • TransformationInstruction: Built with rules and operations

Why This Pattern?

  • Makes complex object construction more readable
  • Allows step-by-step construction
  • Supports validation at each step

Example: Building a FlowDAG by adding nodes and edges incrementally, with validation at each step.
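A minimal builder with per-step validation, assuming a simplified `FlowDAG` (the real class tracks richer node metadata than bare names):

```python
# Toy FlowDAG builder with validation at each step.

class FlowDAG:
    def __init__(self):
        self.nodes = []
        self.edges = []
    def add_node(self, name: str) -> "FlowDAG":
        if name in self.nodes:
            raise ValueError(f"duplicate node: {name}")
        self.nodes.append(name)
        return self                      # enables step-by-step chaining
    def add_edge(self, src: str, dst: str) -> "FlowDAG":
        if src not in self.nodes or dst not in self.nodes:
            raise ValueError("edge references unknown node")
        self.edges.append((src, dst))
        return self

dag = (FlowDAG()
       .add_node("ingest")
       .add_node("transform")
       .add_edge("ingest", "transform"))
```

Invalid steps fail immediately at construction time, rather than surfacing later during execution.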

Pattern Selection Rationale

Why Not Other Patterns?

Singleton Pattern: Not used because:

  • Makes testing difficult (shared state)
  • Violates dependency injection principles
  • Hard to mock in tests

Adapter Pattern: Used implicitly in the dialect adapters, which convert SQL to engine-specific syntax; this serves as an abstraction layer rather than a formally applied pattern.

Decorator Pattern: Not needed because:

  • Plugin pattern provides better extensibility
  • Composition is preferred over decoration

Pattern Interactions

Patterns work together to create a cohesive architecture:

  1. Factory + Repository: Factories create repositories, repositories use plugin pattern
  2. Plugin + Strategy: Plugins implement strategies, strategies use executors
  3. Template Method + Plugin: Base classes define workflow, plugins implement details
  4. Builder + Factory: Builders construct complex objects, factories create builders

Benefits of This Architecture

  1. Extensibility: Easy to add new flow types, engines, loaders
  2. Testability: Each component can be tested in isolation
  3. Maintainability: Clear separation of concerns
  4. Flexibility: Can swap implementations without code changes
  5. Scalability: Patterns support growth without architectural changes

Scalability Considerations

Current Strengths

  1. Plugin Architecture: Easy to add new flow types
  2. Engine Abstraction: Support for multiple execution engines
  3. Repository Pattern: Flexible storage backends
  4. Modular Structure: Clear separation of concerns

Future Enhancements

  1. Code Generator Plugins: Support for DBT, Airflow code generation
  2. Additional Engines: PySpark, DuckDB, etc.
  3. Distributed Execution: Support for distributed processing
  4. Caching Layer: Performance optimization

Testing Strategy

Unit Tests

  • Individual components tested in isolation
  • Mock dependencies where appropriate
  • Test edge cases and error handling

Integration Tests

  • End-to-end workflow tests
  • Real engine implementations
  • File system operations

Test Coverage

  • Core transformation logic: 90%+
  • Flow plugins: 85%+
  • Executors: 85%+
  • Engines: 80%+

Performance Considerations

Optimization Opportunities

  1. Batch Processing: Process multiple records together
  2. Lazy Evaluation: Defer computation until needed
  3. Caching: Cache frequently accessed data
  4. Parallel Execution: Execute independent steps in parallel

Current Limitations

  1. Single-threaded Execution: No parallel processing yet
  2. In-memory Processing: Limited by available memory
  3. No Query Optimization: Relies on engine's optimization

Security Considerations

  1. Input Validation: All inputs validated
  2. SQL Injection Prevention: Parameterized queries
  3. File System Access: Validated file paths
  4. Configuration Validation: Schema validation for configs

Error Handling

Error Types

  1. Configuration Errors: Invalid configuration
  2. Validation Errors: Invalid data or schema
  3. Execution Errors: Runtime errors during execution
  4. File System Errors: File access issues

Error Recovery

  1. Transaction Rollback: Automatic rollback on errors
  2. Error Logging: Comprehensive error logging
  3. Graceful Degradation: Continue processing when possible

Future Roadmap

  1. DBT Code Generation: Generate DBT models
  2. Airflow Code Generation: Generate Airflow DAGs
  3. Additional Engines: PySpark, DuckDB support
  4. Distributed Execution: Support for distributed processing
  5. Real-time Processing: Stream processing support
  6. Monitoring & Observability: Built-in monitoring