Qarion ETL Architecture
Overview
Qarion ETL is built on a plugin-based, engine-agnostic architecture that separates concerns into distinct layers:
- Configuration Layer: Flow and dataset definitions
- Orchestration Layer: DAG generation and execution planning
- Transformation Layer: Engine-agnostic transformation instructions
- Execution Layer: Engine-specific executors
- Storage Layer: Database and file system engines
Core Principles
1. Plugin Architecture
All flow-specific logic is encapsulated in plugins. This allows:
- Easy addition of new flow types
- Isolation of flow-specific concerns
- Clear extension points
2. Engine Agnosticism
Transformations are described using engine-agnostic instructions that are translated to engine-specific code by executors.
3. Repository Pattern
Data access is abstracted through repositories, supporting multiple storage backends.
4. Separation of Concerns
- What (transformations) vs How (execution)
- Configuration vs Implementation
- Flow Logic vs Engine Logic
Component Architecture
Flow System
Flow Definition (TOML)
↓
FlowPlugin (flow-specific logic)
↓
FlowDAG (execution plan)
↓
Code Generation / Execution
Key Components:
- FlowPlugin: Abstract interface for flow types
- FlowDAG: Execution graph with nodes and edges
- DAGNode: Represents a single step (ingestion or transformation)
Transformation System
TransformationInstruction (engine-agnostic)
↓
TransformationExecutor (engine-specific)
↓
Engine (SQLite, Pandas, etc.)
Key Components:
- TransformationInstruction: Describes what to transform
- TransformationRule: Individual transformation rules
- TransformationExecutor: Executes instructions on specific engines
Engine System
BaseEngine (abstract)
├── SQLiteEngine
├── PandasInMemoryEngine
└── PandasLocalStorageEngine
BaseDialectAdapter (abstract)
├── SQLiteDialectAdapter
└── PandasDialectAdapter
Key Components:
- BaseEngine: Abstract engine interface
- BaseDialectAdapter: Converts commands to engine-specific syntax
- EngineLoader: Factory for creating engines from configuration
File Loading System
FileLoaderFactory
└── Loader Plugins (loaders/plugins/)
├── CSVLoaderPlugin (contains CSVFileLoader)
├── JSONLoaderPlugin (contains JSONFileLoader)
└── ParquetLoaderPlugin (contains ParquetFileLoader)
Key Components:
- BaseFileLoader: Abstract loader interface
- LoaderPlugin: Plugin interface for loaders
- FileLoaderFactory: Factory for creating loaders (uses plugin system)
- Format-specific loaders implemented as plugins (loader + plugin in same file)
File Export System
FileExporterFactory
└── Exporter Plugins (exporters/plugins/)
├── CSVExporterPlugin (contains CSVFileExporter)
├── JSONExporterPlugin (contains JSONFileExporter)
└── ParquetExporterPlugin (contains ParquetFileExporter)
Key Components:
- BaseFileExporter: Abstract exporter interface
- ExporterPlugin: Plugin interface for exporters
- FileExporterFactory: Factory for creating exporters (uses plugin system)
- ExportExecutor: Orchestrates export operations
- Format-specific exporters implemented as plugins (exporter + plugin in same file)
Credential Store System
CredentialStoreFactory
└── Credential Store Implementations
├── DatabaseCredentialStore (stores in database table)
├── LocalKeystoreCredentialStore (stores in encrypted file)
└── AWSSSMCredentialStore (uses AWS SSM Parameter Store for secure credential storage)
Key Components:
- CredentialStore: Abstract base class for credential stores
- CredentialDefinition: Metadata about credentials
- CredentialStoreFactory: Factory for creating credential stores
- CredentialStoreRegistry: Registry for credential store types
- Store implementations with encryption support
Data Flow
Typical Execution Flow
- Flow Definition → Loaded from TOML
- Dataset Generation → Flow plugin generates datasets
- DAG Generation → Flow plugin generates execution DAG
- Code Generation → Generate SQL/Python/DBT/Airflow code
- Execution → Execute transformations using appropriate executor
Transformation Execution
- Flow Plugin → Generates TransformationInstruction
- TransformationExecutor → Translates instruction to engine-specific code
- Engine → Executes the code
- Result → Returns execution statistics
Extension Points
Adding a New Flow Type
- Create a plugin class inheriting FlowPlugin
- Implement required methods: generate_datasets(), generate_dag(), generate_transformation_instructions()
- Register the plugin with register_flow_plugin()
Adding a New Engine
- Create an engine class inheriting BaseEngine
- Implement required methods: connect(), execute_query(), get_dataframe()
- Create a dialect adapter inheriting BaseDialectAdapter
- Register with EngineLoader
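A minimal sketch of a new engine, using the three method names listed above. BaseEngine and the EngineLoader-style registration are from this document; the signatures, the dict-based loader, and the SQLite-backed implementation are assumptions for illustration (a real get_dataframe() would likely return a pandas DataFrame rather than raw rows).

```python
import sqlite3
from abc import ABC, abstractmethod

# Stand-in for the abstract engine interface described above.
class BaseEngine(ABC):
    @abstractmethod
    def connect(self) -> None:
        ...

    @abstractmethod
    def execute_query(self, sql: str) -> list:
        ...

    @abstractmethod
    def get_dataframe(self, table: str) -> list:
        ...

class InMemorySQLiteEngine(BaseEngine):
    """Hypothetical engine backed by an in-memory SQLite database."""

    def connect(self):
        self._conn = sqlite3.connect(":memory:")

    def execute_query(self, sql):
        cur = self._conn.execute(sql)
        self._conn.commit()
        return cur.fetchall()

    def get_dataframe(self, table):
        # Returns plain rows here; a real engine would build a DataFrame.
        return self.execute_query(f"SELECT * FROM {table}")

# Factory-style registration, mirroring what EngineLoader does.
_ENGINES = {"sqlite_memory": InMemorySQLiteEngine}

def load_engine(name: str) -> BaseEngine:
    engine = _ENGINES[name]()
    engine.connect()
    return engine
```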
Adding a New Executor
- Create an executor class inheriting BaseTransformationExecutor
- Implement execute_transformation_instruction()
- Update TransformationService to use the new executor
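A sketch of a custom executor. Only the base-class name and the execute_transformation_instruction() method come from this document; the instruction shape and the returned statistics dict are illustrative assumptions.

```python
from abc import ABC, abstractmethod

# Stand-in for the abstract executor interface described above.
class BaseTransformationExecutor(ABC):
    @abstractmethod
    def execute_transformation_instruction(self, instruction: dict) -> dict:
        ...

class LoggingExecutor(BaseTransformationExecutor):
    """Toy executor that records what it was asked to do."""

    def __init__(self):
        self.executed = []

    def execute_transformation_instruction(self, instruction):
        self.executed.append(instruction["op"])
        # A real executor translates the instruction to engine-specific
        # code, runs it, and returns execution statistics; the numbers
        # below are placeholders.
        return {"op": instruction["op"], "rows_affected": 0}
```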
Adding a New File Loader
- Create a loader class inheriting BaseFileLoader
- Implement the load() method
- Register with FileLoaderFactory
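The loader steps might be sketched as follows. BaseFileLoader and FileLoaderFactory are named in this document; the registration API, the extension-based dispatch, and the TSV format are hypothetical (the real factory presumably resolves loaders through its plugin system rather than a class-level dict).

```python
import csv
import io
from abc import ABC, abstractmethod

# Stand-in for the abstract loader interface described above.
class BaseFileLoader(ABC):
    @abstractmethod
    def load(self, source) -> list:
        ...

class TSVFileLoader(BaseFileLoader):
    """Hypothetical loader for tab-separated files."""

    def load(self, source):
        # `source` is any text file-like object; the real loader would
        # open the path it is given.
        return list(csv.DictReader(source, delimiter="\t"))

# Toy factory mirroring FileLoaderFactory's role.
class FileLoaderFactory:
    _loaders = {}

    @classmethod
    def register(cls, extension: str, loader_cls) -> None:
        cls._loaders[extension] = loader_cls

    @classmethod
    def create(cls, path: str) -> BaseFileLoader:
        # Dispatch on the file extension.
        ext = path.rsplit(".", 1)[-1].lower()
        return cls._loaders[ext]()

FileLoaderFactory.register("tsv", TSVFileLoader)
```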
Adding a New File Exporter
- Create an exporter class inheriting BaseFileExporter
- Implement the export() method
- Create an exporter plugin class inheriting ExporterPlugin
- Register with ExporterPluginRegistry
- See Export System for the detailed guide
Adding a New Credential Store
- Create a store class inheriting CredentialStore
- Implement required methods: store_credential(), get_credential(), delete_credential(), list_credentials(), credential_exists()
- Register with CredentialStoreRegistry
- See Credential Store Architecture for the detailed guide
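A sketch implementing the five required methods named above. The CredentialStore interface comes from this document; the in-memory backend and the method signatures are assumptions (a real store would encrypt at rest, as the document notes for LocalKeystoreCredentialStore).

```python
from abc import ABC, abstractmethod

# Stand-in for the abstract credential store described above.
class CredentialStore(ABC):
    @abstractmethod
    def store_credential(self, name: str, value: str) -> None: ...

    @abstractmethod
    def get_credential(self, name: str) -> str: ...

    @abstractmethod
    def delete_credential(self, name: str) -> None: ...

    @abstractmethod
    def list_credentials(self) -> list: ...

    @abstractmethod
    def credential_exists(self, name: str) -> bool: ...

class InMemoryCredentialStore(CredentialStore):
    """Toy store backed by a dict; useful for tests, not for production."""

    def __init__(self):
        self._data = {}

    def store_credential(self, name, value):
        self._data[name] = value

    def get_credential(self, name):
        return self._data[name]

    def delete_credential(self, name):
        del self._data[name]

    def list_credentials(self):
        return sorted(self._data)

    def credential_exists(self, name):
        return name in self._data
```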
Design Patterns
Qarion ETL uses several design patterns to achieve flexibility, maintainability, and extensibility. Each pattern addresses specific architectural challenges.
Strategy Pattern
Purpose: Encapsulate different transformation approaches and execution strategies.
Implementation:
- Transformation Strategies: Different transformation approaches (landing→change_feed, change_feed→delta_transaction)
- Execution Strategies: Different execution engines (SQL, Pandas, PySpark)
Why This Pattern?
- Allows switching between different algorithms at runtime
- Makes it easy to add new transformation types without modifying existing code
- Separates the algorithm (strategy) from the context (transformation service)
Example: LandingToChangeFeedStrategy and ChangeFeedToDeltaTransactionStrategy can be swapped without changing the transformation service.
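A minimal sketch of that swap. The two strategy class names come from this document; the build_rules() method, the rule strings, and the TransformationService internals are illustrative assumptions.

```python
from abc import ABC, abstractmethod

# Stand-in strategy interface: each strategy knows how to plan one
# kind of transformation.
class TransformationStrategy(ABC):
    @abstractmethod
    def build_rules(self, dataset: str) -> list:
        ...

class LandingToChangeFeedStrategy(TransformationStrategy):
    def build_rules(self, dataset):
        return [f"diff {dataset}_landing against previous snapshot"]

class ChangeFeedToDeltaTransactionStrategy(TransformationStrategy):
    def build_rules(self, dataset):
        return [f"apply {dataset}_change_feed as delta transactions"]

class TransformationService:
    """Context class: delegates planning to whichever strategy it holds."""

    def __init__(self, strategy: TransformationStrategy):
        self.strategy = strategy  # swappable at runtime

    def plan(self, dataset: str):
        return self.strategy.build_rules(dataset)
```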
Factory Pattern
Purpose: Centralize object creation logic and hide implementation details.
Implementation:
- EngineLoader: Creates engines from configuration
- FileLoaderFactory: Creates loaders based on file type
- Repository Factories: Creates repositories based on storage type
Why This Pattern?
- Decouples object creation from usage
- Makes it easy to swap implementations (local vs database storage)
- Centralizes creation logic, making it easier to maintain
- Supports configuration-driven creation
Example: create_dataset_repository(storage_type='local') and create_dataset_repository(storage_type='database') return different implementations without the caller knowing the details.
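That example might be sketched like this. Only create_dataset_repository() and the two storage types appear in this document; the repository class names and the dict-based dispatch are hypothetical.

```python
# Hypothetical concrete repositories hidden behind the factory.
class LocalDatasetRepository:
    """Would read dataset definitions from the local file system."""

class DatabaseDatasetRepository:
    """Would read dataset definitions from a database."""

_REPOSITORIES = {
    "local": LocalDatasetRepository,
    "database": DatabaseDatasetRepository,
}

def create_dataset_repository(storage_type: str):
    # Callers pick a backend by name; the concrete class stays hidden,
    # so swapping implementations is a configuration change.
    try:
        return _REPOSITORIES[storage_type]()
    except KeyError:
        raise ValueError(f"unknown storage type: {storage_type}") from None
```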
Repository Pattern
Purpose: Abstract data access and support multiple storage backends.
Implementation:
- HistoryRepository: Schema history storage
- DatasetRepository: Dataset definition storage
- FlowRepository: Flow definition storage
Why This Pattern?
- Separates business logic from data access
- Makes it easy to switch storage backends (file-based vs database)
- Enables testing with in-memory implementations
- Provides a consistent interface regardless of storage implementation
Example: Business logic calls repository.get_flow_by_id() without knowing whether it reads from files or a database.
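A sketch of that separation. FlowRepository and get_flow_by_id() are named in this document; the in-memory implementation, the flow dict shape, and describe_flow() are illustrative assumptions.

```python
from abc import ABC, abstractmethod

# Stand-in repository interface from the text.
class FlowRepository(ABC):
    @abstractmethod
    def get_flow_by_id(self, flow_id: str) -> dict:
        ...

class InMemoryFlowRepository(FlowRepository):
    """In-memory implementation, handy for tests."""

    def __init__(self, flows: dict):
        self._flows = flows

    def get_flow_by_id(self, flow_id):
        return self._flows[flow_id]

def describe_flow(repo: FlowRepository, flow_id: str) -> str:
    # Business logic: depends only on the interface, not the backend.
    flow = repo.get_flow_by_id(flow_id)
    return f"{flow['name']} ({len(flow['steps'])} steps)"
```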
Plugin Pattern
Purpose: Enable extensibility without modifying core code.
Implementation:
- FlowPlugin: Extensible flow type system
- FileLoader: Extensible file format support
- EnginePlugin: Extensible engine support
- CodeGeneratorPlugin: Extensible code generation
Why This Pattern?
- Open/Closed Principle: Open for extension, closed for modification
- Separation of Concerns: Each plugin encapsulates its own logic
- Testability: Plugins can be tested independently
- Maintainability: Changes to one plugin don't affect others
Example: Adding a new flow type (e.g., sessionization) doesn't require modifying core flow orchestration code.
Template Method Pattern
Purpose: Define the skeleton of an algorithm, letting subclasses fill in details.
Implementation:
- BaseCodeGenerator: Defines code generation workflow, subclasses implement format-specific logic
- BaseTransformationExecutor: Defines execution workflow, subclasses implement engine-specific logic
Why This Pattern?
- Ensures consistent workflow across implementations
- Reduces code duplication
- Makes it easy to add new formats/engines
Example: All code generators follow the same workflow (iterate nodes, generate code, write files), but SQL, DBT, and Airflow generators implement format-specific details.
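The shared workflow might be sketched as follows. BaseCodeGenerator and the SQL/Airflow generators are named in this document; the generate()/render_node() split, the output shapes, and returning a filename-to-content mapping instead of writing files are assumptions for illustration.

```python
from abc import ABC, abstractmethod

class BaseCodeGenerator(ABC):
    """Template method: the workflow is fixed, the rendering is not."""

    extension: str

    def generate(self, nodes: list) -> dict:
        # Fixed skeleton: iterate nodes, render each, collect outputs.
        # (A real generator would write the files to disk.)
        return {f"{node}.{self.extension}": self.render_node(node) for node in nodes}

    @abstractmethod
    def render_node(self, node: str) -> str:
        # Format-specific hook filled in by subclasses.
        ...

class SQLCodeGenerator(BaseCodeGenerator):
    extension = "sql"

    def render_node(self, node):
        return f"-- generated\nSELECT * FROM {node};"

class AirflowCodeGenerator(BaseCodeGenerator):
    extension = "py"

    def render_node(self, node):
        return f"task_{node} = make_task('{node}')"
```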
Observer Pattern
Purpose: Notify multiple objects about state changes.
Implementation:
- Plugin registry notifies about plugin registration
- Event system for flow execution (future enhancement)
Why This Pattern?
- Decouples event producers from consumers
- Enables multiple listeners for the same event
- Supports event-driven architectures
Builder Pattern
Purpose: Construct complex objects step by step.
Implementation:
- FlowDAG: Built incrementally with nodes and edges
- TransformationInstruction: Built with rules and operations
Why This Pattern?
- Makes complex object construction more readable
- Allows step-by-step construction
- Supports validation at each step
Example: Building a FlowDAG by adding nodes and edges incrementally, with validation at each step.
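The incremental construction might look like this. FlowDAG, nodes, and edges come from this document; the fluent add_node()/add_edge() API and the validation rule are illustrative assumptions.

```python
class FlowDAG:
    """Toy incremental builder for an execution DAG."""

    def __init__(self):
        self.nodes = set()
        self.edges = []

    def add_node(self, name: str) -> "FlowDAG":
        self.nodes.add(name)
        return self  # returning self enables fluent chaining

    def add_edge(self, src: str, dst: str) -> "FlowDAG":
        # Validate at each step: both endpoints must already exist.
        if src not in self.nodes or dst not in self.nodes:
            raise ValueError(f"unknown node in edge {src} -> {dst}")
        self.edges.append((src, dst))
        return self
```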
Pattern Selection Rationale
Why Not Other Patterns?
Singleton Pattern: Not used because:
- Makes testing difficult (shared state)
- Violates dependency injection principles
- Hard to mock in tests
Adapter Pattern: Used implicitly in:
- Dialect Adapters: Convert SQL to engine-specific syntax
- Applied as an abstraction layer rather than as a formal, named pattern
Decorator Pattern: Not needed because:
- Plugin pattern provides better extensibility
- Composition is preferred over decoration
Pattern Interactions
Patterns work together to create a cohesive architecture:
- Factory + Repository: Factories create repositories, repositories use plugin pattern
- Plugin + Strategy: Plugins implement strategies, strategies use executors
- Template Method + Plugin: Base classes define workflow, plugins implement details
- Builder + Factory: Builders construct complex objects, factories create builders
Benefits of This Architecture
- Extensibility: Easy to add new flow types, engines, loaders
- Testability: Each component can be tested in isolation
- Maintainability: Clear separation of concerns
- Flexibility: Can swap implementations without code changes
- Scalability: Patterns support growth without architectural changes
Scalability Considerations
Current Strengths
- Plugin Architecture: Easy to add new flow types
- Engine Abstraction: Support for multiple execution engines
- Repository Pattern: Flexible storage backends
- Modular Structure: Clear separation of concerns
Future Enhancements
- Code Generator Plugins: Support for DBT, Airflow code generation
- Additional Engines: PySpark, DuckDB, etc.
- Distributed Execution: Support for distributed processing
- Caching Layer: Performance optimization
Testing Strategy
Unit Tests
- Individual components tested in isolation
- Mock dependencies where appropriate
- Test edge cases and error handling
Integration Tests
- End-to-end workflow tests
- Real engine implementations
- File system operations
Test Coverage
- Core transformation logic: 90%+
- Flow plugins: 85%+
- Executors: 85%+
- Engines: 80%+
Performance Considerations
Optimization Opportunities
- Batch Processing: Process multiple records together
- Lazy Evaluation: Defer computation until needed
- Caching: Cache frequently accessed data
- Parallel Execution: Execute independent steps in parallel
Current Limitations
- Single-threaded Execution: No parallel processing yet
- In-memory Processing: Limited by available memory
- No Query Optimization: Relies on engine's optimization
Security Considerations
- Input Validation: All inputs validated
- SQL Injection Prevention: Parameterized queries
- File System Access: Validated file paths
- Configuration Validation: Schema validation for configs
Error Handling
Error Types
- Configuration Errors: Invalid configuration
- Validation Errors: Invalid data or schema
- Execution Errors: Runtime errors during execution
- File System Errors: File access issues
Error Recovery
- Transaction Rollback: Automatic rollback on errors
- Error Logging: Comprehensive error logging
- Graceful Degradation: Continue processing when possible
Future Roadmap
- DBT Code Generation: Generate DBT models
- Airflow Code Generation: Generate Airflow DAGs
- Additional Engines: PySpark, DuckDB support
- Distributed Execution: Support for distributed processing
- Real-time Processing: Stream processing support
- Monitoring & Observability: Built-in monitoring