
Typing Improvements Guide

This document outlines typing improvements that can be made across the Qarion ETL codebase to enhance type safety, IDE support, and code maintainability.

Current State Analysis

Issues Identified

  1. Excessive use of Any - Many functions use Any where more specific types could be used
  2. Missing return type annotations - Some methods lack return type hints
  3. Overuse of Dict[str, Any] - Could use TypedDict for structured data
  4. Missing Protocol types - Interfaces could use Protocol for better type checking
  5. Inconsistent Optional usage - Some fields default to None without being annotated as Optional
  6. Generic Union types - Could be more specific (e.g., Union[List[Dict[str, Any]], Any])

1. Replace Any with Specific Types

Current Issues

# flows/plugins/quality_check/execution.py
def execute_quality_check_suite(
    suite: QualityCheckSuite,
    engine: Any,  # ❌ Should be BaseEngine
    datasets: List[Dict[str, Any]],
    quality_check_repository: Optional[QualityCheckRepository] = None,
    dataset_repository: Optional[Any] = None,  # ❌ Should be specific type
) -> Dict[str, Any]:
    ...

Recommended Improvement

from typing import TYPE_CHECKING, Any, Dict, List, Optional
from engines import BaseEngine

# Import only for type checking so the repository module is not loaded at runtime
if TYPE_CHECKING:
    from repository import DatasetRepository
else:
    DatasetRepository = 'DatasetRepository'

def execute_quality_check_suite(
    suite: QualityCheckSuite,
    engine: BaseEngine,  # ✅ Specific type
    datasets: List[Dict[str, Any]],
    quality_check_repository: Optional[QualityCheckRepository] = None,
    dataset_repository: Optional[DatasetRepository] = None,  # ✅ Specific type
) -> Dict[str, Any]:
    ...

2. Use TypedDict for Structured Dictionaries

Current Issues

# Many places use Dict[str, Any] for structured data
def generate_datasets(
    self,
    flow_definition: Dict[str, Any],  # ❌ Could be more specific
) -> List[Dict[str, Any]]:
    ...

Recommended Improvement

from typing import Any, Dict, List, Optional, TypedDict

class FlowDefinition(TypedDict, total=False):
    """Type definition for flow configuration dictionaries."""
    flow_type: str
    name: str
    description: Optional[str]
    source: Dict[str, Any]
    target: Dict[str, Any]
    transformations: List[Dict[str, Any]]
    # ... other fields

def generate_datasets(
    self,
    flow_definition: FlowDefinition,  # ✅ Type-safe
) -> List[Dict[str, Any]]:
    ...
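
A minimal sketch of how a TypedDict such as FlowDefinition behaves in practice: it remains a plain dict at runtime, so existing callers keep working, while a type checker validates keys and value types. The reduced field set, the `describe_flow` helper, and the sample values are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, List, Optional, TypedDict


class FlowDefinition(TypedDict, total=False):
    """Illustrative subset of the flow configuration keys."""
    flow_type: str
    name: str
    description: Optional[str]
    transformations: List[Dict[str, Any]]


def describe_flow(flow_definition: FlowDefinition) -> str:
    """Hypothetical helper: summarize a flow definition in one line."""
    # total=False makes every key optional, so read with .get() and a default.
    name = flow_definition.get("name", "<unnamed>")
    flow_type = flow_definition.get("flow_type", "unknown")
    steps = len(flow_definition.get("transformations", []))
    return f"{name} ({flow_type}, {steps} transformation(s))"


flow: FlowDefinition = {"name": "daily_sales", "flow_type": "standard"}
summary = describe_flow(flow)

# A TypedDict is an ordinary dict at runtime; only the type checker sees the schema.
is_plain_dict = isinstance(flow, dict)

# A checker such as mypy would reject the next line (unknown key), although it runs fine:
# bad: FlowDefinition = {"nmae": "daily_sales"}
```

Because `total=False` makes every key optional, reads go through `.get()` with a default; keys that must always be present are better declared in a separate `total=True` base class.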

3. Use Protocol for Interfaces

Current Issues

# engines/base.py
def get_dataframe(
    self,
    statement: str,
    params: Optional[tuple] = None,
) -> Union[List[Dict[str, Any]], Any]:  # ❌ Too generic
    ...

Recommended Improvement

from typing import Any, Dict, List, Optional, Protocol, Union, runtime_checkable
from typing_extensions import TypeAlias

# Define a protocol for DataFrame-like objects
@runtime_checkable
class DataFrameLike(Protocol):
    """Protocol for objects that behave like DataFrames."""
    def to_dict(self, orient: str = 'records') -> List[Dict[str, Any]]: ...
    def __len__(self) -> int: ...
    def __getitem__(self, key: str) -> Any: ...

# Type alias for return values
DataFrameResult: TypeAlias = Union[List[Dict[str, Any]], DataFrameLike]

def get_dataframe(
    self,
    statement: str,
    params: Optional[tuple] = None,
) -> DataFrameResult:  # ✅ More specific
    ...
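
The following sketch shows how a caller might consume such a result, assuming the protocol above. `FakeFrame` and `to_records` are hypothetical names introduced for illustration; `FakeFrame` stands in for a real DataFrame class and deliberately does not inherit from the protocol.

```python
from typing import Any, Dict, List, Protocol, Union, runtime_checkable


@runtime_checkable
class DataFrameLike(Protocol):
    def to_dict(self, orient: str = "records") -> List[Dict[str, Any]]: ...
    def __len__(self) -> int: ...
    def __getitem__(self, key: str) -> Any: ...


DataFrameResult = Union[List[Dict[str, Any]], DataFrameLike]


class FakeFrame:
    """Hypothetical stand-in for a DataFrame; matches the protocol structurally."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self._rows = rows

    def to_dict(self, orient: str = "records") -> List[Dict[str, Any]]:
        return list(self._rows)

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, key: str) -> Any:
        return [row[key] for row in self._rows]


def to_records(result: DataFrameResult) -> List[Dict[str, Any]]:
    """Hypothetical helper: normalize either result shape to a list of row dicts."""
    if isinstance(result, list):
        return result
    # isinstance() against a runtime_checkable Protocol only verifies that the
    # methods exist; it does not check their signatures.
    if isinstance(result, DataFrameLike):
        return result.to_dict(orient="records")
    raise TypeError(f"Unsupported result type: {type(result).__name__}")


rows = [{"id": 1}, {"id": 2}]
from_list = to_records(rows)
from_frame = to_records(FakeFrame(rows))
```

Since the runtime check is structural and shallow, it is best treated as a convenience for branching; the static checker remains the place where signatures are verified.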

4. Add Missing Return Type Annotations

Current Issues

# engines/base.py
def disconnect(self):  # ❌ Missing return type
    """Closes the connection, if one exists."""
    if self._connection:
        logger.info(f"Disconnecting from {self.__class__.__name__}.")
        self._connection = None

# loaders/executor.py
def __post_init__(self):  # ❌ Missing return type
    if self.metadata is None:
        self.metadata = {}

Recommended Improvement

def disconnect(self) -> None:  # ✅ Explicit None return
    """Closes the connection, if one exists."""
    if self._connection:
        logger.info(f"Disconnecting from {self.__class__.__name__}.")
        self._connection = None

def __post_init__(self) -> None:  # ✅ Explicit None return
    if self.metadata is None:
        self.metadata = {}

5. Fix Optional Type Annotations

Current Issues

# loaders/executor.py
@dataclass
class LoadExecutionResult:
    metadata: Dict[str, Any] = None  # ❌ Should be Optional[Dict[str, Any]]

Recommended Improvement

@dataclass
class LoadExecutionResult:
    metadata: Optional[Dict[str, Any]] = None  # ✅ Proper Optional
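
Where `None` carries no meaning of its own and `__post_init__` only replaces it with an empty dict, one alternative is `field(default_factory=dict)`, which keeps the annotation non-optional. The sketch below assumes that situation; the `success` field is hypothetical and not taken from the codebase.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class LoadExecutionResult:
    success: bool  # hypothetical field, for illustration only
    # Each instance gets its own fresh dict, so callers never see None.
    metadata: Dict[str, Any] = field(default_factory=dict)


first = LoadExecutionResult(success=True)
second = LoadExecutionResult(success=False)
first.metadata["rows_loaded"] = 10
```

With this form, callers no longer need a `None` check before reading `metadata`, and the `__post_init__` fallback becomes unnecessary.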

6. Use Literal Types for Constants

Current Good Example (Already in use)

# storage/pattern_matching.py
PatternType = Literal["glob", "regexp"]  # ✅ Good!

def match_pattern(
    filename: str,
    pattern: str,
    pattern_type: PatternType = "glob",
) -> bool:
    ...

Areas to Apply

# Could use Literal for operation types, node types, etc.
from typing import Literal

LoadOperationType = Literal["file_load", "directory_scan", "batch_load"]
NodeType = Literal["ingestion", "transformation", "quality_check", "export"]
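
Literal types are enforced only by the type checker, so values arriving from configuration files still need a runtime check. A minimal sketch of one way to do that, reusing the NodeType alias above; `parse_node_type` is a hypothetical helper:

```python
from typing import Literal, cast, get_args

NodeType = Literal["ingestion", "transformation", "quality_check", "export"]

# get_args() returns the literal values as a tuple, in declaration order.
VALID_NODE_TYPES = get_args(NodeType)


def parse_node_type(value: str) -> NodeType:
    """Hypothetical helper: validate an untyped string (e.g. from config)."""
    if value not in VALID_NODE_TYPES:
        raise ValueError(
            f"Unknown node type {value!r}; expected one of {VALID_NODE_TYPES}"
        )
    # The membership test above justifies the cast for the type checker.
    return cast(NodeType, value)


parsed = parse_node_type("export")
```

Deriving the valid values from the alias keeps the runtime check and the static type in sync when a new node type is added.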

7. Use Generic Types for Collections

Current Issues

# Many places
def get_dataframe(...) -> Union[List[Dict[str, Any]], Any]:
    ...

Recommended Improvement

from typing import TypeVar, Generic, List, Dict, Any

T = TypeVar('T')

class DataFrame(Generic[T]):
    """Generic DataFrame type."""
    def to_dict(self) -> List[Dict[str, T]]: ...
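
The same technique applies to repositories. The sketch below is a hypothetical in-memory repository, not a class from the codebase, showing how the type parameter flows through to return types:

```python
from typing import Dict, Generic, Optional, TypeVar

T = TypeVar("T")


class InMemoryRepository(Generic[T]):
    """Hypothetical repository keyed by string id; for illustration only."""

    def __init__(self) -> None:
        self._items: Dict[str, T] = {}

    def save(self, item_id: str, item: T) -> None:
        self._items[item_id] = item

    def get(self, item_id: str) -> Optional[T]:
        return self._items.get(item_id)


# The type parameter tells the checker that get() returns Optional[int] here.
row_counts: InMemoryRepository[int] = InMemoryRepository()
row_counts.save("orders", 42)
```

A checker would flag `row_counts.save("orders", "42")` because the argument is a `str` rather than an `int`, even though the call would succeed at runtime.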

8. Use TYPE_CHECKING for Forward References

Current Good Example (Already in use)

# flows/base.py
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from flows.flow_orchestration import FlowDAG
    from transformations import TransformationService
else:
    FlowDAG = 'FlowDAG'
    TransformationService = 'TransformationService'

Areas to Apply More Broadly

Many modules could benefit from TYPE_CHECKING to avoid circular imports while maintaining type safety.
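
As a sketch of a variant that needs no else branch: quoting the annotation (or placing `from __future__ import annotations` at the top of the module) means the name is never evaluated at runtime. Here `decimal` merely stands in for a module that would cause a circular import, and `format_amount` is a hypothetical function.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only type checkers follow this import; it is skipped at runtime.
    from decimal import Decimal


def format_amount(amount: "Optional[Decimal]" = None) -> str:
    """Hypothetical function whose annotation names a type-checking-only import."""
    return "n/a" if amount is None else f"{amount:.2f}"


# The annotation is stored as a plain string and never evaluated at runtime.
stored_annotation = format_amount.__annotations__["amount"]
```

The trade-off is that tools which resolve annotations at runtime, for example via `typing.get_type_hints`, will fail on names that exist only under TYPE_CHECKING.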

Priority Improvements

High Priority

  1. Replace Any with specific types in function signatures

    • engine: Any → engine: BaseEngine
    • dataset_repository: Optional[Any] → dataset_repository: Optional[DatasetRepository]
  2. Add missing return type annotations

    • All __post_init__ methods
    • All disconnect() and cleanup methods
    • Property getters
  3. Fix Optional annotations in dataclasses

    • metadata: Dict[str, Any] = None → metadata: Optional[Dict[str, Any]] = None

Medium Priority

  1. Create TypedDict for common structures

    • FlowDefinition
    • DatasetDefinition
    • QualityCheckConfig
    • TransformationConfig
  2. Use Protocol for interfaces

    • DataFrameLike protocol
    • Repository protocol
    • Connector protocol

Low Priority

  1. Add Literal types for enums

    • Operation types
    • Node types
    • Status values
  2. Use Generic types for collections

    • Generic DataFrame types
    • Generic repository types

Implementation Strategy

Phase 1: Quick Wins (1-2 days)

  • Add missing return type annotations (-> None)
  • Fix Optional annotations in dataclasses
  • Replace obvious Any types with specific classes

Phase 2: Structured Types (3-5 days)

  • Create TypedDict definitions for common structures
  • Update function signatures to use TypedDict
  • Add Protocol definitions for interfaces

Phase 3: Advanced Typing (1 week)

  • Implement Generic types where appropriate
  • Add Literal types for constants
  • Expand TYPE_CHECKING usage

Tools for Validation

Type Checkers

# Install mypy
pip install mypy

# Run type checking
mypy qarion-etl/qarion_etl/

# With strict mode
mypy --strict qarion-etl/qarion_etl/

IDE Support

  • PyCharm: Built-in type checking
  • VS Code: Pylance extension
  • Vim/Neovim: coc-pyright

CI/CD Integration

Add to .gitlab-ci.yml or similar:

type_check:
  script:
    - pip install mypy
    - mypy qarion-etl/qarion_etl/ --ignore-missing-imports

Examples of Improved Code

Before

from typing import Dict, Any, List, Optional

def execute_quality_check(
    suite: QualityCheckSuite,
    engine: Any,
    config: Dict[str, Any],
) -> Dict[str, Any]:
    ...

After

from typing import Any, Dict, TypedDict
from typing_extensions import NotRequired
from engines import BaseEngine

# NotRequired marks each key as optional individually, so total=False
# (which would make every key optional anyway) is not needed here.
class QualityCheckConfig(TypedDict):
    """Configuration for quality check execution."""
    stop_on_error: NotRequired[bool]
    validate_input: NotRequired[bool]
    timeout: NotRequired[int]

def execute_quality_check(
    suite: QualityCheckSuite,
    engine: BaseEngine,
    config: QualityCheckConfig,
) -> Dict[str, Any]:
    ...

Benefits

  1. Better IDE Support: Autocomplete, type hints, refactoring
  2. Early Error Detection: Catch type errors before runtime
  3. Self-Documenting Code: Types serve as documentation
  4. Refactoring Safety: Type checker catches breaking changes
  5. Team Collaboration: Clearer contracts between modules

Resources