Typing Improvements Guide
This document outlines typing improvements that can be made across the Qarion ETL codebase to enhance type safety, IDE support, and code maintainability.
Current State Analysis
Issues Identified
- Excessive use of Any: Many functions use Any where more specific types could be used
- Missing return type annotations: Some methods lack return type hints
- Overuse of Dict[str, Any]: Could use TypedDict for structured data
- Missing Protocol types: Interfaces could use Protocol for better type checking
- Inconsistent Optional usage: Some fields default to None without being declared Optional
- Generic Union types: Could be more specific (e.g., Union[List[Dict[str, Any]], Any])
Recommended Improvements
1. Replace Any with Specific Types
Current Issues
# flows/plugins/quality_check/execution.py
def execute_quality_check_suite(
    suite: QualityCheckSuite,
    engine: Any,  # ❌ Should be BaseEngine
    datasets: List[Dict[str, Any]],
    quality_check_repository: Optional[QualityCheckRepository] = None,
    dataset_repository: Optional[Any] = None  # ❌ Should be specific type
) -> Dict[str, Any]:
Recommended Fix
from typing import TYPE_CHECKING, Any, Dict, List, Optional

from engines import BaseEngine

if TYPE_CHECKING:
    # Imported only for type checkers, which avoids a circular import at runtime
    from repository import DatasetRepository
else:
    DatasetRepository = 'DatasetRepository'

def execute_quality_check_suite(
    suite: QualityCheckSuite,
    engine: BaseEngine,  # ✅ Specific type
    datasets: List[Dict[str, Any]],
    quality_check_repository: Optional[QualityCheckRepository] = None,
    dataset_repository: Optional[DatasetRepository] = None  # ✅ Specific type
) -> Dict[str, Any]:
2. Use TypedDict for Structured Dictionaries
Current Issues
# Many places use Dict[str, Any] for structured data
def generate_datasets(
    self,
    flow_definition: Dict[str, Any]  # ❌ Could be more specific
) -> List[Dict[str, Any]]:
Recommended Fix
from typing import Any, Dict, List, Optional, TypedDict

class FlowDefinition(TypedDict, total=False):
    """Type definition for flow configuration dictionaries."""
    flow_type: str
    name: str
    description: Optional[str]
    source: Dict[str, Any]
    target: Dict[str, Any]
    transformations: List[Dict[str, Any]]
    # ... other fields

def generate_datasets(
    self,
    flow_definition: FlowDefinition  # ✅ Type-safe
) -> List[Dict[str, Any]]:
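A TypedDict is still a plain dict at runtime, so adopting it does not change how existing callers build or pass configuration. The sketch below illustrates this with a trimmed-down FlowDefinition; the describe_flow helper and the sample values are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, Optional, TypedDict


class FlowDefinition(TypedDict, total=False):
    """Trimmed-down, illustrative subset of the flow configuration keys."""
    flow_type: str
    name: str
    description: Optional[str]
    source: Dict[str, Any]


def describe_flow(flow_definition: FlowDefinition) -> str:
    """Hypothetical helper that builds a short label for a flow.

    total=False makes every key optional, so .get() with a default is
    used instead of direct indexing.
    """
    name = flow_definition.get("name", "<unnamed>")
    flow_type = flow_definition.get("flow_type", "unknown")
    return f"{name} ({flow_type})"


# An ordinary dict literal is accepted; a type checker verifies keys and value types.
flow: FlowDefinition = {"name": "orders", "flow_type": "ingestion"}
label = describe_flow(flow)
```

Because total=False makes all keys optional, a type checker will not flag a missing key. If some keys are mandatory, one option is to declare them in a separate TypedDict without total=False and inherit from it.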
3. Use Protocol for Interfaces
Current Issues
# engines/base.py
def get_dataframe(
    self,
    statement: str,
    params: Optional[tuple] = None
) -> Union[List[Dict[str, Any]], Any]:  # ❌ Too generic
Recommended Fix
from typing import Any, Dict, List, Optional, Protocol, Union, runtime_checkable
from typing_extensions import TypeAlias

# Define a protocol for DataFrame-like objects
@runtime_checkable
class DataFrameLike(Protocol):
    """Protocol for objects that behave like DataFrames."""
    def to_dict(self, orient: str = 'records') -> List[Dict[str, Any]]: ...
    def __len__(self) -> int: ...
    def __getitem__(self, key: str) -> Any: ...

# Type alias for return values
DataFrameResult: TypeAlias = Union[List[Dict[str, Any]], DataFrameLike]

def get_dataframe(
    self,
    statement: str,
    params: Optional[tuple] = None
) -> DataFrameResult:  # ✅ More specific
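To show how callers might consume such a return type, here is a minimal sketch using only the standard library. FakeFrame and to_records are hypothetical names introduced for illustration, and the alias is written as a plain assignment so the example does not depend on typing_extensions.

```python
from typing import Any, Dict, List, Protocol, Union, runtime_checkable


@runtime_checkable
class DataFrameLike(Protocol):
    """Structural type: any object with these three methods qualifies."""
    def to_dict(self, orient: str = "records") -> List[Dict[str, Any]]: ...
    def __len__(self) -> int: ...
    def __getitem__(self, key: str) -> Any: ...


DataFrameResult = Union[List[Dict[str, Any]], DataFrameLike]


class FakeFrame:
    """Hypothetical stand-in for a DataFrame; it does not inherit from DataFrameLike."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self._rows = rows

    def to_dict(self, orient: str = "records") -> List[Dict[str, Any]]:
        return list(self._rows)

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, key: str) -> Any:
        return [row[key] for row in self._rows]


def to_records(result: DataFrameResult) -> List[Dict[str, Any]]:
    """Normalize either branch of the union to a list of row dicts."""
    if isinstance(result, list):
        return result
    return result.to_dict(orient="records")


records = to_records(FakeFrame([{"id": 1}, {"id": 2}]))
```

Note that isinstance checks against a runtime_checkable protocol only verify that the methods exist, not that their signatures match, so the static type checker remains the main line of defense.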
4. Add Missing Return Type Annotations
Current Issues
# engines/base.py
def disconnect(self):  # ❌ Missing return type
    """Closes the connection, if one exists."""
    if self._connection:
        logger.info(f"Disconnecting from {self.__class__.__name__}.")
        self._connection = None

# loaders/executor.py
def __post_init__(self):  # ❌ Missing return type
    if self.metadata is None:
        self.metadata = {}
Recommended Fix
def disconnect(self) -> None:  # ✅ Explicit None return
    """Closes the connection, if one exists."""
    if self._connection:
        logger.info(f"Disconnecting from {self.__class__.__name__}.")
        self._connection = None

def __post_init__(self) -> None:  # ✅ Explicit None return
    if self.metadata is None:
        self.metadata = {}
5. Fix Optional Type Annotations
Current Issues
# loaders/executor.py
@dataclass
class LoadExecutionResult:
    metadata: Dict[str, Any] = None  # ❌ Should be Optional[Dict[str, Any]]
Recommended Fix
@dataclass
class LoadExecutionResult:
    metadata: Optional[Dict[str, Any]] = None  # ✅ Proper Optional
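Where None carries no meaning of its own and is only a placeholder for an empty dict, an alternative worth considering is dataclasses.field with default_factory. This is a suggestion rather than something the codebase already does; the success field below is a hypothetical addition for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class LoadExecutionResult:
    # 'success' is a hypothetical field; only 'metadata' appears in the guide.
    success: bool = True
    # Each instance gets its own fresh dict, so the field never needs to be Optional
    # and no None check in __post_init__ is required.
    metadata: Dict[str, Any] = field(default_factory=dict)


first = LoadExecutionResult()
second = LoadExecutionResult()
first.metadata["rows"] = 10  # does not affect 'second'
```

The trade-off is that callers can no longer distinguish "no metadata supplied" from "empty metadata", so Optional remains the right choice where that distinction matters.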
6. Use Literal Types for Constants
Current Good Example (Already in use)
# storage/pattern_matching.py
PatternType = Literal["glob", "regexp"] # ✅ Good!
def match_pattern(
    filename: str,
    pattern: str,
    pattern_type: PatternType = "glob"
) -> bool:
Areas to Apply
# Could use Literal for operation types, node types, etc.
from typing import Literal
LoadOperationType = Literal["file_load", "directory_scan", "batch_load"]
NodeType = Literal["ingestion", "transformation", "quality_check", "export"]
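Literal types are checked only statically, so values arriving from configuration files still need a runtime check. One possible approach, sketched below, derives the allowed values from the Literal itself with typing.get_args so the two cannot drift apart. The parse_node_type helper is a hypothetical name.

```python
from typing import Literal, cast, get_args

NodeType = Literal["ingestion", "transformation", "quality_check", "export"]

# Tuple of the allowed string values, extracted from the Literal definition.
VALID_NODE_TYPES = get_args(NodeType)


def parse_node_type(value: str) -> NodeType:
    """Hypothetical helper: validate an untyped string and narrow it to NodeType."""
    if value not in VALID_NODE_TYPES:
        raise ValueError(
            f"Unknown node type {value!r}; expected one of {VALID_NODE_TYPES}"
        )
    # cast() has no runtime effect; it tells the type checker the value is valid.
    return cast(NodeType, value)


node_type = parse_node_type("quality_check")
```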
7. Use Generic Types for Collections
Current Issues
# Many places
def get_dataframe(...) -> Union[List[Dict[str, Any]], Any]:
Recommended Fix
from typing import TypeVar, Generic, List, Dict, Any

T = TypeVar('T')

class DataFrame(Generic[T]):
    """Generic DataFrame type."""
    def to_dict(self) -> List[Dict[str, T]]: ...
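The same Generic pattern could apply to repositories, which the priority list further down mentions. The following is a minimal sketch, assuming a simple key-value interface; InMemoryRepository and its methods are hypothetical and do not reflect the actual repository classes.

```python
from typing import Dict, Generic, List, Optional, TypeVar

T = TypeVar("T")


class InMemoryRepository(Generic[T]):
    """Hypothetical repository whose item type is fixed per instance."""

    def __init__(self) -> None:
        self._items: Dict[str, T] = {}

    def add(self, key: str, item: T) -> None:
        self._items[key] = item

    def get(self, key: str) -> Optional[T]:
        # Returns None for unknown keys, which the Optional return type makes explicit.
        return self._items.get(key)

    def all(self) -> List[T]:
        return list(self._items.values())


# The annotation fixes T to str; a type checker would reject names.add("b", 42).
names: InMemoryRepository[str] = InMemoryRepository()
names.add("a", "orders")
```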
8. Use TYPE_CHECKING for Forward References
Current Good Example (Already in use)
# flows/base.py
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from flows.flow_orchestration import FlowDAG
    from transformations import TransformationService
else:
    FlowDAG = 'FlowDAG'
    TransformationService = 'TransformationService'
Areas to Apply More Broadly
Many modules could benefit from TYPE_CHECKING to avoid circular imports while maintaining type safety.
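As a rough illustration of a lighter variant of this pattern, the sketch below quotes the annotation instead of assigning a string in an else branch. The import path is the one from flows/base.py above; describe_dag is a hypothetical function. The guarded import never executes at runtime, so the snippet runs even where that module is unavailable.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by type checkers; skipped at runtime, so no circular import.
    from flows.flow_orchestration import FlowDAG


def describe_dag(dag: "FlowDAG") -> str:
    """Hypothetical function; the quoted annotation is never evaluated at runtime."""
    return f"DAG of type {type(dag).__name__}"


# At runtime the annotation is stored as a plain string.
annotation = describe_dag.__annotations__["dag"]
```

With quoted annotations the else branch becomes unnecessary, since nothing looks up the name FlowDAG at runtime. The else branch is still needed if the name is used outside annotations.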
Priority Improvements
High Priority
- Replace Any with specific types in function signatures
  - engine: Any → engine: BaseEngine
  - dataset_repository: Optional[Any] → dataset_repository: Optional[DatasetRepository]
- Add missing return type annotations
  - All __post_init__ methods
  - All disconnect() and cleanup methods
  - Property getters
- Fix Optional annotations in dataclasses
  - metadata: Dict[str, Any] = None → metadata: Optional[Dict[str, Any]] = None
Medium Priority
- Create TypedDict for common structures
  - FlowDefinition
  - DatasetDefinition
  - QualityCheckConfig
  - TransformationConfig
- Use Protocol for interfaces
  - DataFrameLike protocol
  - Repository protocol
  - Connector protocol
Low Priority
- Add Literal types for enums
  - Operation types
  - Node types
  - Status values
- Use Generic types for collections
  - Generic DataFrame types
  - Generic repository types
Implementation Strategy
Phase 1: Quick Wins (1-2 days)
- Add missing return type annotations (-> None)
- Fix Optional annotations in dataclasses
- Replace obvious Any types with specific classes
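To size the first item before starting, a small script could list functions that lack a return annotation. The sketch below uses the standard ast module on an inline sample; find_missing_return_annotations is a hypothetical helper, and mypy with the --disallow-untyped-defs flag reports the same gaps more thoroughly.

```python
import ast
from typing import List, Tuple


def find_missing_return_annotations(source: str) -> List[Tuple[int, str]]:
    """Return (line number, function name) for every def without a '-> ...' annotation."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.returns is None:
            missing.append((node.lineno, node.name))
    return sorted(missing)


# Inline sample standing in for a real module's source text.
SAMPLE = '''
class Engine:
    def disconnect(self):
        self._connection = None

    def connect(self) -> None:
        pass
'''

report = find_missing_return_annotations(SAMPLE)
```

In practice the source string would come from reading each .py file under the package directory.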
Phase 2: Structured Types (3-5 days)
- Create TypedDict definitions for common structures
- Update function signatures to use TypedDict
- Add Protocol definitions for interfaces
Phase 3: Advanced Typing (1 week)
- Implement Generic types where appropriate
- Add Literal types for constants
- Expand TYPE_CHECKING usage
Tools for Validation
Type Checkers
# Install mypy
pip install mypy
# Run type checking
mypy qarion-etl/qarion_etl/
# With strict mode
mypy --strict qarion-etl/qarion_etl/
IDE Support
- PyCharm: Built-in type checking
- VS Code: Pylance extension
- Vim/Neovim: coc-pyright
CI/CD Integration
Add to .gitlab-ci.yml or similar:
type_check:
  script:
    - pip install mypy
    - mypy qarion-etl/qarion_etl/ --ignore-missing-imports
Examples of Improved Code
Before
from typing import Dict, Any, List, Optional
def execute_quality_check(
    suite: QualityCheckSuite,
    engine: Any,
    config: Dict[str, Any]
) -> Dict[str, Any]:
    # ...
After
from typing import Any, Dict, TypedDict
from typing_extensions import NotRequired

from engines import BaseEngine

class QualityCheckConfig(TypedDict):
    """Configuration for quality check execution."""
    # NotRequired marks each optional key individually, so total=False is not needed
    stop_on_error: NotRequired[bool]
    validate_input: NotRequired[bool]
    timeout: NotRequired[int]

def execute_quality_check(
    suite: QualityCheckSuite,
    engine: BaseEngine,
    config: QualityCheckConfig
) -> Dict[str, Any]:
    # ...
Benefits
- Better IDE Support: Autocomplete, type hints, refactoring
- Early Error Detection: Catch type errors before runtime
- Self-Documenting Code: Types serve as documentation
- Refactoring Safety: Type checker catches breaking changes
- Team Collaboration: Clearer contracts between modules