Lineage and SQL Conversion Test Suite
Overview
Comprehensive test suite for SQL parsing, lineage extraction, SQL-to-instruction conversion, and execution on non-SQL engines (pandas).
Test Files
1. test_lineage_extraction.py (253 lines)
Purpose: Test lineage extraction from SQL queries
Test Classes:
TestSQLParsing: SQL parsing with various query typesTestLineageExtraction: Table and column-level lineage extractionTestTaskLineageExtraction: Lineage extraction from tasksTestFlowLineage: Flow-level lineage extraction
Coverage:
- Simple SELECT queries
- INSERT INTO ... SELECT statements
- JOIN queries
- Aggregation queries
- Complex queries with subqueries
- Dataset mapping
- Task-level lineage
- Flow-level lineage
2. test_sql_to_instruction_conversion.py (476 lines)
Purpose: Test SQL to transformation instruction conversion
Test Classes:
TestSQLToInstructionConversion: Basic conversion testsTestPandasExecution: Execution on pandas engineTestEndToEndSQLToPandas: Complete workflow tests
Coverage:
- Simple SELECT conversion
- INSERT INTO ... SELECT conversion
- WHERE clause extraction
- Aggregation handling
- Calculation extraction
- Primary key detection
- Numerical column detection
- Pandas execution
- End-to-end workflows
3. test_lineage_pandas_integration.py (381 lines)
Purpose: Integration tests for lineage with pandas execution
Test Classes:
TestSQLToPandasWorkflow: SQL -> Pandas execution workflowsTestLineageIntegration: Lineage integration with dependenciesTestComplexSQLPatterns: Complex SQL pattern handling
Coverage:
- Simple transformations on pandas
- Calculations on pandas
- Filtering on pandas
- Dependency inference from lineage
- CTE parsing
- Subquery parsing
- Multiple JOINs
- UNION queries
4. test_sql_pandas_end_to_end.py (342 lines)
Purpose: End-to-end tests for complete workflows
Test Classes:
TestSQLToPandasEndToEnd: Complete SQL -> Pandas workflowsTestLineageExtractionComprehensive: Comprehensive lineage testsTestSQLParserComprehensive: Comprehensive SQL parsing tests
Coverage:
- Simple SELECT with filtering
- Calculations
- Column selection and projection
- Multiple transformations in sequence
- Various SQL query types
5. test_lineage_comprehensive.py (342 lines)
Purpose: Comprehensive integration tests
Test Classes:
TestCompleteWorkflow: Complete workflow testsTestSQLExpressionEvaluation: SQL expression evaluation in pandasTestInstructionStructure: Instruction structure validation
Coverage:
- Complete SQL -> Lineage -> Instruction -> Pandas workflows
- Multiple transformations in sequence
- Expression evaluation
- Instruction serialization
Running Tests
Prerequisites
# Install required dependencies
pip install sqlglot pandas pytest
# Optional: For enhanced lineage analysis
pip install sqllineage
Run All Tests
# Run all lineage and SQL tests
pytest tests/test_lineage*.py tests/test_sql*.py -v
# Run with coverage
pytest tests/test_lineage*.py tests/test_sql*.py --cov=qarion-etl.sql --cov=qarion-etl.lineage -v
Run Specific Test Files
# Lineage extraction tests
pytest tests/test_lineage_extraction.py -v
# SQL to instruction conversion
pytest tests/test_sql_to_instruction_conversion.py -v
# Pandas integration
pytest tests/test_lineage_pandas_integration.py -v
# End-to-end tests
pytest tests/test_sql_pandas_end_to_end.py -v
# Comprehensive tests
pytest tests/test_lineage_comprehensive.py -v
Run Specific Test Classes
pytest tests/test_lineage_extraction.py::TestSQLParsing -v
pytest tests/test_sql_to_instruction_conversion.py::TestPandasExecution -v
Run Specific Tests
pytest tests/test_sql_pandas_end_to_end.py::TestSQLToPandasEndToEnd::test_simple_select_filter -v
Test Structure
Fixtures
sample_datasets: Sample dataset definitionspandas_engine: Pandas in-memory enginepandas_executor: Pandas transformation executorcomplete_setup: Complete setup (engine, executor, converter, extractor)
Test Patterns
@pytest.mark.skipif(not SQLGLOT_AVAILABLE, reason="SQLGlot not available")
@pytest.mark.skipif(not PANDAS_AVAILABLE, reason="Pandas not available")
def test_example(pandas_setup):
"""Test description."""
engine, executor = pandas_setup
# 1. Setup data
# 2. Convert SQL to instruction
# 3. Execute on pandas
# 4. Verify results
Test Coverage
SQL Parsing ✅
- Simple SELECT
- INSERT INTO ... SELECT
- JOIN queries
- Aggregation (GROUP BY)
- WHERE clauses
- CTEs (Common Table Expressions)
- Subqueries
- UNION queries
- Multiple JOINs
Lineage Extraction ✅
- Table-level lineage
- Column-level lineage
- Dataset mapping
- Task-level lineage
- Flow-level lineage
- Source/target identification
- Join table detection
SQL to Instruction Conversion ✅
- Simple SELECT
- INSERT statements
- WHERE condition extraction
- Column selection
- Calculations
- Aggregations
- Primary key detection
- Numerical column detection
Pandas Execution ✅
- Simple transformations
- Filtered transformations
- Column selection
- Calculations (multiplication, addition)
- Expression evaluation
- Multiple transformations in sequence
Integration ✅
- SQL -> Lineage -> Instruction -> Pandas
- Dependency inference
- End-to-end workflows
- Instruction serialization
Example Test Scenarios
Scenario 1: Simple Filtering
# SQL
sql = "SELECT id, name, amount FROM orders WHERE status = 'active'"
# Convert to instruction
instruction = converter.convert(sql, 'orders', 'staging_orders', batch_id=1)
# Execute on pandas
result = executor.execute_transformation_instruction(instruction)
# Verify: Only active orders in target
Scenario 2: Calculations
# SQL with calculations
sql = """
SELECT
id,
amount * 1.1 as adjusted_amount,
amount + tax as total_amount
FROM orders
"""
# Convert and execute
instruction = converter.convert(sql, 'orders', 'staging_orders', batch_id=1)
result = executor.execute_transformation_instruction(instruction)
# Verify: Calculations applied correctly
Scenario 3: Lineage Extraction
# Extract lineage
extractor = LineageExtractor()
lineage = extractor.extract_from_sql(sql, datasets)
# Verify: Source and target datasets identified
assert "orders" in lineage.source_datasets
assert lineage.target_dataset == "staging_orders"
Expected Test Results
When all dependencies are installed:
tests/test_lineage_extraction.py::TestSQLParsing::test_parse_simple_select PASSED
tests/test_lineage_extraction.py::TestLineageExtraction::test_extract_table_lineage_simple PASSED
tests/test_sql_to_instruction_conversion.py::TestSQLToInstructionConversion::test_convert_simple_select PASSED
tests/test_sql_pandas_end_to_end.py::TestSQLToPandasEndToEnd::test_simple_select_filter PASSED
tests/test_lineage_comprehensive.py::TestCompleteWorkflow::test_complete_sql_to_pandas_workflow PASSED
Troubleshooting
Tests Skipped
If tests are skipped:
- Check dependencies:
pip install sqlglot pandas pytest - Verify imports: Check for import errors in test output
- Check fixtures: Ensure fixtures are properly defined
Import Errors
If you see import errors:
- Ensure you're in the project root directory
- Check that
qarion-etlis in Python path - Verify all dependencies are installed
Pandas Execution Errors
If pandas tests fail:
- Ensure pandas is installed:
pip install pandas - Check that
PandasInMemoryEngineis available - Verify engine initialization in fixtures
SQL Parsing Errors
If SQL parsing fails:
- Ensure SQLGlot is installed:
pip install sqlglot - Check SQL syntax (SQLGlot may not support all dialects)
- Verify SQL query is valid
Continuous Integration
These tests are designed to run in CI/CD pipelines:
# Example GitHub Actions
- name: Install dependencies
run: |
pip install sqlglot pandas pytest
- name: Run lineage tests
run: |
pytest tests/test_lineage*.py tests/test_sql*.py -v
Performance Considerations
- SQL parsing is fast (SQLGlot is optimized)
- Lineage extraction adds minimal overhead
- Pandas execution is in-memory (suitable for test data)
- Large datasets may require optimization
Future Enhancements
- Add tests for PySpark execution
- Add tests for complex aggregations
- Add tests for window functions
- Add performance benchmarks
- Add tests for cross-flow lineage
- Add tests for lineage visualization