Skip to main content

Lineage and SQL Conversion Test Suite

Overview

Comprehensive test suite for SQL parsing, lineage extraction, SQL-to-instruction conversion, and execution on non-SQL engines (pandas).

Test Files

1. test_lineage_extraction.py (253 lines)

Purpose: Test lineage extraction from SQL queries

Test Classes:

  • TestSQLParsing: SQL parsing with various query types
  • TestLineageExtraction: Table and column-level lineage extraction
  • TestTaskLineageExtraction: Lineage extraction from tasks
  • TestFlowLineage: Flow-level lineage extraction

Coverage:

  • Simple SELECT queries
  • INSERT INTO ... SELECT statements
  • JOIN queries
  • Aggregation queries
  • Complex queries with subqueries
  • Dataset mapping
  • Task-level lineage
  • Flow-level lineage

2. test_sql_to_instruction_conversion.py (476 lines)

Purpose: Test SQL to transformation instruction conversion

Test Classes:

  • TestSQLToInstructionConversion: Basic conversion tests
  • TestPandasExecution: Execution on pandas engine
  • TestEndToEndSQLToPandas: Complete workflow tests

Coverage:

  • Simple SELECT conversion
  • INSERT INTO ... SELECT conversion
  • WHERE clause extraction
  • Aggregation handling
  • Calculation extraction
  • Primary key detection
  • Numerical column detection
  • Pandas execution
  • End-to-end workflows

3. test_lineage_pandas_integration.py (381 lines)

Purpose: Integration tests for lineage with pandas execution

Test Classes:

  • TestSQLToPandasWorkflow: SQL -> Pandas execution workflows
  • TestLineageIntegration: Lineage integration with dependencies
  • TestComplexSQLPatterns: Complex SQL pattern handling

Coverage:

  • Simple transformations on pandas
  • Calculations on pandas
  • Filtering on pandas
  • Dependency inference from lineage
  • CTE parsing
  • Subquery parsing
  • Multiple JOINs
  • UNION queries

4. test_sql_pandas_end_to_end.py (342 lines)

Purpose: End-to-end tests for complete workflows

Test Classes:

  • TestSQLToPandasEndToEnd: Complete SQL -> Pandas workflows
  • TestLineageExtractionComprehensive: Comprehensive lineage tests
  • TestSQLParserComprehensive: Comprehensive SQL parsing tests

Coverage:

  • Simple SELECT with filtering
  • Calculations
  • Column selection and projection
  • Multiple transformations in sequence
  • Various SQL query types

5. test_lineage_comprehensive.py (342 lines)

Purpose: Comprehensive integration tests

Test Classes:

  • TestCompleteWorkflow: Complete workflow tests
  • TestSQLExpressionEvaluation: SQL expression evaluation in pandas
  • TestInstructionStructure: Instruction structure validation

Coverage:

  • Complete SQL -> Lineage -> Instruction -> Pandas workflows
  • Multiple transformations in sequence
  • Expression evaluation
  • Instruction serialization

Running Tests

Prerequisites

# Install required dependencies
pip install sqlglot pandas pytest

# Optional: For enhanced lineage analysis
pip install sqllineage

Run All Tests

# Run all lineage and SQL tests
pytest tests/test_lineage*.py tests/test_sql*.py -v

# Run with coverage
pytest tests/test_lineage*.py tests/test_sql*.py --cov=qarion-etl.sql --cov=qarion-etl.lineage -v

Run Specific Test Files

# Lineage extraction tests
pytest tests/test_lineage_extraction.py -v

# SQL to instruction conversion
pytest tests/test_sql_to_instruction_conversion.py -v

# Pandas integration
pytest tests/test_lineage_pandas_integration.py -v

# End-to-end tests
pytest tests/test_sql_pandas_end_to_end.py -v

# Comprehensive tests
pytest tests/test_lineage_comprehensive.py -v

Run Specific Test Classes

pytest tests/test_lineage_extraction.py::TestSQLParsing -v
pytest tests/test_sql_to_instruction_conversion.py::TestPandasExecution -v

Run Specific Tests

pytest tests/test_sql_pandas_end_to_end.py::TestSQLToPandasEndToEnd::test_simple_select_filter -v

Test Structure

Fixtures

  • sample_datasets: Sample dataset definitions
  • pandas_engine: Pandas in-memory engine
  • pandas_executor: Pandas transformation executor
  • complete_setup: Complete setup (engine, executor, converter, extractor)

Test Patterns

@pytest.mark.skipif(not SQLGLOT_AVAILABLE, reason="SQLGlot not available")
@pytest.mark.skipif(not PANDAS_AVAILABLE, reason="Pandas not available")
def test_example(pandas_setup):
"""Test description."""
engine, executor = pandas_setup

# 1. Setup data
# 2. Convert SQL to instruction
# 3. Execute on pandas
# 4. Verify results

Test Coverage

SQL Parsing ✅

  • Simple SELECT
  • INSERT INTO ... SELECT
  • JOIN queries
  • Aggregation (GROUP BY)
  • WHERE clauses
  • CTEs (Common Table Expressions)
  • Subqueries
  • UNION queries
  • Multiple JOINs

Lineage Extraction ✅

  • Table-level lineage
  • Column-level lineage
  • Dataset mapping
  • Task-level lineage
  • Flow-level lineage
  • Source/target identification
  • Join table detection

SQL to Instruction Conversion ✅

  • Simple SELECT
  • INSERT statements
  • WHERE condition extraction
  • Column selection
  • Calculations
  • Aggregations
  • Primary key detection
  • Numerical column detection

Pandas Execution ✅

  • Simple transformations
  • Filtered transformations
  • Column selection
  • Calculations (multiplication, addition)
  • Expression evaluation
  • Multiple transformations in sequence

Integration ✅

  • SQL -> Lineage -> Instruction -> Pandas
  • Dependency inference
  • End-to-end workflows
  • Instruction serialization

Example Test Scenarios

Scenario 1: Simple Filtering

# SQL
sql = "SELECT id, name, amount FROM orders WHERE status = 'active'"

# Convert to instruction
instruction = converter.convert(sql, 'orders', 'staging_orders', batch_id=1)

# Execute on pandas
result = executor.execute_transformation_instruction(instruction)

# Verify: Only active orders in target

Scenario 2: Calculations

# SQL with calculations
sql = """
SELECT
id,
amount * 1.1 as adjusted_amount,
amount + tax as total_amount
FROM orders
"""

# Convert and execute
instruction = converter.convert(sql, 'orders', 'staging_orders', batch_id=1)
result = executor.execute_transformation_instruction(instruction)

# Verify: Calculations applied correctly

Scenario 3: Lineage Extraction

# Extract lineage
extractor = LineageExtractor()
lineage = extractor.extract_from_sql(sql, datasets)

# Verify: Source and target datasets identified
assert "orders" in lineage.source_datasets
assert lineage.target_dataset == "staging_orders"

Expected Test Results

When all dependencies are installed:

tests/test_lineage_extraction.py::TestSQLParsing::test_parse_simple_select PASSED
tests/test_lineage_extraction.py::TestLineageExtraction::test_extract_table_lineage_simple PASSED
tests/test_sql_to_instruction_conversion.py::TestSQLToInstructionConversion::test_convert_simple_select PASSED
tests/test_sql_pandas_end_to_end.py::TestSQLToPandasEndToEnd::test_simple_select_filter PASSED
tests/test_lineage_comprehensive.py::TestCompleteWorkflow::test_complete_sql_to_pandas_workflow PASSED

Troubleshooting

Tests Skipped

If tests are skipped:

  1. Check dependencies: pip install sqlglot pandas pytest
  2. Verify imports: Check for import errors in test output
  3. Check fixtures: Ensure fixtures are properly defined

Import Errors

If you see import errors:

  1. Ensure you're in the project root directory
  2. Check that qarion-etl is in Python path
  3. Verify all dependencies are installed

Pandas Execution Errors

If pandas tests fail:

  1. Ensure pandas is installed: pip install pandas
  2. Check that PandasInMemoryEngine is available
  3. Verify engine initialization in fixtures

SQL Parsing Errors

If SQL parsing fails:

  1. Ensure SQLGlot is installed: pip install sqlglot
  2. Check SQL syntax (SQLGlot may not support all dialects)
  3. Verify SQL query is valid

Continuous Integration

These tests are designed to run in CI/CD pipelines:

# Example GitHub Actions
- name: Install dependencies
run: |
pip install sqlglot pandas pytest

- name: Run lineage tests
run: |
pytest tests/test_lineage*.py tests/test_sql*.py -v

Performance Considerations

  • SQL parsing is fast (SQLGlot is optimized)
  • Lineage extraction adds minimal overhead
  • Pandas execution is in-memory (suitable for test data)
  • Large datasets may require optimization

Future Enhancements

  • Add tests for PySpark execution
  • Add tests for complex aggregations
  • Add tests for window functions
  • Add performance benchmarks
  • Add tests for cross-flow lineage
  • Add tests for lineage visualization