Data Contract Validation

Overview

Data contracts in Qarion ETL define and enforce data quality expectations at both the schema and the data level. A contract specifies the expected column types, constraints, value ranges, and business rules that data must satisfy.

Contracts can be validated automatically after ingestion, so that incoming data is checked against your defined standards before it moves further through your data pipeline.

What Are Data Contracts?

A data contract is a specification that defines:

  • Schema Requirements: Expected columns and their types
  • Constraints: NOT NULL, UNIQUE, and other constraints
  • Value Ranges: Minimum and maximum values for numeric columns
  • Patterns: Regex patterns for string validation
  • Business Rules: Custom validation logic
  • Validation Mode: How strictly to enforce the contract (strict, lenient, monitor)

Contract Modes

Contracts support three validation modes:

Strict Mode

  • All violations are treated as errors
  • Flow execution stops on contract validation failures
  • Use when data quality is critical and failures should be immediate

[properties.contract]
mode = "strict"

Lenient Mode

  • Warnings for non-critical violations
  • Errors for critical violations (e.g., missing required columns)
  • Flow continues but logs warnings
  • Use when you want to monitor issues but not fail immediately

[properties.contract]
mode = "lenient"

Monitor Mode

  • All violations are logged but don't fail the flow
  • Useful for monitoring data quality trends
  • Use when you want visibility without blocking execution

[properties.contract]
mode = "monitor"
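The three modes differ only in how a violation is classified. The following is a minimal Python sketch of that mapping, not Qarion ETL's actual implementation: the `Violation` class, its `critical` flag, and `resolve_outcome` are hypothetical names introduced for illustration, and it assumes that monitor-mode violations are recorded as warnings.

```python
from dataclasses import dataclass


@dataclass
class Violation:
    """Hypothetical record of one contract violation."""
    column: str
    message: str
    critical: bool = False  # e.g. a missing required column


def resolve_outcome(mode, violations):
    """Split violations into (errors, warnings) according to the contract mode."""
    errors, warnings = [], []
    for v in violations:
        text = f"{v.column}: {v.message}"
        if mode == "strict":
            # Strict: every violation is an error.
            errors.append(text)
        elif mode == "lenient":
            # Lenient: only critical violations are errors.
            (errors if v.critical else warnings).append(text)
        elif mode == "monitor":
            # Monitor: everything is logged, nothing fails the flow.
            warnings.append(text)
        else:
            raise ValueError(f"unknown contract mode: {mode!r}")
    return errors, warnings
```

Under this sketch, a flow would stop only when the returned error list is non-empty, which can never happen in monitor mode.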

Defining Contracts

Inline Contract Definition

Define contracts directly in dataset properties:

Complete Dataset with Contract:

# datasets/orders.toml
id = "orders"
name = "Orders Dataset"
namespace = "production"
description = "Customer orders with data contract validation"

[columns]
[columns.order_id]
schema_type = "integer"
required = true
primary_key = true

[columns.customer_id]
schema_type = "integer"
required = true

[columns.amount]
schema_type = "float"
required = true

[columns.order_date]
schema_type = "timestamp"
required = true

[columns.status]
schema_type = "string"
required = true

[properties]
table_type = "landing"

# Contract validation configuration
[properties.contract]
id = "orders_contract"
name = "Orders Data Contract"
mode = "strict"
enabled = true
version = "1.0.0"

[[properties.contract.columns]]
name = "order_id"
schema_type = "integer"
required = true
nullable = false
description = "Unique order identifier"

[[properties.contract.columns]]
name = "customer_id"
schema_type = "integer"
required = true
nullable = false
description = "Customer identifier"

[[properties.contract.columns]]
name = "amount"
schema_type = "float"
required = true
nullable = false
min_value = 0
max_value = 1000000
description = "Order amount in dollars"

[[properties.contract.columns]]
name = "order_date"
schema_type = "timestamp"
required = true
nullable = false
description = "Order creation date"

[[properties.contract.columns]]
name = "status"
schema_type = "string"
required = true
nullable = false
enum_values = ["pending", "completed", "cancelled"]
description = "Order status"

Contract ID Reference

Reference a contract by ID (useful for reusable contracts):

Dataset with Contract Reference:

# datasets/orders.toml
id = "orders"
name = "Orders Dataset"
namespace = "production"

[columns]
[columns.order_id]
schema_type = "integer"
required = true
primary_key = true

[columns.customer_id]
schema_type = "integer"
required = true

[columns.amount]
schema_type = "float"
required = true

[properties]
table_type = "landing"
# Reference contract by ID (contract definition stored elsewhere)
contract = "orders_contract"

Note: When using contract ID references, the contract definition should be stored in a contract registry or defined in a shared location. For inline contracts, define them directly in dataset properties as shown above.
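The difference between the two forms is that `contract` is either a table (inline definition) or a string (ID reference). A minimal sketch of resolving both, assuming a hypothetical in-memory `registry` dictionary keyed by contract ID (the real registry mechanism is not described here):

```python
from typing import Optional


def resolve_contract(properties: dict, registry: dict) -> Optional[dict]:
    """Return the contract definition for a dataset, or None if none is configured.

    `registry` is a hypothetical mapping of contract ID to contract definition.
    """
    contract = properties.get("contract")
    if contract is None:
        return None
    if isinstance(contract, str):
        # ID reference: look the definition up in the shared registry.
        if contract not in registry:
            raise KeyError(f"contract {contract!r} not found in registry")
        return registry[contract]
    # Inline definition: the table itself is the contract.
    return contract
```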

Column Constraints

Required and Nullable

[[properties.contract.columns]]
name = "order_id"
required = true # Column must be present in the dataset
nullable = false # Column values cannot be NULL

Value Ranges

[[properties.contract.columns]]
name = "amount"
schema_type = "float"
min_value = 0
max_value = 1000000

String Patterns

[[properties.contract.columns]]
name = "email"
schema_type = "string"
pattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
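Note that the TOML string escapes the backslash (`\\.`), so the regex engine receives `\.`, a literal dot. The sketch below checks a few values against the same expression using Python's `re` module; it assumes the pattern is applied with Python-compatible regex semantics, which is not confirmed here, and `matches_pattern` is a hypothetical helper.

```python
import re

# The pattern as the regex engine sees it after TOML unescaping.
EMAIL_PATTERN = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"


def matches_pattern(value: str, pattern: str = EMAIL_PATTERN) -> bool:
    """Return True if the value satisfies the contract pattern."""
    return re.search(pattern, value) is not None


for candidate in ["user@example.com", "user@example", "not-an-email"]:
    print(candidate, matches_pattern(candidate))
```

Because the pattern is anchored with `^` and `$`, the whole value must match, not just a substring.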

Enum Values

[[properties.contract.columns]]
name = "status"
schema_type = "string"
enum_values = ["pending", "completed", "cancelled"]

String Length

[[properties.contract.columns]]
name = "product_code"
schema_type = "string"
min_length = 5
max_length = 20

Constraints

[[properties.contract.columns]]
name = "order_id"
constraints = ["UNIQUE", "NOT NULL"]
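The column rules above can be read as independent checks applied to each value, plus a column-wide check for uniqueness. The following sketch shows one way such checks could be evaluated. It is illustrative only: `check_value` and `find_duplicates` are hypothetical helpers, and treating a column as nullable when `nullable` is omitted is an assumption.

```python
from typing import Any


def check_value(rule: dict, value: Any) -> list:
    """Return the problems found for one value under one column rule."""
    problems = []
    if value is None:
        # Assumption: columns are nullable unless the rule says otherwise.
        if not rule.get("nullable", True):
            problems.append("Value is NULL but column is not nullable")
        return problems
    if "min_value" in rule and value < rule["min_value"]:
        problems.append(f"Value {value} is below minimum value {rule['min_value']}")
    if "max_value" in rule and value > rule["max_value"]:
        problems.append(f"Value {value} exceeds maximum value {rule['max_value']}")
    if "enum_values" in rule and value not in rule["enum_values"]:
        problems.append(
            f"Value {value!r} is not in allowed enum values {rule['enum_values']}"
        )
    if "min_length" in rule and len(value) < rule["min_length"]:
        problems.append(f"Length {len(value)} is below minimum length {rule['min_length']}")
    if "max_length" in rule and len(value) > rule["max_length"]:
        problems.append(f"Length {len(value)} exceeds maximum length {rule['max_length']}")
    return problems


def find_duplicates(values: list) -> list:
    """Return the sorted values that occur more than once (a UNIQUE check)."""
    seen, duplicates = set(), set()
    for v in values:
        if v in seen:
            duplicates.add(v)
        seen.add(v)
    return sorted(duplicates)
```

For example, `check_value({"min_length": 5, "max_length": 20}, "AB")` reports one length problem, and `find_duplicates([1, 2, 2, 3, 1])` returns the two repeated IDs.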

Automatic Validation

Contracts are automatically validated after ingestion if all of the following hold:

  1. A contract is configured in the dataset properties
  2. Data is successfully loaded into the dataset
  3. The contract validation feature is enabled

Validation Process:

  1. After successful ingestion, Qarion ETL extracts target tables
  2. For each loaded dataset, it checks for contract configuration
  3. If found, validates the dataset's schema and data against the contract
  4. Logs results and handles failures based on contract mode
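The schema part of step 3 can be sketched as a comparison between the dataset's `[columns]` table and the contract's column entries. This is a simplified illustration rather than Qarion ETL's code: `validate_schema` is a hypothetical function, and the message wording is invented for the example.

```python
def validate_schema(dataset_columns: dict, contract_columns: list) -> list:
    """Compare dataset column definitions against contract column rules.

    dataset_columns:  mapping of column name to its definition,
                      e.g. {"amount": {"schema_type": "float"}}
    contract_columns: list of contract column entries,
                      e.g. [{"name": "amount", "schema_type": "float", "required": True}]
    """
    errors = []
    for rule in contract_columns:
        name = rule["name"]
        actual = dataset_columns.get(name)
        if actual is None:
            # A column absent from the dataset only matters if it is required.
            if rule.get("required", False):
                errors.append(f"{name}: required column is missing")
            continue
        expected_type = rule.get("schema_type")
        actual_type = actual.get("schema_type")
        if expected_type and actual_type != expected_type:
            errors.append(f"{name}: expected type {expected_type}, found {actual_type}")
    return errors
```

An empty result means the schema satisfies the contract; any entries would then be handled according to the contract mode, as in step 4.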

Example Flow with Contract Validation:

Flow Definition:

# flows/orders_flow.toml
id = "orders_flow"
name = "Orders Processing Flow"
flow_type = "change_feed"
namespace = "production"

[input]
primary_key = ["order_id"]
columns = [
{name = "order_id", schema_type = "string", required = true},
{name = "customer_id", schema_type = "string", required = true},
{name = "amount", schema_type = "float", required = true},
{name = "order_date", schema_type = "date", required = true},
{name = "status", schema_type = "string", required = true}
]

[properties]
change_detection_columns = ["amount", "status"]

[properties.load]
source_path = "data/orders"
file_pattern = "orders_*.csv"
format = "csv"
[properties.load.loader_config]
delimiter = ","
header = true
encoding = "utf-8"

Landing Dataset with Contract:

# datasets/orders_flow_landing.toml
id = "orders_flow_landing"
name = "Orders Landing Table"
namespace = "production"

[columns]
[columns.order_id]
schema_type = "string"
required = true
primary_key = true

[columns.customer_id]
schema_type = "string"
required = true

[columns.amount]
schema_type = "float"
required = true

[columns.order_date]
schema_type = "date"
required = true

[columns.status]
schema_type = "string"
required = true

[properties]
table_type = "landing"

# Contract validation configuration
[properties.contract]
id = "orders_contract"
name = "Orders Data Contract"
mode = "strict"
enabled = true

[[properties.contract.columns]]
name = "order_id"
schema_type = "string"
required = true
nullable = false
min_length = 1
max_length = 50

[[properties.contract.columns]]
name = "customer_id"
schema_type = "string"
required = true
nullable = false

[[properties.contract.columns]]
name = "amount"
schema_type = "float"
required = true
nullable = false
min_value = 0
max_value = 1000000

[[properties.contract.columns]]
name = "order_date"
schema_type = "date"
required = true
nullable = false

[[properties.contract.columns]]
name = "status"
schema_type = "string"
required = true
nullable = false
enum_values = ["pending", "completed", "cancelled"]

Validation Results

When contract validation runs, you'll see logs like:

INFO: Contract validation passed for dataset orders_flow_landing (contract: orders_contract)

Or if validation fails:

ERROR: Contract validation failed for dataset orders_flow_landing (contract: orders_contract): 2 errors, 0 warnings

In strict mode, the flow will fail with detailed error messages:

ValueError: Contract validation failed for dataset orders_flow_landing (contract: orders_contract):
amount: Value 1500000 exceeds maximum value 1000000,
status: Value 'invalid' is not in allowed enum values ['pending', 'completed', 'cancelled']
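The log lines and the strict-mode exception above follow a regular shape. As a rough sketch of how such output could be assembled from a list of errors and warnings (the helper names `summary_line` and `enforce` are hypothetical, and the real formatting code may differ):

```python
def summary_line(dataset_id: str, contract_id: str, errors: list, warnings: list) -> str:
    """Build the one-line log summary for a validation run."""
    target = f"dataset {dataset_id} (contract: {contract_id})"
    if not errors:
        return f"INFO: Contract validation passed for {target}"
    return (
        f"ERROR: Contract validation failed for {target}: "
        f"{len(errors)} errors, {len(warnings)} warnings"
    )


def enforce(mode: str, dataset_id: str, contract_id: str, errors: list) -> None:
    """Raise ValueError in strict mode when any error was recorded."""
    if mode == "strict" and errors:
        details = ",\n".join(errors)
        raise ValueError(
            f"Contract validation failed for dataset {dataset_id} "
            f"(contract: {contract_id}):\n{details}"
        )
```

In this sketch only strict mode raises; lenient and monitor runs would emit the summary line and let the flow continue.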

Best Practices

  1. Start with Monitor Mode: Use monitor mode initially to understand data quality issues without blocking execution
  2. Define Clear Contracts: Be specific about requirements (ranges, patterns, enums)
  3. Use Strict Mode for Critical Data: Apply strict mode to datasets where data quality is essential
  4. Document Contracts: Add descriptions to contract columns for clarity
  5. Reuse Contracts: Use contract ID references for common contract definitions
  6. Validate Early: Configure contracts on landing tables to catch issues at ingestion