Data Contract Validation
Overview
Data contracts in Qarion ETL provide a way to define and enforce data quality expectations at the schema and data level. Contracts specify expected column types, constraints, value ranges, and business rules that data must satisfy.
Contracts can be validated automatically after ingestion, so incoming data is checked against your defined standards before it moves further through your data pipeline.
What Are Data Contracts?
A data contract is a specification that defines:
- Schema Requirements: Expected columns and their types
- Constraints: NOT NULL, UNIQUE, and other constraints
- Value Ranges: Minimum and maximum values for numeric columns
- Patterns: Regex patterns for string validation
- Business Rules: Custom validation logic
- Validation Mode: How strictly to enforce the contract (strict, lenient, monitor)
Contract Modes
Contracts support three validation modes:
Strict Mode
- All violations are treated as errors
- Flow execution stops on contract validation failures
- Use when data quality is critical and a violation should stop the flow immediately
[properties.contract]
mode = "strict"
Lenient Mode
- Warnings for non-critical violations
- Errors for critical violations (e.g., missing required columns)
- Flow continues but logs warnings
- Use when you want to monitor issues but not fail immediately
[properties.contract]
mode = "lenient"
Monitor Mode
- All violations are logged but don't fail the flow
- Useful for monitoring data quality trends
- Use when you want visibility without blocking execution
[properties.contract]
mode = "monitor"
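The three modes can be read as a mapping from a list of violations to an outcome. The sketch below is illustrative only and is not Qarion ETL's implementation: the `Violation` class, the `critical` flag, and the `resolve_outcome` function are hypothetical names, and it assumes that in lenient mode only critical violations fail the flow.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Violation:
    """One contract violation (hypothetical shape, for illustration only)."""
    column: str
    message: str
    critical: bool = False  # e.g. a missing required column


def resolve_outcome(mode: str, violations: list[Violation]) -> str:
    """Return 'pass', 'warn', or 'fail' for one validation run."""
    if not violations:
        return "pass"
    if mode == "strict":
        # Every violation is treated as an error, so the flow stops.
        return "fail"
    if mode == "lenient":
        # Assumption: only critical violations escalate to a failure.
        return "fail" if any(v.critical for v in violations) else "warn"
    if mode == "monitor":
        # Violations are logged, but the flow is never blocked.
        return "warn"
    raise ValueError(f"Unknown contract mode: {mode!r}")
```

With a single non-critical violation, this sketch returns "fail" in strict mode and "warn" in both lenient and monitor modes.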
Defining Contracts
Inline Contract Definition
Define contracts directly in dataset properties:
Complete Dataset with Contract:
# datasets/orders.toml
id = "orders"
name = "Orders Dataset"
namespace = "production"
description = "Customer orders with data contract validation"
[columns]
[columns.order_id]
schema_type = "integer"
required = true
primary_key = true
[columns.customer_id]
schema_type = "integer"
required = true
[columns.amount]
schema_type = "float"
required = true
[columns.order_date]
schema_type = "timestamp"
required = true
[columns.status]
schema_type = "string"
required = true
[properties]
table_type = "landing"
# Contract validation configuration
[properties.contract]
id = "orders_contract"
name = "Orders Data Contract"
mode = "strict"
enabled = true
version = "1.0.0"
[[properties.contract.columns]]
name = "order_id"
schema_type = "integer"
required = true
nullable = false
description = "Unique order identifier"
[[properties.contract.columns]]
name = "customer_id"
schema_type = "integer"
required = true
nullable = false
description = "Customer identifier"
[[properties.contract.columns]]
name = "amount"
schema_type = "float"
required = true
nullable = false
min_value = 0
max_value = 1000000
description = "Order amount in dollars"
[[properties.contract.columns]]
name = "order_date"
schema_type = "timestamp"
required = true
nullable = false
description = "Order creation date"
[[properties.contract.columns]]
name = "status"
schema_type = "string"
required = true
nullable = false
enum_values = ["pending", "completed", "cancelled"]
description = "Order status"
Contract ID Reference
Reference a contract by ID (useful for reusable contracts):
Dataset with Contract Reference:
# datasets/orders.toml
id = "orders"
name = "Orders Dataset"
namespace = "production"
[columns]
[columns.order_id]
schema_type = "integer"
required = true
primary_key = true
[columns.customer_id]
schema_type = "integer"
required = true
[columns.amount]
schema_type = "float"
required = true
[properties]
table_type = "landing"
# Reference contract by ID (contract definition stored elsewhere)
contract = "orders_contract"
Note: When using contract ID references, the contract definition should be stored in a contract registry or defined in a shared location. For inline contracts, define them directly in dataset properties as shown above.
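To make the difference between the two styles concrete, here is a minimal sketch of how a `contract` property could be resolved. The registry dictionary and the `resolve_contract` function are assumptions made for illustration; this page does not describe how Qarion ETL actually stores or looks up shared contracts.

```python
from typing import Any, Dict, Union

# Hypothetical in-memory registry keyed by contract ID.
CONTRACT_REGISTRY: Dict[str, Dict[str, Any]] = {
    "orders_contract": {
        "id": "orders_contract",
        "mode": "strict",
        "columns": [{"name": "order_id", "required": True, "nullable": False}],
    }
}


def resolve_contract(contract: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return a contract definition from an ID reference or an inline table."""
    if isinstance(contract, str):
        # `contract = "orders_contract"` style: look the ID up in the registry.
        try:
            return CONTRACT_REGISTRY[contract]
        except KeyError:
            raise KeyError(f"Contract {contract!r} is not registered") from None
    # `[properties.contract]` style: the definition is already inline.
    return contract
```

Treating a string as a reference and a table as a definition lets one code path serve both styles.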
Column Constraints
Required and Nullable
[[properties.contract.columns]]
name = "order_id"
required = true # Column must be present
nullable = false # Column cannot be NULL
Value Ranges
[[properties.contract.columns]]
name = "amount"
schema_type = "float"
min_value = 0
max_value = 1000000
String Patterns
[[properties.contract.columns]]
name = "email"
schema_type = "string"
pattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
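The doubled backslash in the TOML string above is decoded to a single backslash, so the regex engine receives `\.` (a literal dot). A small Python sketch of the check follows. It assumes the pattern must match the whole value, and `matches_pattern` is a hypothetical helper, not part of Qarion ETL.

```python
import re

# The pattern as the regex engine sees it after TOML decoding.
EMAIL_PATTERN = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"


def matches_pattern(value: str, pattern: str) -> bool:
    """Return True when the entire value matches the contract pattern."""
    # Assumption: a partial (substring) match does not satisfy the contract.
    return re.fullmatch(pattern, value) is not None
```

Under these assumptions, `matches_pattern("alice@example.com", EMAIL_PATTERN)` is true, while a value without a top-level domain such as `"alice@example"` is rejected.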
Enum Values
[[properties.contract.columns]]
name = "status"
schema_type = "string"
enum_values = ["pending", "completed", "cancelled"]
String Length
[[properties.contract.columns]]
name = "product_code"
schema_type = "string"
min_length = 5
max_length = 20
Constraints
[[properties.contract.columns]]
name = "order_id"
constraints = ["UNIQUE", "NOT NULL"]
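The column settings above can be thought of as independent checks applied to each value. The following sketch shows one possible way to evaluate them in Python. `check_value` is a hypothetical function, the default of `nullable = true` is an assumption, and pattern and UNIQUE checks are left out to keep it short (UNIQUE needs the whole column, not a single value).

```python
from __future__ import annotations

from typing import Any


def check_value(column: dict[str, Any], value: Any) -> list[str]:
    """Return violation messages for one value against one contract column."""
    name = column["name"]
    if value is None:
        # Assumption: columns are nullable unless the contract says otherwise.
        return [] if column.get("nullable", True) else [f"{name}: NULL is not allowed"]

    errors = []
    if "min_value" in column and value < column["min_value"]:
        errors.append(f"{name}: Value {value} is below minimum value {column['min_value']}")
    if "max_value" in column and value > column["max_value"]:
        errors.append(f"{name}: Value {value} exceeds maximum value {column['max_value']}")
    if "enum_values" in column and value not in column["enum_values"]:
        errors.append(
            f"{name}: Value {value!r} is not in allowed enum values {column['enum_values']}"
        )
    if "min_length" in column and len(value) < column["min_length"]:
        errors.append(f"{name}: Length {len(value)} is below minimum length {column['min_length']}")
    if "max_length" in column and len(value) > column["max_length"]:
        errors.append(f"{name}: Length {len(value)} exceeds maximum length {column['max_length']}")
    return errors
```

Each check only runs when its key is present, so a column that defines just `enum_values` is not subject to range or length checks.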
Automatic Validation
Contracts are automatically validated after ingestion if:
- A contract is configured in the dataset properties
- Data is successfully loaded into the dataset
- The contract validation feature is enabled
Validation Process:
- After successful ingestion, Qarion ETL identifies the target tables that were loaded
- For each loaded dataset, it checks for a contract configuration
- If one is found, it validates the dataset's schema and data against the contract
- Logs results and handles failures based on contract mode
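The schema part of this process can be sketched as a comparison between the contract's columns and the loaded table's columns. The example below is a simplified illustration and not Qarion ETL's internal code: `check_schema` is a hypothetical function, and the table schema is assumed to be available as a plain mapping of column name to type name.

```python
from __future__ import annotations

from typing import Any


def check_schema(
    contract_columns: list[dict[str, Any]],
    table_schema: dict[str, str],
) -> list[str]:
    """Compare contract columns with a table's {column_name: schema_type} mapping."""
    errors = []
    for col in contract_columns:
        name = col["name"]
        if name not in table_schema:
            # Only required columns are reported when absent.
            if col.get("required", False):
                errors.append(f"{name}: required column is missing")
            continue
        expected = col.get("schema_type")
        if expected is not None and table_schema[name] != expected:
            errors.append(f"{name}: expected type {expected}, found {table_schema[name]}")
    return errors
```

An empty result means the table satisfies the contract's schema requirements. Any messages returned would then be logged or raised according to the contract mode.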
Example Flow with Contract Validation:
Flow Definition:
# flows/orders_flow.toml
id = "orders_flow"
name = "Orders Processing Flow"
flow_type = "change_feed"
namespace = "production"
[input]
primary_key = ["order_id"]
columns = [
{name = "order_id", schema_type = "string", required = true},
{name = "customer_id", schema_type = "string", required = true},
{name = "amount", schema_type = "float", required = true},
{name = "order_date", schema_type = "date", required = true},
{name = "status", schema_type = "string", required = true}
]
[properties]
change_detection_columns = ["amount", "status"]
[properties.load]
source_path = "data/orders"
file_pattern = "orders_*.csv"
format = "csv"
# TOML inline tables must be written on a single line with comma-separated pairs
loader_config = { delimiter = ",", header = true, encoding = "utf-8" }
Landing Dataset with Contract:
# datasets/orders_flow_landing.toml
id = "orders_flow_landing"
name = "Orders Landing Table"
namespace = "production"
[columns]
[columns.order_id]
schema_type = "string"
required = true
primary_key = true
[columns.customer_id]
schema_type = "string"
required = true
[columns.amount]
schema_type = "float"
required = true
[columns.order_date]
schema_type = "date"
required = true
[columns.status]
schema_type = "string"
required = true
[properties]
table_type = "landing"
# Contract validation configuration
[properties.contract]
id = "orders_contract"
name = "Orders Data Contract"
mode = "strict"
enabled = true
[[properties.contract.columns]]
name = "order_id"
schema_type = "string"
required = true
nullable = false
min_length = 1
max_length = 50
[[properties.contract.columns]]
name = "customer_id"
schema_type = "string"
required = true
nullable = false
[[properties.contract.columns]]
name = "amount"
schema_type = "float"
required = true
nullable = false
min_value = 0
max_value = 1000000
[[properties.contract.columns]]
name = "order_date"
schema_type = "date"
required = true
nullable = false
[[properties.contract.columns]]
name = "status"
schema_type = "string"
required = true
nullable = false
enum_values = ["pending", "completed", "cancelled"]
Validation Results
When contract validation runs, you'll see logs like:
INFO: Contract validation passed for dataset orders_flow_landing (contract: orders_contract)
Or if validation fails:
ERROR: Contract validation failed for dataset orders_flow_landing (contract: orders_contract): 2 errors, 0 warnings
In strict mode, the flow will fail with detailed error messages:
ValueError: Contract validation failed for dataset orders_flow_landing (contract: orders_contract):
amount: Value 1500000 exceeds maximum value 1000000,
status: Value 'invalid' is not in allowed enum values ['pending', 'completed', 'cancelled']
Best Practices
- Start with Monitor Mode: Use monitor mode initially to understand data quality issues without blocking execution
- Define Clear Contracts: Be specific about requirements (ranges, patterns, enums)
- Use Strict Mode for Critical Data: Apply strict mode to datasets where data quality is essential
- Document Contracts: Add descriptions to contract columns for clarity
- Reuse Contracts: Use contract ID references for common contract definitions
- Validate Early: Configure contracts on landing tables to catch issues at ingestion
Related Documentation
- Data Ingestion - How data is loaded into datasets
- Data Quality - Quality check system
- Flows - Flow execution and validation