DQ Config (YAML)

Qarion supports configuration-driven data quality through YAML files. Rather than creating checks one-by-one through the UI or API, you define all your quality rules in a single version-controlled file, then use the CLI or SDK to sync them to the platform and execute them.

This approach brings several benefits:

  • Version control — Quality rules live alongside your data transformation code, so every change is tracked in Git with a full audit trail.
  • Code review — Rule changes go through pull requests, ensuring that thresholds, queries, and schedules are reviewed before deployment.
  • Reproducibility — A new environment can be bootstrapped by running qarion quality apply against the same config file.
  • CI/CD integration — Quality gates can be embedded directly into deployment pipelines.

File Format

A DQ config file is a YAML document with the following top-level keys:

| Key      | Type   | Required | Description                                                  |
|----------|--------|----------|--------------------------------------------------------------|
| version  | string | no       | Config schema version (default "1.0")                        |
| space    | string | yes      | Target space slug — all checks will be created in this space |
| defaults | object | no       | Default values inherited by every check in the file          |
| checks   | list   | yes      | One or more check definitions (minimum 1)                    |
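
For orientation, the sketch below shows how these top-level keys fit together; the space, connector, and check values are placeholders borrowed from the examples later on this page.

version: "1.0"
space: acme-analytics

defaults:
  connector: warehouse-snowflake
  schedule: "0 6 * * *"

checks:
  - slug: orders-exist
    name: Orders Table Has Rows
    type: sql_metric
    query: "SELECT COUNT(*) FROM analytics.orders"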

Defaults

The defaults block lets you set values that every check inherits unless explicitly overridden. This avoids repeating the same connector or schedule across dozens of checks.

| Key       | Type   | Description                          |
|-----------|--------|--------------------------------------|
| connector | string | Default connector slug for execution |
| schedule  | string | Default cron schedule expression     |
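
For example, the following defaults block gives every check the same Snowflake connector and a daily 6 AM schedule; an individual check can still override either value by setting the same key on the check itself, as the full example later on this page does with schedule.

defaults:
  connector: warehouse-snowflake
  schedule: "0 6 * * *"   # every day at 06:00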

Check Definition

Each entry in the checks list defines a single quality rule:

| Key           | Type   | Required | API Mapping      | Description                                                    |
|---------------|--------|----------|------------------|----------------------------------------------------------------|
| slug          | string | yes      | slug             | Unique identifier within the space (URL-safe)                  |
| name          | string | yes      | name             | Human-readable display name                                    |
| type          | string | yes      | check_type       | Check type (see Supported Check Types)                         |
| description   | string | no       | description      | Explains the purpose of the check                              |
| query         | string | no       | query            | SQL query for SQL-based check types                            |
| connector     | string | no       | —                | Connector slug (overrides the default; resolved at apply time) |
| product       | string | no       | product_slug     | Target data product slug to link the check to                  |
| schedule      | string | no       | schedule_cron    | Cron expression (overrides the default)                        |
| thresholds    | object | no       | threshold_config | Pass/fail threshold configuration (see Thresholds)             |
| configuration | object | no       | configuration    | Type-specific configuration (see per-type docs below)          |
| parameters    | list   | no       | parameters       | Parameterized query variable definitions (see Parameters)      |
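
As an illustration, a single check using most of these keys might look like the sketch below; the slugs and values are placeholders in the style of the examples later on this page, and the connector and schedule lines override whatever the defaults block provides.

- slug: orders-row-count
  name: Orders Row Count
  type: sql_metric
  description: "Ensure orders table is populated"
  query: "SELECT COUNT(*) FROM analytics.orders"
  connector: warehouse-snowflake   # overrides defaults.connector
  product: orders-table
  schedule: "0 8 * * *"            # overrides defaults.schedule
  thresholds:
    operator: gte
    value: 1000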

Supported Check Types

The type field accepts any of the following values, organized into logical groups.

SQL Checks

These checks execute arbitrary SQL against your data source.

sql_metric

Runs a query that returns a single numeric value, then evaluates it against a threshold. This is the most flexible check type for custom business rules.

- slug: orders-row-count
  name: Orders Row Count
  type: sql_metric
  query: "SELECT COUNT(*) FROM analytics.orders"
  product: orders-table
  thresholds:
    operator: gte
    value: 1000

sql_condition

Runs a query and fails if any rows are returned. Useful for asserting "this should never happen" conditions.

- slug: no-negative-amounts
  name: No Negative Order Amounts
  type: sql_condition
  query: "SELECT * FROM analytics.orders WHERE amount < 0"
  product: orders-table

Field-Level Checks

Field-level checks target a single column in a table. They require a configuration block with field_name and table_name.

null_check

Fails if any null values exist in the specified column. Result is expressed as a percentage (0% nulls = pass).

- slug: users-email-not-null
  name: Users Email Not Null
  type: null_check
  product: users-table
  configuration:
    field_name: email
    table_name: analytics.users

uniqueness

Fails if duplicate values exist in the column. Measures the count of duplicates (0 = pass).

- slug: users-id-unique
  name: Users ID Uniqueness
  type: uniqueness
  product: users-table
  configuration:
    field_name: id
    table_name: analytics.users

type_check

Validates that values in a column conform to an expected data type or format.

- slug: orders-amount-numeric
  name: Orders Amount Type Check
  type: type_check
  product: orders-table
  configuration:
    field_name: amount
    table_name: analytics.orders
    expected_type: numeric

range_check

Validates that numeric values fall within an expected range.

- slug: orders-amount-range
  name: Orders Amount Range
  type: range_check
  product: orders-table
  configuration:
    field_name: amount
    table_name: analytics.orders
    min_value: 0
    max_value: 1000000

pattern_check

Validates that string values match a regular expression pattern.

- slug: users-email-format
  name: Email Format Validation
  type: pattern_check
  product: users-table
  configuration:
    field_name: email
    table_name: analytics.users
    pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"

enum_check

Validates that values belong to an allowed set.

- slug: orders-status-enum
  name: Order Status Validation
  type: enum_check
  product: orders-table
  configuration:
    field_name: status
    table_name: analytics.orders
    allowed_values:
      - pending
      - confirmed
      - shipped
      - delivered
      - cancelled

length_check

Validates that string lengths fall within expected bounds.

- slug: users-name-length
  name: User Name Length
  type: length_check
  product: users-table
  configuration:
    field_name: display_name
    table_name: analytics.users
    min_length: 1
    max_length: 255

freshness_check

Validates that the most recent timestamp in a column is not older than a threshold.

- slug: orders-freshness
  name: Orders Table Freshness
  type: freshness_check
  product: orders-table
  configuration:
    field_name: updated_at
    table_name: analytics.orders
    max_age_hours: 24

Composite Checks

field_checks

Bundles multiple field-level assertions against a single table into one check. The platform executes all assertions in a single table scan for optimal performance.

- slug: users-field-suite
  name: Users Field Quality Suite
  type: field_checks
  product: users-table
  configuration:
    table_name: analytics.users
    checks:
      - field: id
        assertion: uniqueness
      - field: email
        assertion: not_null
      - field: created_at
        assertion: not_null

reconciliation

Compares results from two SQL queries across the same or different data sources. Supports tolerance-based comparison for floating point values.

- slug: revenue-reconciliation
  name: Revenue Reconciliation
  type: reconciliation
  configuration:
    source_query: "SELECT SUM(amount) FROM staging.revenue"
    target_query: "SELECT SUM(amount) FROM prod.revenue"
    comparison_mode: percentage  # exact | percentage | absolute
    tolerance: 0.01

Other Check Types

| Type    | Description                                                        |
|---------|--------------------------------------------------------------------|
| anomaly | Statistical anomaly detection on a metric time series              |
| custom  | Fully custom check logic (typically used with external execution)  |
| manual  | Human-entered value — prompts for manual input when triggered      |
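
Minimal entries for these types use only the universal keys from the Check Definition table; the slugs and names below are hypothetical, and any type-specific configuration they may need is not covered on this page.

- slug: weekly-revenue-anomaly   # hypothetical example
  name: Weekly Revenue Anomaly
  type: anomaly
  product: orders-table

- slug: quarterly-signoff        # hypothetical example
  name: Quarterly Data Sign-off
  type: manual
  product: orders-table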

Thresholds

The thresholds object defines the pass/fail criteria for checks that produce a numeric value (primarily sql_metric). If omitted, the platform uses type-specific default evaluation logic.

| Key      | Type   | Description                                                        |
|----------|--------|--------------------------------------------------------------------|
| operator | string | Comparison operator                                                |
| value    | number | Threshold value                                                    |
| warn     | number | Optional warning threshold (produces a warning instead of a fail)  |

Supported Operators

| Operator | Meaning                  | Example                                                        |
|----------|--------------------------|----------------------------------------------------------------|
| eq       | Equal to                 | Value must be exactly 100                                      |
| gte      | Greater than or equal    | Value must be ≥ 1000                                           |
| lte      | Less than or equal       | Value must be ≤ 5                                              |
| gt       | Greater than             | Value must be > 0                                              |
| lt       | Less than                | Value must be < 100                                            |
| between  | Within range             | Requires min and max instead of value (see the example below)  |

Example with Warning Threshold

thresholds:
  operator: gte
  value: 1000   # fail below this
  warn: 5000    # warn below this but above 1000

In this example, a value of 800 produces a fail, 3000 produces a warning, and 6000 produces a pass.
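
Example with Between Operator

The between operator bounds the value on both sides, taking min and max in place of value as described in the operator table; the bounds below are placeholders.

thresholds:
  operator: between
  min: 100
  max: 10000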


Parameters

Parameterized queries let you define variables that are substituted at execution time. This is useful for date-partitioned checks or environment-specific values.

- slug: daily-row-count
  name: Daily Row Count
  type: sql_metric
  query: "SELECT COUNT(*) FROM analytics.events WHERE event_date = '{{run_date}}'"
  thresholds:
    operator: gte
    value: 10000
  parameters:
    - name: run_date
      type: string
      default: "2024-01-15"
      description: "Target date partition"

Variables use double-brace syntax ({{variable_name}}) in the query and are resolved from the parameters list at runtime. You can override parameter values when triggering a run through the SDK or CLI.
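
As a sketch of an override through the SDK (with client and config set up as in the SDK Usage section below), the call here assumes run_config accepts a parameters mapping; the keyword name is an assumption for illustration, not part of the documented signature.

# Assumption: run_config takes a `parameters` mapping of override values.
results = client.quality.run_config(
    config,
    parameters={"run_date": "2024-02-01"},  # replaces the default "2024-01-15"
)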


Examples

Minimal Config

The simplest possible config file defines a single check:

version: "1.0"
space: acme-analytics

checks:
  - slug: orders-exist
    name: Orders Table Has Rows
    type: sql_metric
    query: "SELECT COUNT(*) FROM analytics.orders"
    thresholds:
      operator: gte
      value: 1

Full Config

A production-grade config file with defaults, multiple check types, and a shared connector:

version: "1.0"
space: acme-analytics

defaults:
  connector: warehouse-snowflake
  schedule: "0 6 * * *"

checks:
  # SQL metric with thresholds
  - slug: orders-row-count
    name: Orders Row Count
    type: sql_metric
    description: "Ensure orders table is populated"
    query: "SELECT COUNT(*) FROM analytics.orders"
    product: orders-table
    schedule: "0 8 * * *"  # overrides the default 6 AM
    thresholds:
      operator: gte
      value: 500    # fail below 500
      warn: 1000    # warn between 500 and 1000

  # Field-level null check
  - slug: users-no-null-emails
    name: Users Email Null Check
    type: null_check
    product: users-table
    configuration:
      field_name: email
      table_name: analytics.users

  # Condition check — fails if any rows returned
  - slug: no-orphaned-orders
    name: No Orphaned Orders
    type: sql_condition
    query: >
      SELECT o.id
      FROM analytics.orders o
      LEFT JOIN analytics.customers c ON o.customer_id = c.id
      WHERE c.id IS NULL

  # Cross-source reconciliation
  - slug: revenue-reconciliation
    name: Revenue Reconciliation
    type: reconciliation
    configuration:
      source_query: "SELECT SUM(amount) FROM staging.revenue"
      target_query: "SELECT SUM(amount) FROM prod.revenue"
      comparison_mode: percentage
      tolerance: 0.01

Multi-Product Config

You can define checks across multiple products in the same file, as long as they share the same space:

version: "1.0"
space: acme-analytics

defaults:
  connector: warehouse-snowflake

checks:
  - slug: customers-freshness
    name: Customers Freshness
    type: freshness_check
    product: customers-table
    schedule: "0 7 * * *"
    configuration:
      field_name: updated_at
      table_name: analytics.customers
      max_age_hours: 24

  - slug: orders-freshness
    name: Orders Freshness
    type: freshness_check
    product: orders-table
    schedule: "0 8 * * *"
    configuration:
      field_name: created_at
      table_name: analytics.orders
      max_age_hours: 12

  - slug: events-uniqueness
    name: Events ID Uniqueness
    type: uniqueness
    product: events-stream
    configuration:
      field_name: event_id
      table_name: analytics.events

Workflow

A typical workflow involves three steps: validate, apply, and run.

1. Validate

Check the file for structural errors and verify that referenced connectors and products exist on the platform:

qarion quality validate -f qarion-dq.yaml

This command parses the YAML, validates all field types, and checks that the target space, connectors, and products are resolvable. It does not create or modify anything.

2. Apply

Sync definitions to the platform — creates missing checks and updates existing ones:

qarion quality apply -f qarion-dq.yaml

The apply command is idempotent. Running it multiple times with the same config produces no changes. It matches checks by slug within the target space.
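
For example, changing a check's display name or threshold while keeping its slug causes apply to update the existing check in place, whereas a changed slug is treated as a brand-new check. A sketch reusing the earlier orders-row-count example:

- slug: orders-row-count           # unchanged slug: matched to the existing check
  name: Orders Row Count (Daily)   # renamed: updated in place on the next apply
  type: sql_metric
  query: "SELECT COUNT(*) FROM analytics.orders"
  thresholds:
    operator: gte
    value: 2000                    # raised threshold: also updated in place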

3. Run

Execute all checks defined in the config and record results:

qarion quality run-config -f qarion-dq.yaml

Use --no-record to execute checks without persisting results to the platform (useful for local testing):

qarion quality run-config -f qarion-dq.yaml --no-record

SDK Usage

The same workflow is available programmatically through the Python SDK:

from qarion import QarionSyncClient
from qarion.models.dq_config import DqConfig

# Parse the YAML file
config = DqConfig.from_yaml("qarion-dq.yaml")

client = QarionSyncClient(api_key="qk_...")

# Step 1: Validate
errors = client.quality.validate_config(config)
if errors:
    for err in errors:
        print(f"Error: {err}")

# Step 2: Apply (upsert)
summary = client.quality.apply_config(config)
print(summary) # {"created": [...], "updated": [...], "unchanged": [...]}

# Step 3: Run
results = client.quality.run_config(config)
for r in results:
    print(f"{r.status}: {r.value}")

Use record_results=False to skip recording:

results = client.quality.run_config(config, record_results=False)

CI/CD Integration

GitHub Actions

Add a quality gate to your deployment pipeline that validates and runs checks after each push:

name: Data Quality Gate

on:
  push:
    branches: [main]
    paths:
      - "qarion-dq.yaml"
      - "dbt/**"

jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - run: pip install qarion-cli

      - name: Validate config
        run: qarion quality validate -f qarion-dq.yaml
        env:
          QARION_API_KEY: ${{ secrets.QARION_API_KEY }}

      - name: Apply check definitions
        run: qarion quality apply -f qarion-dq.yaml
        env:
          QARION_API_KEY: ${{ secrets.QARION_API_KEY }}

      - name: Run quality checks
        run: qarion quality run-config -f qarion-dq.yaml
        env:
          QARION_API_KEY: ${{ secrets.QARION_API_KEY }}

Airflow / Orchestrator

Trigger a config-driven quality gate as a task in your pipeline:

from airflow.operators.python import PythonOperator

def run_quality_gate():
    from qarion import QarionSyncClient
    from qarion.models.dq_config import DqConfig

    config = DqConfig.from_yaml("/opt/airflow/dags/qarion-dq.yaml")
    client = QarionSyncClient(api_key="qk_...")

    results = client.quality.run_config(config)
    failed = [r for r in results if not r.is_passed]
    if failed:
        raise Exception(f"{len(failed)} quality check(s) failed")

quality_gate = PythonOperator(
    task_id="quality_gate",
    python_callable=run_quality_gate,
)