Data Quality Engine

The Data Quality (DQ) Engine is the automated heart of the platform, responsible for executing checks, detecting anomalies, and generating alerts.

Architecture

1. Provider-Based Execution

The engine supports multiple execution backends via a Provider Architecture; a minimal interface sketch follows the list below.

  • SQL Provider: Executes SQL queries directly against a target data warehouse (Snowflake, BigQuery, Postgres).
  • dbt Provider: Integrates with dbt tests to import results.
  • Python Provider: Runs custom Python logic for complex validations.
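What a provider looks like in code is not specified here, so the following is a minimal sketch of the abstraction under stated assumptions: the names CheckProvider, CheckResult, and SQLProvider, and the shape of the check configuration, are illustrative, not the platform's actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CheckResult:
    # Hypothetical result shape: pass/fail plus the measured value.
    passed: bool
    observed_value: float | None = None
    message: str = ""


class CheckProvider(ABC):
    """Illustrative base class that each execution backend would implement."""

    @abstractmethod
    def run(self, check_config: dict) -> CheckResult:
        ...


class SQLProvider(CheckProvider):
    """Sketch of the SQL backend: run a query against the warehouse and compare."""

    def __init__(self, connection):
        self.connection = connection

    def run(self, check_config: dict) -> CheckResult:
        # Execute the check's SQL and compare the scalar result to a threshold.
        row = self.connection.execute(check_config["sql"]).fetchone()
        observed = float(row[0])
        passed = observed <= check_config.get("max_allowed", 0)
        return CheckResult(passed=passed, observed_value=observed)
```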

2. Execution Model

Checks can be executed:

  • Scheduled: Via cron expressions managed by the internal scheduler (see the scheduling sketch after this list).
  • Ad-hoc: Triggered manually by users.
  • Event-driven: Triggered via API or Webhook.
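For the scheduled path, the internal scheduler is not described beyond "cron expressions", so here is one way such a loop could work, using the croniter library to compute the next due time; the check dictionaries and the enqueue callable are assumptions for illustration.

```python
import asyncio
from datetime import datetime, timezone

from croniter import croniter


async def schedule_loop(checks: list[dict], enqueue) -> None:
    """Naive scheduler sketch: enqueue each check whenever its cron expression is due."""
    next_runs = {
        c["id"]: croniter(c["cron"], datetime.now(timezone.utc)).get_next(datetime)
        for c in checks
    }
    while True:
        now = datetime.now(timezone.utc)
        for check in checks:
            if next_runs[check["id"]] <= now:
                # Ad-hoc and event-driven triggers would call the same enqueue path.
                await enqueue(check["id"])
                next_runs[check["id"]] = croniter(check["cron"], now).get_next(datetime)
        await asyncio.sleep(30)
```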

3. Asynchronous Processing

Check execution is always asynchronous and follows the flow below (sketched in code after the steps).

  1. Request: API receives a request to run a check.
  2. Queue: A task is pushed to the Redis queue (Arq).
  3. Worker: A worker picks up the task, instantiates the appropriate provider, and executes the logic.
  4. Result: Results are written to the database and analyzed for failure conditions.
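Since the queue is Redis-backed via Arq, the four steps above might map onto code roughly as follows; load_check, get_provider, and save_result are hypothetical stand-ins for the platform's internals.

```python
from arq import create_pool
from arq.connections import RedisSettings


# Hypothetical stand-ins for platform internals (not the real API).
async def load_check(check_id: int) -> dict: ...
def get_provider(name: str): ...
async def save_result(check_id: int, result) -> None: ...


# Steps 1-2: the API handler receives the request and pushes a task onto the Redis queue.
async def trigger_check(check_id: int) -> None:
    pool = await create_pool(RedisSettings())
    await pool.enqueue_job("run_check", check_id)


# Steps 3-4: a worker picks up the task, instantiates the provider, executes it,
# and persists the result for failure analysis.
async def run_check(ctx: dict, check_id: int) -> None:
    check = await load_check(check_id)
    provider = get_provider(check["provider"])
    result = provider.run(check)
    await save_result(check_id, result)


class WorkerSettings:
    functions = [run_check]
    redis_settings = RedisSettings()
```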

Alerting Integration

When a check fails, the engine integrates with the Unified Annotation System; the handoff is sketched after the list below.

  • Alert Generation: A DQAlert is created.
  • Notification: The Notification Service is triggered to dispatch alerts to configured channels (Slack, Email).
  • Remediation: Users can annotate the alert, assign it to a user, or link it to a Jira ticket.
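A hedged sketch of that handoff: DQAlert and the channel names come from the description above, but the fields, the notifier interface, and handle_failure are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DQAlert:
    check_id: int
    message: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    assignee: str | None = None      # remediation: alert can be assigned to a user
    jira_ticket: str | None = None   # remediation: alert can be linked to a Jira ticket


def handle_failure(check_id: int, message: str, notifier) -> DQAlert:
    # Alert generation: create the DQAlert for the failed check.
    alert = DQAlert(check_id=check_id, message=message)
    # Notification: dispatch to the configured channels (Slack, Email).
    for channel in ("slack", "email"):
        notifier.send(channel, alert)
    return alert
```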

Consistency & Isolation

Standard #111-A: Strict Scoping

All execution logic is strictly scoped to the Space and Dataset to prevent cross-tenant data leakage. The lazy='raise' standard ensures that related data is never fetched implicitly during execution: any relationship that has not been explicitly loaded raises instead of issuing a query.
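The lazy='raise' convention matches SQLAlchemy's raise loader strategy, so a scoping sketch might look like the following; the models and column names are illustrative, not the platform's actual schema.

```python
from sqlalchemy import ForeignKey, select
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class Space(Base):
    __tablename__ = "space"
    id: Mapped[int] = mapped_column(primary_key=True)


class Dataset(Base):
    __tablename__ = "dataset"
    id: Mapped[int] = mapped_column(primary_key=True)


class Check(Base):
    __tablename__ = "dq_check"
    id: Mapped[int] = mapped_column(primary_key=True)
    space_id: Mapped[int] = mapped_column(ForeignKey("space.id"))
    dataset_id: Mapped[int] = mapped_column(ForeignKey("dataset.id"))
    # lazy="raise": touching this relationship without an explicit eager load raises,
    # so unrelated rows are never fetched implicitly during execution.
    dataset: Mapped["Dataset"] = relationship(lazy="raise")


def checks_for_run(session, space_id: int, dataset_id: int) -> list[Check]:
    # Every query is scoped to the caller's Space and Dataset.
    stmt = select(Check).where(
        Check.space_id == space_id,
        Check.dataset_id == dataset_id,
    )
    return list(session.scalars(stmt).all())
```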