Data Quality Engine
The Data Quality (DQ) Engine is the automated heart of the platform, responsible for executing checks, detecting anomalies, and generating alerts.
Architecture
1. Provider-Based Execution
The engine supports multiple execution backends via a Provider Architecture.
- SQL Provider: Executes SQL queries directly against a target data warehouse (Snowflake, BigQuery, Postgres).
- dbt Provider: Integrates with dbt tests to import results.
- Python Provider: Runs custom Python logic for complex validations.
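The provider contract can be sketched as a small interface that each backend implements. The names below (`Provider`, `CheckResult`, `get_provider`) are illustrative, not the platform's actual classes:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of one check execution (hypothetical shape)."""
    passed: bool
    detail: str


class Provider(ABC):
    """Common interface implemented by the SQL, dbt, and Python backends."""

    @abstractmethod
    def run(self, check_config: dict) -> CheckResult: ...


class SQLProvider(Provider):
    """Executes a SQL query against the target warehouse."""

    def run(self, check_config: dict) -> CheckResult:
        # A real implementation would open a warehouse connection
        # (Snowflake, BigQuery, Postgres); here we only show the contract.
        query = check_config["query"]
        return CheckResult(passed=True, detail=f"ran: {query}")


def get_provider(kind: str) -> Provider:
    """Registry lookup by provider kind; the key names are illustrative."""
    providers = {"sql": SQLProvider}
    return providers[kind]()
```

Because every backend returns the same `CheckResult` shape, the worker that consumes results does not need to know which provider produced them.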
2. Execution Model
Checks can be executed:
- Scheduled: Via cron expressions managed by the internal scheduler.
- Ad-hoc: Triggered manually by users.
- Event-driven: Triggered via API or Webhook.
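Regardless of how a run is triggered, all three paths can converge on the same task payload. A minimal sketch, assuming a payload shape like the following (field names are hypothetical):

```python
from enum import Enum


class Trigger(Enum):
    SCHEDULED = "scheduled"  # fired by the internal cron scheduler
    AD_HOC = "ad_hoc"        # manual run triggered by a user
    EVENT = "event"          # API or webhook call


def enqueue_check(check_id: str, trigger: Trigger) -> dict:
    """Build the task payload pushed onto the queue.

    Recording the trigger lets downstream alerting and audit logs
    distinguish scheduled failures from ad-hoc experiments.
    """
    return {"check_id": check_id, "trigger": trigger.value}
```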
3. Asynchronous Processing
Check execution is always asynchronous.
- Request: API receives a request to run a check.
- Queue: A task is pushed to the Redis queue (Arq).
- Worker: A worker picks up the task, instantiates the appropriate provider, and executes the logic.
- Result: Results are written to the database and analyzed for failure conditions.
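The request → queue → worker → result pipeline can be illustrated with an in-process stand-in for the Redis (Arq) queue and the results table; the real system uses Redis and a database, so everything below is a simplified sketch:

```python
import queue

task_queue: "queue.Queue[dict]" = queue.Queue()  # stand-in for the Redis (Arq) queue
results: list = []                               # stand-in for the results table


def submit_check(check_id: str) -> None:
    """Request step: the API pushes a task and returns immediately."""
    task_queue.put({"check_id": check_id})


def worker_loop_once() -> None:
    """Worker step: pop one task, execute it, persist the result."""
    task = task_queue.get()
    # A real worker would instantiate the configured provider here
    # and run its logic against the target backend.
    outcome = {"check_id": task["check_id"], "passed": True}
    # Result step: write to storage, then analyze for failure conditions.
    results.append(outcome)
```

The key property is that `submit_check` never blocks on execution: the API acknowledges the request as soon as the task is queued, and the worker processes it later.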
Alerting Integration
When a check fails, the engine integrates with the Unified Annotation System.
- Alert Generation: A DQAlert is created.
- Notification: The Notification Service is triggered to dispatch alerts to configured channels (Slack, Email).
- Remediation: Users can annotate the alert, assign it to a user, or link it to a Jira ticket.
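The failure path can be sketched as follows; the field names on `DQAlert` and the `on_check_failed` hook are assumptions for illustration, not the platform's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DQAlert:
    """Alert record created when a check fails (illustrative fields)."""
    check_id: str
    message: str
    channels: list = field(default_factory=lambda: ["slack", "email"])
    assignee: Optional[str] = None      # set when a user takes ownership
    jira_ticket: Optional[str] = None   # set when linked to a Jira ticket


def on_check_failed(check_id: str, message: str) -> DQAlert:
    """Create the alert; the Notification Service would then fan out
    to each configured channel in alert.channels."""
    return DQAlert(check_id=check_id, message=message)
```

Remediation then amounts to mutating the alert record: annotating it, setting `assignee`, or filling in `jira_ticket`.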
Consistency & Isolation
Standard #111-A: Strict Scoping
All execution logic is strictly scoped to the Space and Dataset to prevent cross-tenant data leakage. The lazy='raise' standard ensures that related records are never loaded implicitly: any attempt to access an unloaded relationship raises an error, so execution cannot accidentally fetch data outside its scope.
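The scoping rule means every read path must carry the tenant keys. A minimal sketch of the idea, using a plain filter function (the real implementation would apply these predicates at the ORM/query layer, with lazy='raise' guarding against implicit loads):

```python
def scoped_checks_query(checks: list, space_id: str, dataset_id: str) -> list:
    """Return only rows belonging to the caller's Space and Dataset.

    Both predicates are mandatory on every read; omitting either one
    is what would open the door to cross-tenant leakage.
    """
    return [
        c for c in checks
        if c["space_id"] == space_id and c["dataset_id"] == dataset_id
    ]
```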