Worker Resilience
Production-grade error recovery architecture for Qarion's background task processing system.
Architecture Overview
Dead-Letter Queue (DLQ)
When a task exhausts all retries, it is captured in the DeadLetterTask table with:
- Task name and arguments — Full context for debugging
- Error message and traceback — Complete failure details
- Retry count — How many attempts were made
- Created timestamp — When the task was captured
Admins can review failed tasks, understand the failure, and manually re-drive them.
Stalled Task Recovery
The cleanup_stalled_tasks cron job detects tasks stuck in RUNNING state beyond configurable thresholds:
- Default threshold: 30 minutes for standard tasks
- Stalled tasks are marked as failed and captured in the DLQ
- Notifications are sent to admins when stalled tasks are detected
Circuit Breakers
A per-task-type circuit breaker prevents cascading failures:
| State | Behavior |
|---|---|
| Closed | Normal operation — tasks execute |
| Open | Tasks fail fast and go straight to the DLQ |
| Half-open | Single probe attempt to test recovery |
Configurable failure thresholds determine when the circuit opens.
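The three states above, as an added sketch that is not Qarion's own code. The class names, the `recovery_timeout` setting, the default values, and the in-memory `dead_letters` list (standing in for the DeadLetterTask table) are assumptions; retries are omitted, so each failed call is treated as a task that has already exhausted them, and the sketch is single-process with no locking.

```python
import time
from dataclasses import dataclass

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

@dataclass
class Breaker:
    failure_threshold: int = 3      # assumed default: failures before opening
    recovery_timeout: float = 60.0  # assumed: seconds before a probe is allowed
    state: str = CLOSED
    failures: int = 0
    opened_at: float = 0.0

class TaskCircuitBreakers:
    """One breaker per task type; rejected or failed calls land in a DLQ list."""

    def __init__(self, clock=time.monotonic, **settings):
        self.clock = clock          # injectable so tests can control time
        self.settings = settings
        self.breakers = {}          # task name -> Breaker
        self.dead_letters = []      # hypothetical stand-in for DeadLetterTask rows

    def run(self, task_name, func, *args):
        b = self.breakers.setdefault(task_name, Breaker(**self.settings))
        if b.state == OPEN:
            if self.clock() - b.opened_at < b.recovery_timeout:
                # Open: fail fast without calling the task at all.
                self._dead_letter(task_name, args, "circuit open")
                return None
            b.state = HALF_OPEN     # timeout elapsed: allow a single probe
        try:
            result = func(*args)
        except Exception as exc:
            b.failures += 1
            # A failed probe reopens at once; otherwise open at the threshold.
            if b.state == HALF_OPEN or b.failures >= b.failure_threshold:
                b.state, b.opened_at = OPEN, self.clock()
            self._dead_letter(task_name, args, repr(exc))
            return None
        b.state, b.failures = CLOSED, 0   # success closes the circuit
        return result

    def _dead_letter(self, task_name, args, error):
        self.dead_letters.append({"task": task_name, "args": args, "error": error})
```

Because state is keyed by task name, a failing task type opens only its own circuit and other task types keep executing.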
Key Files
| File | Purpose |
|---|---|
| app/models/dead_letter_task.py | DeadLetterTask model |
| app/services/dead_letter_service.py | DLQ management and re-drive |
| app/worker/tasks/cleanup_dead_letter.py | DLQ expiry cleanup cron |
| app/worker/tasks/cleanup_stalled_tasks.py | Stalled task detection cron |