Skip to main content

Worker Resilience

Production-grade error recovery architecture for Qarion's background task processing system.

Architecture Overview

Dead-Letter Queue (DLQ)

When a task exhausts all retries, it is captured in the DeadLetterTask table with:

  • Task name and arguments — Full context for debugging
  • Error message and traceback — Complete failure details
  • Retry count — How many attempts were made
  • Created timestamp — When the task was captured

Admins can review failed tasks, understand the failure, and manually re-drive them.

Stalled Task Recovery

The cleanup_stalled_tasks cron job detects tasks stuck in RUNNING state beyond configurable thresholds:

  • Default threshold: 30 minutes for standard tasks
  • Stalled tasks are marked as failed and captured in the DLQ
  • Notifications sent to admins when tasks are detected as stalled

Circuit Breakers

Per-task-type circuit breaker pattern prevents cascading failures:

StateBehavior
ClosedNormal operation — tasks execute
OpenTasks immediately fail-fast to DLQ
Half-openSingle probe attempt to test recovery

Configurable failure thresholds determine when the circuit opens.

Key Files

FilePurpose
app/models/dead_letter_task.pyDeadLetterTask model
app/services/dead_letter_service.pyDLQ management and re-drive
app/worker/tasks/cleanup_dead_letter.pyDLQ expiry cleanup cron
app/worker/tasks/cleanup_stalled_tasks.pyStalled task detection cron