A dead letter queue (DLQ) is a holding area for jobs that have exhausted all retry attempts and still failed. Instead of silently dropping these jobs, the system moves them to a separate queue where engineers can inspect them, diagnose the problem, and decide what to do next.
Not every failure is transient. Some jobs fail because the endpoint is temporarily down — and retries fix that. But others fail because the payload is malformed, the target URL was deleted, or the API changed its contract. These jobs will never succeed no matter how many times you retry them.
Without a DLQ, you have two bad options: retry forever (wasting resources and potentially creating cascading failures) or drop the job silently (losing data and never knowing something went wrong). A DLQ gives you a third option: stop retrying, preserve the job, and flag it for human attention.
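The retry-then-dead-letter flow described above can be sketched in a few lines. The handler, the `MAX_ATTEMPTS` value, and the list-backed DLQ here are illustrative assumptions, not any particular queue system's API:

```python
# Minimal sketch: retry a job a fixed number of times, then move it to
# the DLQ with its failure context instead of dropping it silently.
MAX_ATTEMPTS = 5  # assumption: the retry budget before dead-lettering

def process_with_dlq(job, handler, dlq):
    """Try a job up to MAX_ATTEMPTS times; on final failure, dead-letter it."""
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return handler(job)
        except Exception as exc:
            last_error = str(exc)
    # Retries exhausted: preserve the job and what went wrong
    # so an engineer can inspect and replay it later.
    dlq.append({"job": job, "attempts": MAX_ATTEMPTS, "last_error": last_error})
```

A real system would add backoff between attempts and persist the DLQ entry durably; the point is that the failed job and its error survive.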
Look at the job payload, the error message, and the response from the last attempt. Most DLQ systems preserve the full history of attempts so you can see what went wrong.
Is the endpoint down? Fix it. Is the payload malformed? Correct the data. Did the API change? Update your integration. The DLQ tells you what failed — you still need to figure out why.
Once the root cause is fixed, replay the job from the DLQ. This re-enqueues it for processing. Make sure the job is idempotent — if it partially succeeded before failing, replaying it shouldn't create duplicates.
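One way to make replay safe is a dedupe check keyed by job id. The in-memory set below stands in for whatever your system actually uses (a unique-constraint column, an idempotency-key header, and so on); the names are illustrative:

```python
# Sketch of idempotent replay: skip work that has already been applied,
# so re-enqueuing a partially-succeeded job cannot create duplicates.
processed_ids = set()  # assumption: stand-in for a durable dedupe store

def replay(entry, handler):
    """Re-run a dead-lettered job, skipping jobs that already succeeded."""
    job = entry["job"]
    if job["id"] in processed_ids:
        return "skipped"   # a partial earlier success must not be repeated
    handler(job)
    processed_ids.add(job["id"])
    return "replayed"
```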
Some DLQ jobs are genuinely stale — the data is outdated, the event no longer matters, or a newer job supersedes it. In these cases, acknowledge and discard.
Error logs tell you something went wrong. A DLQ preserves the actual job so you can do something about it. You can replay a DLQ job. You can't replay a log line.
DLQs also give you metrics: how many jobs are failing permanently? Which endpoints are the worst offenders? Is the DLQ growing or shrinking? These signals help you spot systemic issues before they become outages.
Your app processes payments via a background job. The payment provider's API goes down. The job fails, retries with exponential backoff — 1 second, 2 seconds, 4 seconds, 8 seconds, 16 seconds. After 5 attempts over about 30 seconds, the job lands in the DLQ.
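That delay schedule is just doubling from a base interval. A one-liner reproduces it (parameter names are illustrative):

```python
def backoff_delays(base=1, attempts=5):
    """Delay in seconds after the n-th failed attempt: base * 2^n."""
    return [base * 2 ** n for n in range(attempts)]

delays = backoff_delays()   # [1, 2, 4, 8, 16]
```

The five delays sum to 31 seconds, which is where "about 30 seconds" comes from.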
Your monitoring alerts you. You check the DLQ and see 47 payment jobs waiting. The provider comes back online 10 minutes later. You replay all 47 jobs, and every payment processes successfully because the jobs are idempotent (each uses a unique payment intent ID).

A dead letter queue is a special queue where jobs go after they've failed all their retry attempts. It preserves failed jobs so you can investigate, fix the underlying issue, and replay them — instead of losing them forever.
Most queue systems let you move messages from the DLQ back to the main queue for reprocessing. The exact mechanism depends on your system — SQS has a "redrive" feature, Sidekiq has a retry button, and Recuro lets you replay failed jobs from the execution log.
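In the abstract, a redrive is just draining the DLQ back onto the main queue. A minimal sketch using in-memory deques (real systems do this durably and often in batches):

```python
from collections import deque

def redrive(dlq, main_queue):
    """Move every dead-lettered job back onto the main queue for reprocessing."""
    moved = 0
    while dlq:
        entry = dlq.popleft()
        main_queue.append(entry["job"])
        moved += 1
    return moved
```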
There's no universal answer — it depends on the job's importance and the likely failure mode. Common configurations range from 3 to 10 retries. Combine with exponential backoff so retries are spaced out over minutes or hours, not seconds.
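To see the trade-off, it helps to compute the total wait a given configuration implies. This hypothetical helper (names and defaults are assumptions) sums the backoff delays:

```python
def total_wait(retries, base=1.0, factor=2.0):
    """Total seconds spent waiting across all retries with exponential backoff."""
    return sum(base * factor ** n for n in range(retries))
```

With a 1-second base, 3 retries wait 7 seconds in total while 10 retries wait about 17 minutes; raising the base to 60 seconds spreads even 5 retries across roughly half an hour.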
Dead letter queues are part of the job queue system. Jobs are retried with exponential backoff before hitting the DLQ. Replaying DLQ jobs safely requires idempotency, and background jobs that exhaust retries land here automatically.
Recuro handles cron scheduling, retries, alerts, and execution logs — so you can focus on building your product.