Dead Letter Queue

A dead letter queue (DLQ) is a holding area for jobs that have exhausted all retry attempts and still failed. Instead of silently dropping these jobs, the system moves them to a separate queue where engineers can inspect them, diagnose the problem, and decide what to do next.

Why dead letter queues exist

Not every failure is transient. Some jobs fail because the endpoint is temporarily down — and retries fix that. But others fail because the payload is malformed, the target URL was deleted, or the API changed its contract. These jobs will never succeed no matter how many times you retry them.

Without a DLQ, you have two bad options: retry forever (wasting resources and potentially creating cascading failures) or drop the job silently (losing data and never knowing something went wrong). A DLQ gives you a third option: stop retrying, preserve the job, and flag it for human attention.

How jobs end up in a DLQ

A job fails, the queue retries it (typically with exponential backoff), and each retry fails too. Once the job exhausts its configured retry limit, the queue moves it to the DLQ instead of dropping it, preserving the payload and the full history of attempts.

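The flow can be sketched in a few lines of Python. This is a minimal in-process sketch, assuming an in-memory list stands in for the DLQ; real queue systems persist attempts durably and run retries asynchronously.

```python
import time

def process_with_dlq(job, handler, dlq, max_retries=5, base_delay=1.0):
    """Run a job, retrying with exponential backoff; move it to the
    DLQ once all attempts are exhausted instead of dropping it."""
    attempts = []
    for attempt in range(1, max_retries + 1):
        try:
            return handler(job)
        except Exception as exc:
            attempts.append({"attempt": attempt, "error": str(exc)})
            if attempt < max_retries:
                # Exponential backoff between attempts: 1s, 2s, 4s, 8s
                time.sleep(base_delay * 2 ** (attempt - 1))
    # All retries failed: preserve the job and its attempt history
    dlq.append({"job": job, "attempts": attempts})
```

The key design choice is the last line: a permanently failing job becomes data you can inspect and replay, not a silently lost message.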
What to do with DLQ jobs

Inspect

Look at the job payload, the error message, and the response from the last attempt. Most DLQ systems preserve the full history of attempts so you can see what went wrong.
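A preserved DLQ entry might look like the record below. The field names are illustrative, not from any specific queue system; the point is that payload and per-attempt errors travel together.

```python
# Hypothetical DLQ record -- field names are illustrative.
record = {
    "job_id": "job_8421",
    "payload": {"url": "https://api.example.com/hook", "event": "order.created"},
    "attempts": [
        {"n": 1, "status": 503, "error": "Service Unavailable"},
        {"n": 2, "status": 503, "error": "Service Unavailable"},
        {"n": 3, "status": 410, "error": "Gone"},
    ],
}

# The last attempt usually carries the most useful diagnosis:
last = record["attempts"][-1]
print(f"{record['job_id']} failed: HTTP {last['status']} {last['error']}")
```

Here the shift from 503 to 410 across attempts is the diagnosis: the outage ended, but the endpoint itself is gone, so no number of retries will help.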

Fix the root cause

Is the endpoint down? Fix it. Is the payload malformed? Correct the data. Did the API change? Update your integration. The DLQ tells you what failed — you still need to figure out why.

Replay

Once the root cause is fixed, replay the job from the DLQ. This re-enqueues it for processing. Make sure the job is idempotent — if it partially succeeded before failing, replaying it shouldn't create duplicates.
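A sketch of both halves of safe replay, assuming each job carries an idempotency key (an in-memory set stands in for what would be a durable store in production):

```python
processed = set()  # sketch only: production systems use a durable store

def handle(job):
    """Process a job at most once, keyed by its idempotency key."""
    key = job["idempotency_key"]
    if key in processed:
        return "skipped"    # already completed; replaying is a no-op
    # ... perform the actual side effect here (charge, webhook, etc.) ...
    processed.add(key)
    return "processed"

def replay(dlq, main_queue):
    """Move every job from the DLQ back onto the main queue."""
    while dlq:
        main_queue.append(dlq.pop(0)["job"])
```

Replay itself is trivial; the safety comes from `handle` refusing to repeat work it has already completed.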

Discard

Some DLQ jobs are genuinely stale — the data is outdated, the event no longer matters, or a newer job supersedes it. In these cases, acknowledge and discard.

DLQ vs just logging errors

Error logs tell you something went wrong. A DLQ preserves the actual job so you can do something about it. You can replay a DLQ job. You can't replay a log line.

DLQs also give you metrics: how many jobs are failing permanently? Which endpoints are the worst offenders? Is the DLQ growing or shrinking? These signals help you spot systemic issues before they become outages.

Real example

Your app processes payments via a background job. The payment provider's API goes down. The job fails and retries with exponential backoff: 1 second, 2 seconds, 4 seconds, 8 seconds, 16 seconds. After the initial attempt and 5 retries over about 30 seconds, the job lands in the DLQ.
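The backoff schedule above is just doubling delays. A quick check of the arithmetic:

```python
def backoff_delays(retries, base=1.0):
    """Delay before retry n is base * 2**(n-1): 1, 2, 4, 8, 16 seconds."""
    return [base * 2 ** n for n in range(retries)]

delays = backoff_delays(5)
print(delays)       # [1.0, 2.0, 4.0, 8.0, 16.0]
print(sum(delays))  # 31.0 seconds of waiting in total
```

Doubling keeps early retries fast (most transient blips clear in seconds) while ensuring a sustained outage doesn't hammer the failing endpoint.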

Your monitoring alerts you. You check the DLQ, see 47 payment jobs waiting. The provider comes back online 10 minutes later. You replay all 47 jobs. Every payment processes successfully because they're idempotent (each uses a unique payment intent ID).

FAQ

What is a dead letter queue?

A dead letter queue is a special queue where jobs go after they've failed all their retry attempts. It preserves failed jobs so you can investigate, fix the underlying issue, and replay them — instead of losing them forever.

How do I replay dead letter queue messages?

Most queue systems let you move messages from the DLQ back to the main queue for reprocessing. The exact mechanism depends on your system — SQS has a "redrive" feature, Sidekiq has a retry button, and Recuro lets you replay failed jobs from the execution log.

How many retries before DLQ?

There's no universal answer — it depends on the job's importance and the likely failure mode. Common configurations range from 3 to 10 retries. Combine with exponential backoff so retries are spaced out over minutes or hours, not seconds.
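As a concrete illustration, a retry policy spread over minutes might look like this. The structure and field names are hypothetical, not any particular system's API:

```python
# Hypothetical retry policy -- names are illustrative, not a real API.
RETRY_POLICY = {
    "max_retries": 5,           # common range: 3 to 10
    "base_delay_seconds": 60,   # first retry after a minute
    "backoff_factor": 2,        # 1 min, 2 min, 4 min, 8 min, 16 min
    "on_exhausted": "dead_letter_queue",
}

def delay_for(attempt, policy=RETRY_POLICY):
    """Seconds to wait before the given retry attempt (1-indexed)."""
    return policy["base_delay_seconds"] * policy["backoff_factor"] ** (attempt - 1)

total = sum(delay_for(a) for a in range(1, RETRY_POLICY["max_retries"] + 1))
print(total)  # 1860 seconds: retries spread over roughly half an hour
```

With a minute-scale base delay, the same five retries that took 30 seconds in the payment example now cover about 31 minutes, long enough to ride out a typical provider outage.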

Dead letter queues are part of the job queue system. Jobs are retried with exponential backoff before hitting the DLQ. Replaying DLQ jobs safely requires idempotency, and background jobs that exhaust retries land here automatically.

Stop managing infrastructure. Start scheduling jobs.

Recuro handles cron scheduling, retries, alerts, and execution logs, so you can focus on building your product.

No credit card required