
Retry Policy

Quick Summary — TL;DR

  • A retry policy defines how many times to retry a failed job, how long to wait between attempts, and which errors are worth retrying.
  • Use exponential backoff with jitter as the default strategy — it prevents thundering herds and gives failing services time to recover.
  • Only retry transient errors (5xx, 408, 429); permanent errors (400, 401, 404) should fail immediately. Exhausted retries go to a dead letter queue.

A retry policy defines how your system handles failed jobs: how many times to retry, how long to wait between attempts, and which failures are worth retrying. Without a retry policy, a single transient failure — a network blip, a brief API outage — becomes permanent data loss.

Components of a retry policy

| Setting | What it controls | Typical value |
| --- | --- | --- |
| Max attempts | How many times to retry before giving up | 3–5 |
| Backoff strategy | How the delay between retries grows | Exponential backoff |
| Base delay | The initial wait before the first retry | 15–60 seconds |
| Max delay | The ceiling on how long any single retry waits | 5–30 minutes |
| Retryable errors | Which status codes or error types trigger a retry | 5xx, 408, 429, connection errors |
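These settings fit naturally into a small value object. A minimal sketch (the names are illustrative, not Recuro's API or any particular library's):

```python
from dataclasses import dataclass

@dataclass
class RetryPolicy:
    max_attempts: int = 5          # retries before giving up
    base_delay: float = 30.0       # seconds before the first retry
    max_delay: float = 600.0       # ceiling on any single delay (10 min)
    # Status codes worth retrying: server errors plus the transient 4xx pair
    retryable_statuses: frozenset = frozenset({408, 429, 500, 502, 503, 504})

policy = RetryPolicy()
```

A job runner would consult one of these per queue (or per job) before scheduling the next attempt.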

Backoff strategies

Fixed delay

Retry every N seconds regardless of attempt number. Simple but dangerous — it can overload a recovering server with evenly spaced retries.

Exponential backoff

Double the delay between each attempt (e.g., 15s, 30s, 60s, 120s). This is the standard for most background job systems. See exponential backoff for details.

Exponential backoff with jitter

Add randomness to the exponential delay to prevent synchronized retries across many jobs — a pattern known as the thundering herd problem. This is the best general-purpose strategy.
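One common way to combine the two ideas is "full jitter": compute the exponential delay, then pick a random wait anywhere between zero and that value. A sketch (parameter names are illustrative):

```python
import random

def backoff_delay(attempt: int, base: float = 30.0, cap: float = 600.0) -> float:
    """Exponential backoff with full jitter.

    attempt is 1-based. Returns a random delay in
    [0, min(cap, base * 2**(attempt - 1))] seconds.
    """
    exp = min(cap, base * (2 ** (attempt - 1)))
    return random.uniform(0, exp)
```

With a 30-second base, attempt 1 waits up to 30s, attempt 2 up to 60s, attempt 3 up to 120s, and so on until the cap. Because each job draws its own random delay, jobs that failed together don't retry together.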

Which errors to retry

Not all failures are transient. Retrying a 400 Bad Request wastes resources because the request will always fail. A good retry policy distinguishes between:

  • Transient errors (5xx responses, 408 timeouts, 429 rate limits, connection errors), which are worth retrying.
  • Permanent errors (most other 4xx responses, such as 400, 401, and 404), which should fail immediately.
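That classification can be sketched as a single predicate:

```python
TRANSIENT_4XX = {408, 429}  # timeout and rate limit: worth retrying

def is_retryable(status: int) -> bool:
    """Return True for failures that are likely transient."""
    if status in TRANSIENT_4XX:
        return True
    if 400 <= status < 500:
        return False        # the request itself is wrong; retrying won't fix it
    return status >= 500    # server-side errors are usually transient
```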

What happens after max retries

When a job exhausts all retry attempts, it should go to a dead letter queue — not disappear. The DLQ preserves the job so you can investigate, fix the root cause, and replay it. Dropping failed jobs silently is the single biggest source of data loss in background processing.
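Putting the pieces together, a retry loop hands exhausted jobs to the DLQ instead of dropping them. A simplified sketch (the in-memory list stands in for a real dead letter queue):

```python
import random
import time

def run_with_retries(job, max_attempts=5, base=30.0, cap=600.0, dead_letter=None):
    """Run job(); on failure, retry with jittered exponential backoff.

    After max_attempts failures the job is appended to dead_letter
    (preserved for inspection and replay), and the error is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == max_attempts:
                if dead_letter is not None:
                    dead_letter.append((job, exc))  # never drop silently
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** (attempt - 1))))
```

A production system would also record the attempt count and last error alongside the dead-lettered job.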

Per-queue vs per-job policies

Some systems let you set retry policies at the queue level (all jobs in this queue share the same policy) or per-job (each job specifies its own retry behavior). Queue-level policies are simpler; per-job policies give finer control.

Recuro uses per-queue retry configuration: set the retry count and the queue handles backoff automatically for every job in that queue.

FAQ

What is a retry policy?

A retry policy is a set of rules that defines how many times a failed job should be retried, how long to wait between attempts, and which types of failures are worth retrying.

How many retries should I configure?

3 to 5 retries is standard for HTTP jobs. Critical operations (payments, data sync) may warrant more. With exponential backoff, 5 retries with a 30-second base delay span about 16 minutes — enough for most transient outages.
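The arithmetic behind that figure, assuming pure doubling from a 30-second base:

```python
# Delays before each of 5 retries: 30, 60, 120, 240, 480 seconds
delays = [30 * 2 ** n for n in range(5)]
total_minutes = sum(delays) / 60  # 15.5 minutes, i.e. "about 16"
```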

Should I retry 4xx errors?

Generally no. 4xx errors mean the request itself is wrong — bad payload, missing auth, wrong URL. Retrying won't fix the problem. The exceptions are 408 (timeout) and 429 (rate limit), which are transient and retryable.

A retry policy works with exponential backoff and jitter to space out attempts and dead letter queues to catch permanent failures. Jobs should be idempotent to survive retries safely — most queue systems provide at-least-once delivery, meaning retried jobs may execute more than once. When an endpoint fails consistently, a circuit breaker can stop retries entirely until the service recovers.

Stop managing infrastructure. Start scheduling jobs.

Recuro handles cron scheduling, retries, alerts, and execution logs, so you can focus on building your product.
