Quick Summary — TL;DR
A retry policy defines how your system handles failed jobs: how many times to retry, how long to wait between attempts, and which failures are worth retrying. Without a retry policy, a single transient failure (a network blip, a brief API outage) becomes permanent data loss.
| Setting | What it controls | Typical value |
|---|---|---|
| Max attempts | How many times to retry before giving up | 3 – 5 |
| Backoff strategy | How the delay between retries grows | Exponential backoff |
| Base delay | The initial wait before the first retry | 15 – 60 seconds |
| Max delay | The ceiling on how long any single retry waits | 5 – 30 minutes |
| Retryable errors | Which status codes or error types trigger a retry | 5xx, 408, 429, connection errors |
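Taken together, the settings above might translate into a configuration like the following sketch. The field names are illustrative, not any specific library's API:

```python
# Illustrative retry-policy configuration (field names are hypothetical)
RETRY_POLICY = {
    "max_attempts": 5,                      # give up after 5 attempts
    "backoff": "exponential",               # delay doubles each attempt
    "base_delay_seconds": 30,               # first retry waits 30s
    "max_delay_seconds": 600,               # cap any single wait at 10 minutes
    "retryable_statuses": {408, 429, 500, 502, 503, 504},
}
```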
**Fixed interval:** Retry every N seconds regardless of attempt number. Simple but dangerous: evenly spaced retries can overload a server that is still recovering.
**Exponential backoff:** Double the delay between each attempt (e.g., 15s, 30s, 60s, 120s). This is the standard for most background job systems. See exponential backoff for details.
**Exponential backoff with jitter:** Add randomness to the exponential delay to prevent synchronized retries across many jobs, a failure pattern known as the thundering herd problem. This is the best general-purpose strategy.
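The jittered strategy can be sketched in a few lines, assuming a 15-second base delay and a 10-minute cap (the function name is illustrative):

```python
import random

BASE_DELAY = 15   # seconds before the first retry
MAX_DELAY = 600   # ceiling on any single wait (10 minutes)

def backoff_delay(attempt: int) -> float:
    """Exponential backoff with full jitter for a 1-indexed attempt number."""
    exponential = min(MAX_DELAY, BASE_DELAY * 2 ** (attempt - 1))  # 15, 30, 60, ...
    return random.uniform(0, exponential)  # jitter spreads retries apart

```

Drawing the delay uniformly from zero up to the exponential value ("full jitter") means two jobs that failed at the same instant almost never retry at the same instant.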
Not all failures are transient. Retrying a 400 Bad Request wastes resources because the request will always fail. A good retry policy distinguishes between:

- **Transient failures** (5xx server errors, 408 timeouts, 429 rate limits, connection errors) that are worth retrying
- **Permanent failures** (most other 4xx errors, such as 400 Bad Request) that will fail on every attempt
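One way to encode that distinction for HTTP jobs (the function name and status set are a sketch, not a complete classification):

```python
# Transient statuses worth retrying: server errors, timeouts, rate limits
RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}

def should_retry(status_code: int) -> bool:
    """Retry transient failures; treat other 4xx as permanent."""
    if status_code in RETRYABLE_STATUSES:
        return True
    return status_code >= 500  # any other 5xx is assumed transient

```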
When a job exhausts all retry attempts, it should go to a dead letter queue — not disappear. The DLQ preserves the job so you can investigate, fix the root cause, and replay it. Dropping failed jobs silently is the single biggest source of data loss in background processing.
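That flow can be sketched with an in-memory queue standing in for a real DLQ (all names here are illustrative):

```python
from collections import deque

MAX_ATTEMPTS = 5
dead_letter_queue = deque()  # failed jobs are preserved here, never dropped

def handle_failure(job: dict) -> None:
    """Count the failed attempt; park the job in the DLQ once retries run out."""
    job["attempts"] = job.get("attempts", 0) + 1
    if job["attempts"] >= MAX_ATTEMPTS:
        dead_letter_queue.append(job)  # keep it for inspection and replay
    # otherwise: re-enqueue with backoff (omitted in this sketch)

```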
Some systems let you set retry policies at the queue level (all jobs in this queue share the same policy) or per-job (each job specifies its own retry behavior). Queue-level policies are simpler; per-job policies give finer control.
Recuro uses per-queue retry configuration: set the retry count and the queue handles backoff automatically for every job in that queue.
A retry policy is a set of rules that defines how many times a failed job should be retried, how long to wait between attempts, and which types of failures are worth retrying.
3 to 5 retries is standard for HTTP jobs. Critical operations (payments, data sync) may warrant more. With exponential backoff, 5 retries with a 30-second base delay span about 16 minutes — enough for most transient outages.
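The 16-minute figure follows from summing the exponential delays. A quick check, assuming no jitter and no delay cap:

```python
BASE_DELAY = 30  # seconds

delays = [BASE_DELAY * 2 ** i for i in range(5)]  # 30, 60, 120, 240, 480
total_minutes = sum(delays) / 60                  # 930 seconds -> 15.5 minutes
```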
Generally no. 4xx errors mean the request itself is wrong — bad payload, missing auth, wrong URL. Retrying won't fix the problem. The exceptions are 408 (timeout) and 429 (rate limit), which are transient and retryable.
A retry policy works with exponential backoff and jitter to space out attempts and dead letter queues to catch permanent failures. Jobs should be idempotent to survive retries safely — most queue systems provide at-least-once delivery, meaning retried jobs may execute more than once. When an endpoint fails consistently, a circuit breaker can stop retries entirely until the service recovers.
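Idempotency under at-least-once delivery can be sketched with a processed-ID set. In production this would live in durable storage; the names here are illustrative:

```python
processed_ids: set[str] = set()   # stand-in for a durable store
results: list[str] = []           # side effect we must not duplicate

def handle(job_id: str, payload: str) -> None:
    """Process each job at most once, even if the queue delivers it twice."""
    if job_id in processed_ids:
        return  # duplicate delivery from a retry: safely ignored
    results.append(payload)       # the actual work goes here
    processed_ids.add(job_id)

```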
Recuro handles cron scheduling, retries, alerts, and execution logs, so you can focus on building your product.
No credit card required