Thundering Herd

Quick Summary — TL;DR

  • The thundering herd problem occurs when many clients simultaneously retry or request the same resource, overwhelming it right when it is trying to recover.
  • Common triggers: cache expiry, service recovery, coordinated retries without jitter, and DNS failover events.
  • Prevent it with jitter on retries, request coalescing, cache locking, staggered expiry times, and circuit breakers.

The thundering herd problem occurs when a large number of clients send requests to the same resource at the same moment — typically right after a failure, cache expiry, or service restart. The sudden flood of traffic overwhelms the resource before it has a chance to recover, often restarting the very outage it was recovering from.

How it happens

The pattern is always the same: something triggers many clients to act in unison. Here are the most common triggers.

Coordinated retries

A service goes down for 5 seconds. During that window, 1,000 clients accumulate failed requests. When the service comes back, all 1,000 retry at once. Without jitter, exponential backoff alone does not help — all clients computed the same delay and retry in synchronized waves. The recovering service gets hit with 10x its normal traffic before it has fully stabilized.

Cache expiry (cache stampede)

A popular cache entry expires. The next 500 requests all miss the cache simultaneously and hit the database to regenerate the value. The database, which normally handles 5 requests per second for this query (because the cache absorbs the rest), suddenly faces 500 concurrent queries. This variant is so common it has its own name: cache stampede.

Service restart

A service restarts after a deployment. Health checks pass, the load balancer starts routing traffic, and the full production load hits a cold service with empty caches and unwarmed connection pools. The service buckles under load and fails its health checks, triggering another restart — a crash loop.

DNS or failover events

A DNS change or failover event redirects all traffic from one server to another. The target server receives the combined load of both servers instantly. If it was already running at 60% capacity, the sudden doubling pushes it past its limits.

Why it is dangerous

The thundering herd is self-reinforcing. The flood of traffic causes failures, which trigger retries, which add more traffic, which causes more failures. Without intervention, this feedback loop can keep a system down for far longer than the original trigger.

It is also deceptive. Each individual client is behaving correctly — retrying a failed request is the right thing to do. The problem is coordination: they all do the right thing at the same time, and the aggregate effect is destructive.

Prevention strategies

Add jitter to retries

Jitter adds randomness to retry delays, breaking synchronization. Instead of 1,000 clients retrying at second 4, they retry at random times between second 0 and second 4. This turns a spike into a smooth ramp. Full jitter is the simplest and most effective approach.
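Full jitter can be sketched in a few lines. This is an illustrative helper, not a specific library's API; the `base` and `cap` parameters are assumptions chosen for the example.

```python
import random

def full_jitter_delay(attempt, base=1.0, cap=30.0):
    """Full jitter: pick a uniformly random delay between 0 and the
    exponential-backoff ceiling for this attempt."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

# Two clients retrying after the same failure now compute different
# delays, so their retries no longer hit the service in the same instant.
```

Each client still backs off exponentially on average, but because the actual delay is drawn at random from the full window, a wave of 1,000 synchronized retries becomes a spread-out ramp.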

Request coalescing

When multiple clients request the same resource simultaneously, only one request actually executes. The rest wait for the result and share it. This is particularly effective for cache stampedes — instead of 500 database queries, you execute 1 and distribute the result to all 500 waiters.
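A minimal coalescer can be built with a lock and a per-key "in flight" table. This is a sketch of the technique (sometimes called singleflight), not production code — error propagation to waiters is omitted for brevity.

```python
import threading

class Coalescer:
    """Deduplicate concurrent calls for the same key: the first caller
    executes fn; everyone else waits for and shares the same result."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done event, result box)

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True
            else:
                leader = False
        event, box = entry
        if leader:
            try:
                box["value"] = fn()  # only the leader does the real work
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()          # wake all waiters
            return box["value"]
        event.wait()                 # follower: wait for the leader's result
        return box["value"]
```

With this in front of the database, 500 simultaneous cache misses for the same key become one query whose result is handed to all 500 callers.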

Cache locking

When a cache entry expires, the first client to request it acquires a lock and regenerates the value. Other clients either wait for the lock to release (and then read the fresh cache) or receive a slightly stale value. This limits the thundering herd to exactly one cache-miss query.
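One way to sketch this: a non-blocking lock per key, where the lock winner regenerates and everyone else serves the stale copy if one exists. The class and method names here are illustrative assumptions.

```python
import threading
import time

class LockingCache:
    """On a miss or expired entry, only the lock holder regenerates;
    other callers serve the stale value if one exists, or wait."""
    def __init__(self, ttl):
        self.ttl = ttl
        self._data = {}   # key -> (value, expires_at)
        self._locks = {}  # key -> per-key regeneration lock
        self._meta = threading.Lock()

    def _lock_for(self, key):
        with self._meta:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key, regenerate):
        now = time.monotonic()
        entry = self._data.get(key)
        if entry and entry[1] > now:
            return entry[0]                    # fresh hit
        lock = self._lock_for(key)
        if lock.acquire(blocking=False):
            try:                               # we won the race: rebuild
                value = regenerate()
                self._data[key] = (value, now + self.ttl)
                return value
            finally:
                lock.release()
        if entry:
            return entry[0]                    # serve the slightly stale value
        with lock:                             # no stale copy: wait for rebuild
            return self._data[key][0]
```

Exactly one caller pays the regeneration cost per expiry; the rest either get the old value immediately or block briefly for the new one.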

Staggered expiry times

Instead of setting every cache entry to expire at exactly 60 seconds, add a random offset: 55 to 65 seconds. This prevents mass simultaneous expiry and spreads cache regeneration over time.
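The 55-to-65-second example above amounts to one line of randomization when writing the entry (the defaults here mirror that example):

```python
import random

def jittered_ttl(base_ttl=60.0, spread=5.0):
    """Return a TTL spread uniformly around the base (55-65s with these
    defaults), so entries written together do not all expire together."""
    return base_ttl + random.uniform(-spread, spread)
```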

Circuit breakers

A circuit breaker detects when a service is failing and stops sending requests entirely. This prevents the retry flood from reaching the recovering service, giving it time to stabilize before traffic is gradually reintroduced via the half-open state.
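The state machine can be sketched as follows. This is a deliberately minimal illustration — real breakers (and the thresholds chosen here) vary; the `clock` parameter is an assumption added to make the sketch testable.

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch. After `threshold` consecutive
    failures the circuit opens and calls fail fast; after `reset_after`
    seconds it goes half-open and lets a trial call through."""
    def __init__(self, threshold=5, reset_after=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # reset window elapsed: half-open, allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

While the circuit is open, retries fail instantly at the caller instead of piling onto the recovering service.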

Gradual traffic ramp-up

After a service restart or failover, route traffic gradually — 10%, then 25%, then 50%, then 100% — instead of sending the full load immediately. This lets the service warm up its caches and connection pools before facing full production traffic.
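The 10/25/50/100 ramp above could be expressed as a step schedule; the timings and helper name here are illustrative assumptions, since real ramps are usually driven by the load balancer or deployment tooling.

```python
def ramp_fraction(elapsed_s,
                  schedule=((0, 0.10), (60, 0.25), (120, 0.50), (180, 1.00))):
    """Fraction of traffic to route to the restarted service, given
    seconds since it came back. Steps and timings are illustrative."""
    frac = 0.0
    for start, f in schedule:
        if elapsed_s >= start:
            frac = f  # latest step whose start time has passed
    return frac
```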

Real-world example

Your application sends webhook notifications via background jobs. The webhook receiver goes down for 2 minutes. During that window, 3,000 jobs fail and enter retry queues. The receiver comes back online. Without jitter, all 3,000 jobs retry within the same second. The receiver — which normally handles 50 requests per second — is hit with 3,000 requests instantly. It crashes again. Jobs fail again. The cycle repeats.

With full jitter and a circuit breaker: the circuit breaker trips after detecting sustained failures, stopping all retries for 60 seconds. When it moves to half-open, 3 test requests go through and succeed. The breaker closes. Now 3,000 jobs retry with jittered delays spread across several minutes. The receiver handles the load comfortably.

FAQ

What is the thundering herd problem?

The thundering herd problem is when many clients simultaneously send requests to the same resource — typically after a failure, cache expiry, or service restart — overwhelming it with a sudden traffic spike. The name comes from a stampede: individually harmless, collectively destructive.

What is a cache stampede?

A cache stampede is a specific form of thundering herd that happens when a popular cache entry expires and many clients simultaneously query the backend to regenerate it. Prevent it with cache locking (only one client regenerates), request coalescing (share the result), or staggered expiry times.

How does jitter prevent thundering herd?

Jitter adds randomness to retry delays so clients do not all retry at the same instant. Instead of a synchronized spike, retries are spread across the delay window. This converts a stampede into a manageable, distributed trickle of requests.

The thundering herd is prevented primarily by adding jitter to exponential backoff delays, ensuring retries are spread over time rather than synchronized. Circuit breakers provide a second layer of defense by halting retries during sustained outages. A well-designed retry policy combines all three — backoff, jitter, and circuit breaking — to make thundering herds far less likely.

Stop managing infrastructure. Start scheduling jobs.

Recuro handles cron scheduling, retries, alerts, and execution logs -- so you can focus on building your product.