A cache stampede (also called thundering herd or dogpile) happens when a hot cache key expires and a flood of concurrent requests all miss the cache at the same instant. Every one of them races to the DB to rebuild the entry. Your DB melts.
In simple language — imagine 10,000 people at a closed store waiting for it to open. The doors unlock and they all stampede inside at once. The shelves get cleared, the staff gets overwhelmed, chaos. That’s your DB when a hot key expires.
The problem
Stampede on key expiry
t=0 cache key "homepage" expires
t=0.01 req1 misses → SELECT * FROM ...
t=0.02 req2 misses → SELECT * FROM ...
t=0.03 req3 misses → SELECT * FROM ...
...
t=0.05 req1000 misses → SELECT * FROM ...
DB receives 1000 identical queries in 50ms.
Connection pool exhausted. Latency spikes. App goes down.
The pathological case — the query takes 2 seconds. While the first request is still computing, 100,000 more requests arrive, all of them miss, all of them queue up to do the exact same expensive query. Now your DB has to do it 100,000 times instead of once.
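To make the numbers concrete, here is a minimal simulation sketch. Everything in it is a stand-in: a dict plays the cache, `slow_query` plays the database, and a `threading.Barrier` models "every request misses at the same instant". None of these names come from a real library setup; they are assumptions for illustration.

```python
import threading
import time


def simulate_stampede(n_requests=50, query_seconds=0.05):
    """Return how many times the 'database' is hit when n requests miss together."""
    cache = {}                                   # stand-in for Redis
    db_calls = 0
    counter_lock = threading.Lock()
    all_missed = threading.Barrier(n_requests)   # releases once every request has missed

    def slow_query():
        nonlocal db_calls
        with counter_lock:
            db_calls += 1
        time.sleep(query_seconds)                # pretend this is the expensive SELECT
        return "<html>homepage</html>"

    def handle_request():
        if cache.get("homepage") is None:
            all_missed.wait()                    # everyone has seen the miss by now
            cache["homepage"] = slow_query()     # ...so everyone rebuilds

    threads = [threading.Thread(target=handle_request) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return db_calls


if __name__ == "__main__":
    print("DB calls for 50 concurrent misses:", simulate_stampede(50))
```

With no protection in place, the number of database calls equals the number of requests that missed. Every fix below is a way to push that number back toward one.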
Three solid solutions. Let’s walk through each.
Fix 1: Mutex / Lock
Only one request is allowed to rebuild the cache at a time. Everyone else either waits or serves stale.
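import time
import uuid

# assumes `redis` is a redis-py client, `db` a database handle,
# and `fallback` a stale copy or placeholder defined elsewhere

# delete the lock only if it still holds our token (atomic compare-and-delete)
RELEASE_LOCK = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""

def get_homepage():
    data = redis.get("homepage")
    if data is not None:
        return data
    # try to acquire a short-lived lock, tagged with a token only we know
    token = uuid.uuid4().hex
    if redis.set("lock:homepage", token, nx=True, ex=10):
        try:
            data = db.query("...")  # expensive
            redis.set("homepage", data, ex=60)
            return data
        finally:
            # a plain DELETE could remove a lock that already expired
            # and was re-acquired by another client
            redis.eval(RELEASE_LOCK, 1, "lock:homepage", token)
    else:
        # someone else is rebuilding: wait briefly, then re-read or fall back
        time.sleep(0.05)
        return redis.get("homepage") or fallback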
SET key value NX EX 10 is atomic — only one client gets the lock. The 10-second TTL on the lock is a safety net in case the rebuilder crashes.
Pros: simple, dramatically reduces DB load.
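Cons: other clients block or get nothing. If the lock holder crashes, everyone waits for the lock TTL to run out. If the rebuild takes longer than the lock TTL, a second client can grab the lock and rebuild in parallel, so size the TTL well above the query time.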
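To see the lock pattern end to end without a Redis server, here is a sketch with in-memory stand-ins: a dict for the cache and a non-blocking `threading.Lock` in place of `SET ... NX` (so there is no lock TTL here). It also adds two things that are my assumptions rather than part of the snippet above: a re-check of the cache after acquiring the lock, and a polling loop with a deadline instead of a single sleep.

```python
import threading
import time

PAGE = "<html>homepage</html>"


def run_locked_demo(n_requests=50, query_seconds=0.05):
    """Return (db_calls, results) for n concurrent requests guarded by a lock."""
    cache = {}                         # stand-in for Redis
    rebuild_lock = threading.Lock()    # stand-in for SET lock:homepage ... NX
    counter_lock = threading.Lock()
    db_calls = 0
    results = []

    def slow_query():
        nonlocal db_calls
        with counter_lock:
            db_calls += 1
        time.sleep(query_seconds)
        return PAGE

    def get_homepage(max_wait=5.0, poll=0.005):
        deadline = time.monotonic() + max_wait
        while True:
            data = cache.get("homepage")
            if data is not None:
                return data
            if rebuild_lock.acquire(blocking=False):
                try:
                    # re-check: the cache may have been filled between
                    # our miss and our lock acquisition
                    data = cache.get("homepage")
                    if data is None:
                        data = slow_query()
                        cache["homepage"] = data
                    return data
                finally:
                    rebuild_lock.release()
            if time.monotonic() >= deadline:
                return None            # caller decides: stale copy, error page, ...
            time.sleep(poll)           # someone else is rebuilding, poll again

    def handle_request():
        page = get_homepage()
        with counter_lock:
            results.append(page)

    threads = [threading.Thread(target=handle_request) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return db_calls, results


if __name__ == "__main__":
    calls, pages = run_locked_demo()
    print("DB calls:", calls, "responses:", len(pages))
```

The re-check is what keeps the count at exactly one: a request that missed just before the rebuilder finished can still win the lock afterwards, and without the second look it would rebuild again.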
Fix 2: Probabilistic Early Expiry (XFetch)
The clever one. Instead of waiting for the key to actually expire, each reader rolls a die: the closer the key is to expiry, the more likely that reader is to refresh it early. By the time the key truly expires, it’s usually already been refreshed by some lucky reader.
The XFetch algorithm:
import math
import random
import time

def get_with_xfetch(key, ttl, beta=1.0):
    value, computed_at, delta = redis.hmget(key, "value", "computed_at", "delta")
    now = time.time()
    # a missing key (all fields None) counts as already expired
    refresh = value is None
    if not refresh:
        expiry = float(computed_at) + ttl
        # delta = how long the recompute took last time
        # 1 - random() lies in (0, 1], so log() never sees 0
        offset = -float(delta) * beta * math.log(1.0 - random.random())
        # we randomly trigger early refresh near expiry
        refresh = now + offset >= expiry
    if refresh:
        # we volunteered to refresh (or the key was missing)
        start = time.time()
        value = db.query("...")
        finished = time.time()
        redis.hset(key, mapping={
            "value": value,
            "computed_at": finished,
            "delta": finished - start,
        })
        redis.expire(key, ttl)
    return value
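The math: -delta * beta * log(random()) is a non-negative random offset (the log of a number between 0 and 1 is negative, hence the minus sign), exponentially distributed with mean delta * beta. When now + offset >= expiry, the request refreshes. The closer now is to expiry, the more likely the offset wins the comparison, and a slow recompute (large delta) or a larger beta makes refreshes start earlier. So refreshes happen just before expiry, usually triggered by one or a few readers rather than the whole herd.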
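It helps to put numbers on that die roll. Assuming random() is uniform on (0, 1), the chance that a single read volunteers when the key still has `gap` seconds left works out to exp(-gap / (delta * beta)). The sketch below computes that closed form next to the decision rule itself; the function names and the injectable `rng` parameter are mine, added for illustration and testing.

```python
import math
import random


def early_refresh_probability(seconds_to_expiry, delta, beta=1.0):
    """Chance that one read triggers a refresh with this much TTL left."""
    if seconds_to_expiry <= 0:
        return 1.0                      # already expired: always refresh
    return math.exp(-seconds_to_expiry / (delta * beta))


def should_refresh(now, expiry, delta, beta=1.0, rng=random.random):
    """The XFetch comparison, with the random source injectable for testing."""
    # 1 - rng() lies in (0, 1], so log() is always defined and <= 0
    offset = -delta * beta * math.log(1.0 - rng())
    return now + offset >= expiry


if __name__ == "__main__":
    delta = 0.5                         # assume the last recompute took 500 ms
    for gap in (10.0, 2.0, 1.0, 0.5, 0.1):
        p = early_refresh_probability(gap, delta)
        print(f"{gap:>5.1f}s before expiry -> P(refresh) = {p:.4f}")
```

With a 500 ms recompute, a read ten seconds before expiry almost never refreshes, while a read 100 ms before expiry usually does. Raising beta above 1 shifts the whole curve earlier.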
Pros: no locks, no blocking, naturally spreads refresh load.
Cons: more complex, you do slightly more refresh work overall.
Fix 3: Request Coalescing (Single-Flight)
If 1000 requests for the same key arrive at the same process at the same time, only let one of them actually hit the DB. The others wait on the in-flight result and share it.
Single-flight coalescing
req1 ─┐
req2 ─┤
req3 ─┼──▶ ONE actual DB call ──▶ shared result
req4 ─┤
req5 ─┘
// Node.js single-flight using a Map of in-flight promises
const inFlight = new Map();

async function getCached(key) {
  const cached = await redis.get(key);
  if (cached != null) return cached;

  if (inFlight.has(key)) return inFlight.get(key); // share the promise

  const promise = (async () => {
    const value = await db.query(/* ... */);
    await redis.set(key, value, "EX", 60);
    return value;
  })();
  inFlight.set(key, promise);

  try {
    return await promise;
  } finally {
    // clean up only after the entry is registered, so even a query that
    // throws synchronously cannot leave a dead promise in the Map
    inFlight.delete(key);
  }
}
Go’s singleflight package does exactly this. It only helps within one process — across many app servers, you’d still need a distributed lock (Fix 1).
Pros: no extra infrastructure and no network round trip; it is just an in-memory map inside each process.
Cons: doesn’t help cross-process. Pair with a lock for full coverage.
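The same idea carries over to Python's asyncio. This is a rough sketch, not a port of any particular library: the `SingleFlight` class and `load_homepage` loader are hypothetical names, and the Redis lookup is left out so the example runs on its own.

```python
import asyncio


class SingleFlight:
    """Collapse concurrent calls for the same key into one in-flight task."""

    def __init__(self):
        self._in_flight = {}

    async def do(self, key, loader):
        task = self._in_flight.get(key)
        if task is None:
            # first caller for this key: start the real work
            task = asyncio.create_task(loader())
            self._in_flight[key] = task
            # forget the task once it settles, whether it succeeded or failed
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # shield: one cancelled waiter should not cancel the shared work
        return await asyncio.shield(task)


async def demo(n_requests=100):
    calls = 0

    async def load_homepage():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.05)       # pretend this is the expensive query
        return "<html>homepage</html>"

    flight = SingleFlight()
    results = await asyncio.gather(
        *(flight.do("homepage", load_homepage) for _ in range(n_requests))
    )
    return calls, results


if __name__ == "__main__":
    calls, results = asyncio.run(demo())
    print("loader calls:", calls, "responses:", len(results))
```

All callers that arrive while the task is pending await the same task, so the loader runs once per burst. A request that arrives after the task has finished starts a fresh load, which is why this sits behind the cache lookup rather than replacing it.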
Bonus fixes
- Stale-while-revalidate — serve the expired value to clients while a background job refreshes it. Users see slightly stale data instead of waiting.
- Pre-warm before expiry — a cron job rebuilds hot keys before they expire. Predictable load.
- Jittered TTLs — if many keys were populated together (after a deploy), they’ll all expire together. Add randomness:
ex=60 + random.randint(0, 30).
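The jitter idea fits in a one-line helper. A minimal sketch, with `jittered_ttl` and its parameters being names I made up; the commented `redis.set` call assumes a redis-py style client like the one used earlier in this post.

```python
import random


def jittered_ttl(base_seconds=60, max_jitter_seconds=30, rng=random):
    """Base TTL plus a random extra of 0..max_jitter_seconds (inclusive)."""
    return base_seconds + rng.randint(0, max_jitter_seconds)


if __name__ == "__main__":
    # keys populated together now expire spread over a 30-second window
    for key in ("homepage", "sidebar", "footer"):
        print(key, "expires in", jittered_ttl(), "seconds")
        # redis.set(key, value, ex=jittered_ttl())
```

The jitter window is a trade-off: a wider window spreads expirations out more, but also makes the effective TTL less predictable.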
Which to pick?
Single hot key, single app server:
→ Single-flight (Fix 3)
Single hot key, many app servers:
→ Distributed lock (Fix 1) — plus single-flight per-process for free
Many hot keys, want minimum complexity:
→ Probabilistic early expiry (Fix 2) or jittered TTLs
Read-heavy site where staleness is fine:
→ Stale-while-revalidate
In real systems you’d often combine — single-flight in-process, distributed lock cross-process, and jittered TTLs as a baseline defense. Stampedes are a “you don’t notice until production” kind of bug. Worth designing for upfront.