Distributed Locks (Redlock)


You have a job that must run on exactly one server at a time. Or you’re decrementing inventory and don’t want two processes selling the last item. You need a distributed lock — a flag in a shared place that says “I’m in, you wait”.

Redis is the obvious tool for this. It’s fast, it’s already in your stack, and the pattern looks deceptively simple: set a key with NX (only if it doesn’t exist) and a TTL. While the key exists, the resource is locked. When the work is done, delete the key.

But this topic is famous for one reason: Martin Kleppmann’s critique of Redlock. Knowing this debate signals senior-level understanding. We’ll cover the simple pattern, the Redlock algorithm, and why it’s controversial.

The basic single-instance lock

# Acquire
SET lock:order:123 "owner-uuid" NX EX 30
# OK if acquired, nil if someone else holds it

# Do work...

# Release - DON'T just DEL, see below

Three critical pieces:

  • NX — only set if not exists (atomic check-and-set)
  • EX 30 — auto-expire so a crashed holder doesn’t deadlock the system
  • A unique value (UUID) — so we only delete our own lock

Why the unique value matters

Imagine: you acquire the lock with 30s TTL. Your work takes 35s (GC pause, slow query, whatever). The lock expired at 30s. Someone else acquired it at 31s. At 35s you finish and call DEL lock:order:123 — you just deleted their lock. Now a third client can acquire it. Chaos.

Fix: only delete if the value is still ours. This needs Lua because GET-then-DEL isn’t atomic:

-- KEYS[1] = lock key, ARGV[1] = our UUID
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
end
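return 0

// Application side, assuming an ioredis-style client.
// safeReleaseScript holds the Lua source above as a string.
const token = crypto.randomUUID();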
const acquired = await redis.set(`lock:order:${id}`, token, "NX", "EX", 30);
if (!acquired) return false;

try {
  await doCriticalWork();
} finally {
  await redis.eval(safeReleaseScript, 1, `lock:order:${id}`, token);
}

This pattern is good enough for most real-world use cases — leader election, cron deduplication, cache rebuild coordination.
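
To see why the token check matters, the 30s/31s/35s timeline described earlier can be replayed against an in-memory stand-in. This is only a sketch: `FakeRedis`, its method names, and the manually advanced clock are hypothetical and mimic just enough of SET NX EX, DEL, and the Lua release script to show the difference.

```javascript
// In-memory stand-in for a single Redis instance (hypothetical, for illustration only).
class FakeRedis {
  constructor() {
    this.now = 0;            // simulated clock, in seconds
    this.store = new Map();  // key -> { value, expiresAt }
  }
  // Drop the key if its TTL has passed, then return the live entry (or undefined).
  live(key) {
    const entry = this.store.get(key);
    if (entry && entry.expiresAt <= this.now) {
      this.store.delete(key);
      return undefined;
    }
    return entry;
  }
  // Mimics SET key value NX EX ttl: returns "OK" or null.
  setNxEx(key, value, ttl) {
    if (this.live(key)) return null;
    this.store.set(key, { value, expiresAt: this.now + ttl });
    return "OK";
  }
  // Mimics a bare DEL: removes the key no matter who owns it.
  del(key) {
    return this.live(key) && this.store.delete(key) ? 1 : 0;
  }
  // Mimics the Lua release script: delete only if the value matches.
  releaseIfOwner(key, token) {
    const entry = this.live(key);
    if (entry && entry.value === token) {
      this.store.delete(key);
      return 1;
    }
    return 0;
  }
}

// Replays the timeline with a pluggable release strategy.
function replay(release) {
  const redis = new FakeRedis();
  const key = "lock:order:123";
  redis.setNxEx(key, "client-A", 30);         // t=0: A acquires, TTL 30s
  redis.now = 31;
  redis.setNxEx(key, "client-B", 30);         // t=31: A's lock has expired, B acquires
  redis.now = 35;
  release(redis, key, "client-A");            // t=35: A finishes and releases
  return redis.setNxEx(key, "client-C", 30);  // can C get in while B is working?
}

const unsafeResult = replay((r, key) => r.del(key));
const safeResult = replay((r, key, token) => r.releaseIfOwner(key, token));
console.log(unsafeResult); // "OK": C acquired while B still believes it holds the lock
console.log(safeResult);   // null: B's lock survived A's late release
```

Note that the safe release only protects B's lock. Client A still overran its TTL and worked unprotected from 30s to 35s, which is the problem the later sections deal with.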

Redlock: the multi-instance algorithm

A single Redis is a single point of failure. If it dies, no one can lock. Replicas don’t fully help because replication is asynchronous — a master can ack a lock then die before replicating it, and a promoted replica won’t know.
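
The failover problem can be shown with a toy model. In this sketch `RedisNode` and `pendingReplication` are hypothetical names, TTLs are left out for brevity, and "replication" is just a queue that gets dropped when the master crashes.

```javascript
// Toy model of a master with one asynchronous replica (hypothetical names).
class RedisNode {
  constructor() { this.keys = new Map(); }
  // Mimics SET key value NX (TTL omitted for brevity).
  setNx(key, value) {
    if (this.keys.has(key)) return null;
    this.keys.set(key, value);
    return "OK";
  }
}

const master = new RedisNode();
const replica = new RedisNode();
const pendingReplication = []; // writes acked by the master but not yet shipped

// Client A locks on the master; the write is acked before it is replicated.
const a = master.setNx("lock:job", "client-A");
pendingReplication.push(["lock:job", "client-A"]);

// The master crashes here: the pending write is lost and the replica is promoted.
pendingReplication.length = 0;
const promoted = replica;

// Client B asks the new master, which has never seen A's lock.
const b = promoted.setNx("lock:job", "client-B");

console.log(a, b); // OK OK: both clients believe they hold the lock
```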

Antirez (Redis’s creator) proposed Redlock: acquire the lock on a majority of N independent Redis masters (typically 5).

Example: Redlock acquire across 5 masters (majority needed = 3)

  R1 OK | R2 OK | R3 OK | R4 FAIL | R5 FAIL

3 of 5 succeeded = majority → lock granted

Steps:

  1. Get current time T1.
  2. Try SET NX EX on each of the N masters, using the same key and the same unique value, with a per-instance timeout that is small relative to the TTL (~5-50ms).
  3. Lock acquired only if a majority succeeded AND total time elapsed < TTL.
  4. Effective lock validity = original TTL - elapsed time - clock drift.
  5. If failed, release on all instances (even ones that failed — they might have succeeded silently).
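
The steps above can be sketched against in-memory instances. Everything here is an assumption made for illustration: `makeInstance`, `redlockAcquire`, the simulated `clock` object, and the drift allowance of 1% of the TTL plus 2ms. A real client would talk to real Redis masters over the network and use a monotonic clock.

```javascript
// Hypothetical in-memory instance: `latencyMs` simulates the round trip,
// `up: false` simulates a node that is down or times out.
function makeInstance({ up = true, latencyMs = 2 } = {}) {
  const keys = new Map();
  return {
    latencyMs,
    trySet(key, token) {   // stands in for SET key token NX with a TTL
      if (!up || keys.has(key)) return false;
      keys.set(key, token);
      return true;
    },
    release(key, token) {  // stands in for the compare-and-delete script
      if (keys.get(key) === token) keys.delete(key);
    },
    holds: (key) => keys.has(key),
  };
}

function redlockAcquire(instances, key, token, ttlMs, clock) {
  const start = clock.now;                                 // step 1
  let granted = 0;
  for (const inst of instances) {                          // step 2
    clock.now += inst.latencyMs;                           // simulated round trip
    if (inst.trySet(key, token)) granted += 1;
  }
  const elapsed = clock.now - start;
  const drift = Math.floor(ttlMs * 0.01) + 2;              // assumed drift allowance
  const validityMs = ttlMs - elapsed - drift;              // step 4
  const majority = Math.floor(instances.length / 2) + 1;
  // Step 3: majority reached and time left on the lock (drift included).
  if (granted >= majority && validityMs > 0) {
    return { acquired: true, validityMs };
  }
  for (const inst of instances) inst.release(key, token);  // step 5
  return { acquired: false, validityMs: 0 };
}

// Three healthy masters and two that time out after 50ms each.
const five = [
  makeInstance(), makeInstance(), makeInstance(),
  makeInstance({ up: false, latencyMs: 50 }),
  makeInstance({ up: false, latencyMs: 50 }),
];
const ok = redlockAcquire(five, "lock:order:123", "client-A", 10000, { now: 0 });
console.log(ok); // { acquired: true, validityMs: 9792 }

// Only two healthy masters: no majority, so the partial locks are released.
const mostlyDown = [
  makeInstance(), makeInstance(),
  makeInstance({ up: false }), makeInstance({ up: false }), makeInstance({ up: false }),
];
const failed = redlockAcquire(mostlyDown, "lock:order:123", "client-B", 10000, { now: 0 });
console.log(failed.acquired, mostlyDown[0].holds("lock:order:123")); // false false
```

In the first run the elapsed time is 2 + 2 + 2 + 50 + 50 = 106ms and the drift allowance is 102ms, so 9792ms of the 10000ms TTL remain usable. The caller should treat that number, not the original TTL, as its working budget.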

The Kleppmann critique

In 2016, distributed systems researcher Martin Kleppmann published “How to do distributed locking” arguing Redlock is unsafe for any use case requiring correctness. The key arguments:

1. Process pauses break it. A client acquires the lock, gets paused (GC, swapping, VM migration) for longer than the TTL, then resumes and continues work — thinking it still holds the lock. Meanwhile someone else grabbed it. Now two clients act as lock holders.

2. Clock drift breaks it. Redlock assumes bounded clock skew. If one Redis server’s clock jumps forward (NTP correction, VM clock weirdness), its key expires early and the algorithm’s safety property collapses.

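3. Network delays break it. The effect is the same as a pause: a request sent while the lock is still valid can be held up in the network and reach the resource after the TTL has passed.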
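
The clock-jump scenario in point 2 can be replayed with a toy model in which each instance expires keys by its own local clock. `ClockedInstance` and `grab` are hypothetical names, and the partition (client A reaching only R1-R3, client B only R3-R5) is assumed for the sake of the example.

```javascript
// Hypothetical instance with its own clock; keys expire by that local clock.
class ClockedInstance {
  constructor() { this.clockMs = 0; this.keys = new Map(); }
  trySet(key, token, ttlMs) {
    const entry = this.keys.get(key);
    if (entry && entry.expiresAt > this.clockMs) return false; // still locked
    this.keys.set(key, { token, expiresAt: this.clockMs + ttlMs });
    return true;
  }
}

// Count how many of the given instances accept the lock.
const grab = (instances, key, token, ttlMs) =>
  instances.filter((inst) => inst.trySet(key, token, ttlMs)).length;

const [r1, r2, r3, r4, r5] = Array.from({ length: 5 }, () => new ClockedInstance());

// Client A reaches R1-R3 only: 3 of 5, a majority.
const aCount = grab([r1, r2, r3], "lock:x", "client-A", 10000);

// R3's clock jumps forward past the TTL, so R3 considers A's key expired.
r3.clockMs += 15000;

// Client B reaches R3-R5 only: also 3 of 5, while A's lock is nominally still valid.
const bCount = grab([r3, r4, r5], "lock:x", "client-B", 10000);

console.log(aCount, bCount); // 3 3
```

Both clients now hold what looks like a valid majority, even though no real time has passed on R1 and R2.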

Kleppmann’s prescription: locks for correctness need a fencing token, a monotonically increasing number returned with each lock grant. Every downstream operation must include the token, and the storage system rejects writes carrying a token older than one it has already seen. Redlock has no built-in way to generate such tokens.
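
A minimal sketch of the fencing idea, assuming a lock service that can issue increasing numbers and a storage layer that checks them. `LockService` and `FencedStorage` are hypothetical; lock expiry and ownership tracking are omitted so the token check stays visible.

```javascript
// Hypothetical lock service that hands out a monotonically increasing token.
class LockService {
  constructor() { this.counter = 0; }
  acquire() { return ++this.counter; } // expiry and ownership omitted for brevity
}

// Hypothetical storage that remembers the highest token it has accepted.
class FencedStorage {
  constructor() { this.highestToken = 0; this.value = null; }
  write(token, value) {
    if (token < this.highestToken) return false; // stale holder: reject
    this.highestToken = token;
    this.value = value;
    return true;
  }
}

const locks = new LockService();
const storage = new FencedStorage();

const tokenA = locks.acquire();                 // client A gets token 1, then pauses past the TTL
const tokenB = locks.acquire();                 // lock expired, client B gets token 2
const bWrite = storage.write(tokenB, "from B"); // accepted
const aWrite = storage.write(tokenA, "from A"); // A wakes up: token 1 < 2, rejected

console.log(bWrite, aWrite, storage.value); // true false from B
```

The safety comes from the storage layer, not the lock: a paused client can still believe it holds the lock, but its late write can no longer land.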

His summary: use Redis locks for efficiency (avoiding duplicate work), and use something like ZooKeeper or etcd for correctness (preventing concurrent access to shared resources).

Antirez pushed back, arguing the assumptions Kleppmann attacked aren’t unique to Redlock and apply to any system without fencing tokens — and that real-world Redlock with sane clocks is fine. Worth reading both posts.

Practical guidance

For a typical interview answer:

  • For most app-level use (cron deduplication, cache rebuild, queue worker coordination), the simple SET NX EX + UUID + Lua-release pattern is fine.
  • For multi-instance HA, Redlock works, but only under its timing assumptions: bounded clock drift, bounded process pauses, and bounded network delay.
  • When correctness is critical (financial transactions, exclusive access to a resource), use fencing tokens at the resource layer or pick a CP system like etcd / ZooKeeper.
  • Always set a TTL. Always release with a token check. Always have a plan for what happens when the lock expires mid-work.
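
One possible plan for a lock that expires mid-work is a local lease check before each step. The sketch below is an assumption, not a standard API: `makeLease`, `runSteps`, and the 2s safety margin are hypothetical, and the clock is injected so the example runs deterministically. In real use the lease should start from a timestamp taken before the SET was sent.

```javascript
// Hypothetical helper: tracks how long the caller may still assume it holds the lock.
function makeLease(ttlMs, marginMs, now = Date.now) {
  const deadline = now() + ttlMs - marginMs;
  return {
    remainingMs: () => deadline - now(),
    assertHeld() {
      if (now() >= deadline) throw new Error("lock lease expired, aborting");
    },
  };
}

// Run steps in order, checking the lease before each one.
function runSteps(lease, steps) {
  const done = [];
  for (const step of steps) {
    lease.assertHeld();
    done.push(step());
  }
  return done;
}

// Simulated clock: each step advances `t` by the time it "takes".
let t = 0;
const lease = makeLease(30000, 2000, () => t);
let outcome;
try {
  runSteps(lease, [
    () => { t += 10000; return "load"; },
    () => { t += 25000; return "compute"; }, // slow step overruns the lease
    () => "commit",
  ]);
  outcome = "committed";
} catch (err) {
  outcome = err.message;
}
console.log(outcome); // lock lease expired, aborting
```

This narrows the window but does not close it: a pause between the last check and the commit still gets through, which is why correctness-critical paths need fencing at the resource.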

Quick recap

  • Single-instance: SET key uuid NX EX ttl + Lua release. Good enough most of the time.
  • Redlock: majority acquire across N masters. Better availability, complex assumptions.
  • Kleppmann’s point: without fencing tokens, no distributed lock is safe against process pauses. Use Redis for efficiency, stronger systems for correctness.
  • TTL + unique token + Lua release. Always.