Distributed Consensus


Here’s a problem that sounds simple but is incredibly hard: how do a bunch of servers agree on something? Which node is the leader? Is this transaction committed? What’s the current configuration? When networks are unreliable and servers can crash at any moment, getting everyone on the same page is one of the hardest problems in computer science.

That’s what distributed consensus solves.

The Problem

Imagine we have 5 database replicas. A write comes in. We need all (or most) of them to agree that the write happened. Sounds easy — just tell all 5, right?

But what if:

  • Node 3 is temporarily unreachable (network partition)
  • Node 5 crashed and restarted with stale data
  • Two nodes both think they’re the leader
  • Messages arrive out of order

Without a consensus protocol, we’d end up with nodes disagreeing about the state of the world. One node thinks the value is “A,” another thinks it’s “B.” That’s data corruption.

Leader Election

Most consensus systems work by electing a leader. One node is in charge. It receives all writes, decides the order of operations, and tells followers what to do. If the leader dies, a new one is elected.

Why not just let any node accept writes? Because coordinating writes across multiple nodes simultaneously is much harder than funneling everything through one node that makes decisions.

The tricky part: electing a leader that everyone agrees on, especially when the network is flaky.

Raft Consensus Algorithm

Raft was designed to be understandable (unlike Paxos, the OG consensus algorithm that’s famously confusing). Many modern systems, including etcd, Consul, and CockroachDB, use Raft or something based on it.

The Three States

Every node in a Raft cluster is in one of three states:

  • Follower: passive. Listens to the leader.
  • Candidate: requesting votes.
  • Leader: in charge. Sends heartbeats.

Raft: Leader Election Flow

  • Follower → Candidate (election timeout, no heartbeat)
  • Candidate → Leader (gets majority votes)
  • Candidate → Follower (loses: discovers a current leader or a higher term)
  • Candidate → Candidate (election times out with no winner, so it starts a new term)
  • Leader → Follower (discovers higher term)
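To make the flow concrete, here is a minimal sketch of the transitions as a lookup table. The event names (`election_timeout`, `won_majority`, and so on) are labels invented for this example, not part of any Raft library. One detail from the Raft paper that the sketch makes explicit: a candidate whose election times out without a winner stays a candidate and tries again in a new term.

```python
from enum import Enum


class State(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


# Event names are hypothetical labels for the arrows in the flow above.
TRANSITIONS = {
    (State.FOLLOWER, "election_timeout"): State.CANDIDATE,
    (State.CANDIDATE, "won_majority"): State.LEADER,
    (State.CANDIDATE, "lost_election"): State.FOLLOWER,
    # No winner in time: stay a candidate and retry in a new term.
    (State.CANDIDATE, "election_timeout"): State.CANDIDATE,
    (State.LEADER, "saw_higher_term"): State.FOLLOWER,
}


def next_state(state: State, event: str) -> State:
    """Return the state after `event`; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


if __name__ == "__main__":
    state = State.FOLLOWER
    for event in ["election_timeout", "won_majority", "saw_higher_term"]:
        state = next_state(state, event)
        print(event, "->", state.value)
```

A table like this keeps the legal transitions in one place, which makes it easy to see that a follower can never jump straight to leader.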

How Election Works (Simplified)

  1. All nodes start as followers
  2. Followers expect regular heartbeats from the leader
  3. If a follower doesn’t hear from the leader for a while (election timeout), it assumes the leader is dead
  4. It becomes a candidate and starts a new term (election round)
  5. It votes for itself and asks other nodes for their vote
  6. If it gets votes from a majority (quorum), it becomes the new leader
  7. The new leader starts sending heartbeats, and everyone falls in line

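The election timeout is randomised (e.g., 150-300ms) so that not all followers become candidates at the same time. This makes split votes unlikely, and when one does happen, the candidates simply time out and retry with fresh random timeouts.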
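The election steps can be sketched in a few lines. This is a simplified, in-memory illustration: the `Node` class, `handle_vote_request`, and `run_election` are names made up for this example, there is no real networking, and it leaves out the check Raft makes that a candidate’s log is at least as up to date as the voter’s.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


def election_timeout_ms(low: int = 150, high: int = 300) -> float:
    """Pick a randomised timeout so followers rarely time out together."""
    return random.uniform(low, high)


@dataclass
class Node:
    node_id: int
    current_term: int = 0
    voted_for: Optional[int] = None

    def handle_vote_request(self, term: int, candidate_id: int) -> bool:
        """Grant at most one vote per term (log checks omitted in this sketch)."""
        if term < self.current_term:
            return False  # request comes from a stale term
        if term > self.current_term:
            self.current_term = term  # newer term: forget the old vote
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False  # already voted for someone else this term


def run_election(candidate: Node, peers: List[Node]) -> bool:
    """Start a new term, vote for self, ask peers, and check for a majority."""
    candidate.current_term += 1
    candidate.voted_for = candidate.node_id
    votes = 1  # the candidate's own vote
    for peer in peers:
        if peer.handle_vote_request(candidate.current_term, candidate.node_id):
            votes += 1
    cluster_size = len(peers) + 1
    return votes > cluster_size // 2


if __name__ == "__main__":
    nodes = [Node(i) for i in range(5)]
    print("timeout (ms):", round(election_timeout_ms()))
    print("node 0 elected:", run_election(nodes[0], nodes[1:]))
```

The one-vote-per-term rule in `handle_vote_request` is what stops two candidates from both collecting a majority in the same term.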

How Writes Work

Once we have a leader:

  1. Client sends a write to the leader
  2. Leader appends it to its log
  3. Leader sends the entry to all followers
  4. Followers append it and acknowledge
  5. Once a majority acknowledges, the entry is committed
  6. Leader applies it and responds to the client

The key: writes are committed only when a majority of nodes have the data. Even if some nodes are down, as long as the majority is alive, the system works.
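Here is a rough sketch of that write path, assuming an in-memory cluster where an unreachable follower simply fails to acknowledge. The `Follower` and `Leader` classes are hypothetical, and a real implementation would also track terms and log indexes and keep retrying followers that fall behind.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Follower:
    reachable: bool = True
    log: List[str] = field(default_factory=list)

    def append(self, entry: str) -> bool:
        """Append the entry and acknowledge, or return False if unreachable."""
        if not self.reachable:
            return False
        self.log.append(entry)
        return True


@dataclass
class Leader:
    followers: List[Follower]
    log: List[str] = field(default_factory=list)
    committed: List[str] = field(default_factory=list)

    def write(self, entry: str) -> bool:
        """Append locally, replicate, and commit once a majority holds the entry."""
        self.log.append(entry)
        acks = 1  # the leader's own copy counts toward the majority
        for follower in self.followers:
            if follower.append(entry):
                acks += 1
        cluster_size = len(self.followers) + 1
        if acks > cluster_size // 2:
            self.committed.append(entry)
            return True
        return False  # not committed; a real leader would keep retrying


if __name__ == "__main__":
    followers = [Follower(), Follower(), Follower(False), Follower(False)]
    leader = Leader(followers)
    print("committed with 2 of 5 nodes down:", leader.write("x=1"))
```

With two of five nodes down, the write still commits because the leader plus two followers make three copies, which is a majority.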

The Split-Brain Problem

This is the nightmare scenario. Two nodes both think they’re the leader. Both accept writes. Now we have conflicting data and no way to merge it.

How does this happen? Network partitions. If the network splits a 5-node cluster into a group of 3 and a group of 2, both groups might try to elect a leader.

Quorum Prevents Split-Brain

The solution is the quorum — a majority requirement. To do anything (elect a leader, commit a write), we need agreement from more than half the nodes.

In a 5-node cluster, the quorum is 3. If the network splits into groups of 3 and 2:

  • The group of 3 can elect a leader (3 >= quorum)
  • The group of 2 cannot elect a leader (2 < quorum)

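There can only ever be one group with a majority, because two groups that each hold more than half the nodes would have to overlap. A leader stranded on the minority side may still believe it’s in charge for a while, but it can’t commit any writes. Math prevents split-brain.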

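This is why consensus clusters usually use odd numbers of nodes (3, 5, 7). With 4 nodes the quorum is 3, so a 2-2 split leaves neither side with a majority and the system is stuck. The fourth node adds cost without tolerating any more failures than 3 nodes do.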
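The partition arithmetic is easy to check in code. In this small sketch, `quorum` and `sides_with_quorum` are helper names invented for the example, and a partition is described only by the number of nodes on each side.

```python
from typing import List


def quorum(cluster_size: int) -> int:
    """Smallest group that is strictly more than half the cluster."""
    return cluster_size // 2 + 1


def sides_with_quorum(cluster_size: int, partition: List[int]) -> List[int]:
    """Return the sides of a network split that are big enough to elect a leader."""
    needed = quorum(cluster_size)
    return [side for side in partition if side >= needed]


if __name__ == "__main__":
    print(sides_with_quorum(5, [3, 2]))  # only the side with 3 nodes qualifies
    print(sides_with_quorum(4, [2, 2]))  # no side qualifies
```

However the cluster is split, at most one side can ever appear in the result.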

Fault Tolerance

A Raft cluster of N nodes can tolerate floor((N-1)/2) failures, because a quorum of floor(N/2) + 1 nodes has to stay up:

Cluster Size    Quorum    Tolerated Failures
3               2         1
5               3         2
7               4         3

Going from 3 to 5 nodes gives us tolerance for one more failure. Going from 5 to 7 gives another. But more nodes means more coordination overhead. Most production systems use 3 or 5 nodes.
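The table can be reproduced with integer division. The sketch below (function names are our own) also prints the even cluster sizes, which shows why they are rarely worth it: 4 nodes tolerate no more failures than 3, and 6 no more than 5.

```python
def quorum(n: int) -> int:
    """Majority of an n-node cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a quorum is still alive."""
    return (n - 1) // 2


if __name__ == "__main__":
    print("Cluster Size  Quorum  Tolerated Failures")
    for n in range(3, 8):
        print(f"{n:<13} {quorum(n):<7} {tolerated_failures(n)}")
```

Quorum plus tolerated failures always adds up to the cluster size, which is another way of saying the cluster survives exactly as long as a majority does.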

Where Consensus Is Used

Distributed databases — CockroachDB, TiDB, and YugabyteDB use Raft to replicate data across nodes. Every write is committed through consensus.

Configuration management — etcd (used by Kubernetes) and ZooKeeper store cluster configuration. When config changes, all nodes must agree on the new value.

Leader election for other services — Services that need exactly one active instance (like a scheduler) use consensus to pick the leader. If the leader dies, a new one is elected.

Distributed locks — Only one service can hold a lock at a time. Consensus ensures everyone agrees on who holds it.

Tools That Implement Consensus

etcd

A distributed key-value store that uses Raft. It’s the backbone of Kubernetes — all cluster state (pods, services, configs) lives in etcd. Written in Go. Simple API. Very reliable.

ZooKeeper

The older, battle-tested option. Used by Kafka (older versions), Hadoop, and HBase. Uses a protocol called ZAB (ZooKeeper Atomic Broadcast), which is similar to Raft. Written in Java. More complex to operate than etcd.

Consul

By HashiCorp. Uses Raft. Focuses on service discovery and health checking in addition to key-value storage. Often used in microservice architectures.

Consensus vs Eventual Consistency

Consensus gives us strong consistency: a majority of nodes agree before we respond. The tradeoff is latency. Every write requires at least one network round trip from the leader to a majority of the cluster.

Eventual consistency (the default mode in systems like DynamoDB or Cassandra) is the opposite: writes are fast because we don’t wait for a majority to agree. Nodes will eventually converge. Faster, but we might read stale data.

Most systems mix both. Use consensus for the things that absolutely must be correct (leader election, config, financial transactions). Use eventual consistency for things where speed matters more (feeds, analytics, caches).

In simple language, distributed consensus is how servers vote on the truth. Raft elects a leader, the leader coordinates writes, and everything needs a majority to proceed. The quorum rule guarantees at most one leader per term, and a stale leader cut off from the majority can’t commit anything, so we never end up with two conflicting histories. It’s slower than just writing to one server, but it’s the price we pay for systems that survive failures without losing data.