Failover and Redundancy - High-Level Design

Things break. Servers crash, disks fail, networks go down, entire data centers lose power. The question isn’t if something will fail — it’s when. Failover and redundancy are how we design systems that keep working even when parts of them die.

What Is Redundancy?

Redundancy means having backup copies of everything critical. If one server dies, another takes over. If one database crashes, a replica is ready to go. We eliminate single points of failure.

Think of it like having a spare tire in our car. We don’t plan to get a flat, but when it happens, we’re not stuck on the side of the road.

What Is Failover?

Failover is the process of switching from a failed component to a healthy backup. It’s the mechanism that actually makes redundancy useful.

Without failover, having a backup database is like having a spare tire but no jack — the backup exists but we can’t switch to it.

Active-Passive Failover

The most common pattern. One server (the active/primary) handles all traffic. One or more backup servers (the passive/standby) sit idle, receiving replicated data but serving no traffic.

When the primary dies, a standby gets promoted to primary and starts handling traffic.

Active-Passive vs Active-Active

Active-Passive:
    Traffic → [Primary ✓]
                  ↓ replication
              [Standby 💤]
    Primary fails ↓
    Traffic → [Standby → New Primary ✓]

Active-Active:
            ↗ [Server A ✓]
    Traffic
            ↘ [Server B ✓]
    Server A fails ↓
    All Traffic → [Server B ✓]

Pros of active-passive:

Simple to set up
No data conflict issues (only one node writes at a time)
In some setups the standby can also serve read queries (as a read replica), even though it accepts no writes

Cons of active-passive:

The standby is wasted capacity — it sits idle most of the time
Failover isn’t instant — there’s a brief downtime during switchover
Risk of data loss if the primary dies before replicating its latest writes
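
To make the promotion step and the data-loss risk concrete, here is a minimal sketch of an active-passive pair. Everything in it is hypothetical (the `Node` and `ActivePassivePair` names, the in-memory `log`), and it assumes asynchronous replication that only runs when `replicate()` is called. It is a toy model, not how any particular database implements failover.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    log: list = field(default_factory=list)  # writes this node has stored
    alive: bool = True


class ActivePassivePair:
    """Toy active-passive pair: the primary takes writes, the standby waits."""

    def __init__(self, primary: Node, standby: Node):
        self.primary = primary
        self.standby = standby

    def write(self, value: str) -> None:
        # Only the primary accepts writes, so writes can never conflict.
        if not self.primary.alive:
            raise RuntimeError("primary is down; fail over first")
        self.primary.log.append(value)

    def replicate(self) -> None:
        # Asynchronous replication: copy whatever the standby is missing.
        missing = self.primary.log[len(self.standby.log):]
        self.standby.log.extend(missing)

    def failover(self) -> list:
        """Promote the standby and return the writes that never reached it."""
        lost = self.primary.log[len(self.standby.log):]
        self.primary, self.standby = self.standby, self.primary
        return lost


pair = ActivePassivePair(Node("db-1"), Node("db-2"))
pair.write("order-1")
pair.replicate()            # standby now has order-1
pair.write("order-2")       # accepted by the primary, not yet replicated
pair.primary.alive = False  # primary crashes before the next replicate()
lost = pair.failover()
print(pair.primary.name, pair.primary.log, lost)
# db-2 ['order-1'] ['order-2']
```

The write that landed after the last replication is gone once the standby is promoted. Replicating synchronously before acknowledging each write would close that gap, at the cost of slower writes.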

Active-Active Failover

All servers handle traffic simultaneously. A load balancer splits requests across them. If one dies, the others absorb the extra load.

Pros:

No wasted capacity — every server is doing useful work
Better performance — load is distributed
Faster failover — no promotion step, traffic just routes around the dead node

Cons:

More complex — we need to handle data conflicts (what if two servers accept conflicting writes?)
Need a load balancer in front
Data synchronization is harder

Best for: stateless application servers (easy) and databases where every node accepts writes (hard, but possible with systems such as Cassandra or CockroachDB).
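
The "traffic just routes around the dead node" behavior can be sketched with a tiny round-robin balancer. The class name `RoundRobinBalancer` and the server names are made up for illustration, and a real load balancer would mark nodes down based on health checks rather than a manual call.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy load balancer that rotates over servers and skips dead ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def route(self):
        # One full lap around the ring is enough to find any healthy server.
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers left")


lb = RoundRobinBalancer(["server-a", "server-b"])
before = [lb.route() for _ in range(4)]
lb.mark_down("server-a")  # e.g. after repeated failed health checks
after = [lb.route() for _ in range(4)]
print(before)  # ['server-a', 'server-b', 'server-a', 'server-b']
print(after)   # ['server-b', 'server-b', 'server-b', 'server-b']
```

There is no promotion step here: the surviving server was already serving traffic, so failover is just a routing decision. The catch is that server-b now carries twice the load, so each node needs enough headroom for that.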

Redundancy at Every Layer

A chain is only as strong as its weakest link. We need redundancy at every layer of our stack:

Layer	Redundancy Strategy
DNS	Multiple DNS providers, DNS failover
Load Balancer	Pair of LBs in active-passive
Application Servers	Multiple instances behind LB
Database	Primary + read replicas + standby
Cache	Redis Sentinel or Redis Cluster
Storage	S3 (11 nines durability, built-in replication)
Data Center	Multi-AZ or multi-region deployment

The most overlooked one? The load balancer itself. If we put all our servers behind a single load balancer and that LB dies, everything is down. Always have redundant LBs.
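
A quick back-of-the-envelope calculation shows why every layer matters. The sketch below assumes that failures are independent and that failover is instant and perfect, which real systems only approximate. The 99.9% per-component figure is an illustrative assumption, not a measured number.

```python
from math import prod


def serial_availability(layers):
    """Layers in a chain: the request needs every one of them to be up."""
    return prod(layers)


def redundant_availability(replicas):
    """Redundant copies of one layer: it is down only if all copies are down."""
    return 1 - prod(1 - a for a in replicas)


# Three layers (LB, app, DB), one instance each, each 99.9% available.
single = serial_availability([0.999, 0.999, 0.999])

# Same stack, but every layer has two independent instances.
layer = redundant_availability([0.999, 0.999])
doubled = serial_availability([layer, layer, layer])

print(f"{single:.4%}")   # 99.7003%
print(f"{doubled:.4%}")  # 99.9997%
```

Chaining single instances makes the whole stack less available than any one part, while duplicating each layer pushes it well past the availability of a single component. One non-redundant layer, such as a lone load balancer, would cap the whole stack at that layer's availability.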

The Nines of Availability

When we talk about uptime, we use “nines”:

Availability	Downtime/Year	Downtime/Month	Downtime/Week
99% (two nines)	3.65 days	7.31 hours	1.68 hours
99.9% (three nines)	8.77 hours	43.8 minutes	10.1 minutes
99.99% (four nines)	52.6 minutes	4.38 minutes	1.01 minutes
99.999% (five nines)	5.26 minutes	26.3 seconds	6.05 seconds

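Each extra nine cuts the allowed downtime by a factor of ten, so going from 99.9% to 99.99% is far harder and more expensive than the small difference in the number suggests. Most web apps aim for three nines (99.9%). Banks and critical infrastructure aim for four or five nines.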
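
The numbers in the table come from simple arithmetic: allowed downtime is the unavailable fraction times the length of the period. A small sketch, assuming a year of 365.25 days and a month of one twelfth of that (the function name `downtime_seconds` is our own):

```python
def downtime_seconds(availability_percent, period_days):
    """Allowed downtime, in seconds, for a given availability over a period."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * 86_400  # seconds per day


YEAR_DAYS = 365.25
MONTH_DAYS = YEAR_DAYS / 12
WEEK_DAYS = 7

for nines in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_seconds(nines, YEAR_DAYS)
    per_month = downtime_seconds(nines, MONTH_DAYS)
    per_week = downtime_seconds(nines, WEEK_DAYS)
    print(f"{nines}%: {per_year / 3600:.2f} h/year, "
          f"{per_month / 60:.2f} min/month, {per_week:.2f} s/week")
```

Plugging in a target is a useful sanity check: at five nines, a single one-minute restart uses up almost ten weeks of downtime budget.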

SLA vs SLO vs SLI

These terms come up in interviews:

SLI (Service Level Indicator): The actual measured metric. “Our p99 latency is 200ms.”
SLO (Service Level Objective): Our internal target. “We aim for 99.95% uptime.”
SLA (Service Level Agreement): A contractual promise to customers. “If we drop below 99.9%, we give you credits.”

The SLA is usually looser than the SLO, which leaves a safety margin. We target 99.95% internally (SLO) so that we don’t breach our 99.9% customer promise (SLA).
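
The relationship between the three terms can be sketched in a few lines. Here the SLI is assumed to be the fraction of successful requests, the thresholds are the 99.95% and 99.9% figures from the example above, and the function name `classify` and the request counts are made up for illustration.

```python
def classify(good_requests, total_requests, slo=0.9995, sla=0.999):
    """Compute the SLI and compare it with the internal and external targets."""
    sli = good_requests / total_requests  # the measured indicator
    if sli < sla:
        return sli, "SLA breached"
    if sli < slo:
        return sli, "SLO missed, SLA still met"
    return sli, "within SLO"


print(classify(999_600, 1_000_000))  # (0.9996, 'within SLO')
print(classify(999_200, 1_000_000))  # (0.9992, 'SLO missed, SLA still met')
print(classify(998_000, 1_000_000))  # (0.998, 'SLA breached')
```

The middle case is the whole point of the gap: missing the SLO is an internal warning that arrives before customers are owed anything.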

Health Checks and Heartbeats

How does the system know something is dead? Two main approaches:

Health checks (pull-based): A monitor pings each server periodically. “Are you alive? Return 200 OK.” If a server doesn’t respond after several retries, it’s considered dead.

Heartbeats (push-based): Each server periodically sends an “I’m alive” signal. If the monitor doesn’t hear from a server within a timeout, it’s considered dead.

Most load balancers use health checks. Cluster managers such as Kubernetes combine both: nodes send heartbeats to the control plane, and containers are probed with health checks.

Health Check: Monitor → "GET /health" → Server → "200 OK"
             Monitor → "GET /health" → Server → (no response)
             Monitor → "GET /health" → Server → (no response)
             Monitor: "Server is dead, removing from pool"
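
Both detection styles fit in a short sketch. The class names, the `probe` callable, and the explicit `now` timestamps are all assumptions made to keep the example deterministic. A real monitor would send an HTTP request to an endpoint like `/health` and read the system clock instead.

```python
from typing import Callable, Dict, Set


class HealthChecker:
    """Pull-based: the monitor probes servers and counts consecutive misses."""

    def __init__(self, probe: Callable[[str], bool], max_failures: int = 3):
        self.probe = probe  # hypothetical stand-in for "GET /health"
        self.max_failures = max_failures
        self.failures: Dict[str, int] = {}
        self.pool: Set[str] = set()

    def add(self, server: str) -> None:
        self.failures[server] = 0
        self.pool.add(server)

    def run_once(self) -> None:
        for server in list(self.failures):
            if self.probe(server):
                self.failures[server] = 0
                self.pool.add(server)  # a recovered server rejoins the pool
            else:
                self.failures[server] += 1
                if self.failures[server] >= self.max_failures:
                    self.pool.discard(server)  # dead: remove from the pool


class HeartbeatMonitor:
    """Push-based: servers report in, and silence past the timeout means dead."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def beat(self, server: str, now: float) -> None:
        self.last_seen[server] = now

    def dead(self, now: float) -> list:
        return sorted(s for s, t in self.last_seen.items()
                      if now - t > self.timeout)


responses = {"web-1": True, "web-2": False}
checker = HealthChecker(probe=lambda s: responses[s], max_failures=3)
checker.add("web-1")
checker.add("web-2")
for _ in range(3):
    checker.run_once()
print(checker.pool)  # {'web-1'}

monitor = HeartbeatMonitor(timeout=10.0)
monitor.beat("web-1", now=0.0)
monitor.beat("web-2", now=0.0)
monitor.beat("web-1", now=8.0)  # web-2 has gone quiet
print(monitor.dead(now=15.0))   # ['web-2']
```

Requiring several consecutive misses before removing a server trades detection speed for fewer false alarms, since one dropped packet should not trigger a failover.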

Failover Strategies for Databases

Database failover is the trickiest because data is involved:

Cold standby — Standby is off, brought online from backups. Slowest recovery (minutes to hours).
Warm standby — Standby receives replicated data but doesn’t serve traffic. Moderate recovery (seconds to minutes).
Hot standby — Standby is fully synced and ready to take over immediately. Fastest recovery (seconds).

Most cloud-managed databases (RDS, Cloud SQL) use warm or hot standby with automatic failover.
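
Choosing between the three usually comes down to how long we can afford to be down. The sketch below picks the cheapest mode that still meets a recovery-time target. The worst-case numbers are illustrative assumptions chosen to match the rough ranges above, not measurements, and the cost ordering (cold cheapest, hot most expensive) is also an assumption.

```python
from enum import Enum


class Standby(Enum):
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"


# ASSUMPTION: illustrative worst-case recovery times, in seconds.
WORST_CASE_RECOVERY_S = {
    Standby.COLD: 4 * 3600,  # "minutes to hours"
    Standby.WARM: 5 * 60,    # "seconds to minutes"
    Standby.HOT: 30,         # "seconds"
}


def cheapest_standby(max_recovery_s: float) -> Standby:
    """Return the cheapest standby mode whose worst case fits the target."""
    # ASSUMPTION: cost rises from cold to warm to hot.
    for mode in (Standby.COLD, Standby.WARM, Standby.HOT):
        if WORST_CASE_RECOVERY_S[mode] <= max_recovery_s:
            return mode
    raise ValueError("no standby mode meets this recovery target")


print(cheapest_standby(24 * 3600))  # Standby.COLD
print(cheapest_standby(10 * 60))    # Standby.WARM
print(cheapest_standby(60))         # Standby.HOT
```

A target tighter than what a hot standby can deliver raises an error here, which mirrors reality: at that point the design needs something other than failover, such as an active-active setup.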

Key Takeaway

In short, redundancy is having backups of everything critical, and failover is the process of switching to those backups when something breaks. Active-passive is simpler (one server works, one waits). Active-active is more efficient (all servers work, survivors absorb the load). Build redundancy at every layer, know our uptime target in nines, and make sure our health checks actually work. The goal is to make our system boring: it keeps running no matter what breaks.