Things break. Servers crash, disks fail, networks go down, entire data centers lose power. The question isn’t if something will fail — it’s when. Failover and redundancy are how we design systems that keep working even when parts of them die.
What Is Redundancy?
Redundancy means having backup copies of everything critical. If one server dies, another takes over. If one database crashes, a replica is ready to go. We eliminate single points of failure.
Think of it like having a spare tire in our car. We don’t plan to get a flat, but when it happens, we’re not stuck on the side of the road.
What Is Failover?
Failover is the process of switching from a failed component to a healthy backup. It’s the mechanism that actually makes redundancy useful.
Without failover, having a backup database is like having a spare tire but no jack — the backup exists but we can’t switch to it.
Active-Passive Failover
The most common pattern. One server (the active/primary) handles all traffic. One or more backup servers (the passive/standby) sit idle, receiving replicated data but serving no traffic.
When the primary dies, a standby gets promoted to primary and starts handling traffic.
Pros:
- Simple to set up
- No data conflict issues (only one node writes at a time)
- Standby can optionally serve read-only queries if we configure it as a read replica
Cons:
- The standby is wasted capacity — it sits idle most of the time
- Failover isn’t instant — there’s a brief downtime during switchover
- Risk of data loss if the primary dies before its latest writes have been replicated (possible when replication is asynchronous)
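To make the promotion step and the data-loss risk concrete, here is a minimal in-memory sketch. The `Node` and `ActivePassiveCluster` classes, the `replicated_upto` counter, and the rule of promoting the most up-to-date standby are all illustrative assumptions, not how any particular database implements failover.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True
    replicated_upto: int = 0  # sequence number of the last write this node holds


class ActivePassiveCluster:
    """One primary takes all writes; standbys only receive replicated data."""

    def __init__(self, primary: Node, standbys: list[Node]):
        self.primary = primary
        self.standbys = standbys

    def write(self) -> int:
        # Only the primary accepts writes, so there are no write conflicts.
        self.primary.replicated_upto += 1
        return self.primary.replicated_upto

    def replicate(self) -> None:
        # Asynchronous replication: standbys catch up only when this runs.
        for standby in self.standbys:
            if standby.healthy:
                standby.replicated_upto = self.primary.replicated_upto

    def failover(self) -> int:
        """Promote the most up-to-date healthy standby; return writes lost."""
        candidates = [s for s in self.standbys if s.healthy]
        if not candidates:
            raise RuntimeError("no healthy standby to promote")
        new_primary = max(candidates, key=lambda s: s.replicated_upto)
        lost = self.primary.replicated_upto - new_primary.replicated_upto
        self.standbys.remove(new_primary)
        self.primary = new_primary
        return lost


cluster = ActivePassiveCluster(Node("db-1"), [Node("db-2"), Node("db-3")])
cluster.write()
cluster.write()
cluster.replicate()               # both standbys now hold writes 1 and 2
cluster.write()                   # write 3 reaches only the primary
cluster.primary.healthy = False   # primary dies before replicating write 3
lost_writes = cluster.failover()
print(cluster.primary.name, lost_writes)
```

In this run the third write never reached a standby, so the promoted node is missing one write. That gap is exactly the data-loss risk listed in the cons above.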
Active-Active Failover
All servers handle traffic simultaneously. A load balancer splits requests across them. If one dies, the others absorb the extra load.
Pros:
- No wasted capacity — every server is doing useful work
- Better performance — load is distributed
- Faster failover — no promotion step, traffic just routes around the dead node
Cons:
- More complex — we need to handle data conflicts (what if two servers accept conflicting writes?)
- Need a load balancer in front
- Data synchronization is harder
Best for: Stateless application servers (easy), and databases with multi-master replication (hard, but possible with systems such as Cassandra or CockroachDB).
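A rough sketch of how traffic "routes around" a dead node in an active-active setup. The `RoundRobinBalancer` class and the server names are hypothetical, and round-robin is just one possible way to split requests; real load balancers offer several strategies.

```python
from itertools import count


class RoundRobinBalancer:
    """Spread requests over all healthy servers; skip any marked dead."""

    def __init__(self, servers: list[str]):
        self.servers = list(servers)
        self.healthy = set(servers)
        self._counter = count()

    def mark_down(self, server: str) -> None:
        self.healthy.discard(server)

    def mark_up(self, server: str) -> None:
        if server in self.servers:
            self.healthy.add(server)

    def pick(self) -> str:
        # Keep the original ordering so the rotation is deterministic.
        pool = [s for s in self.servers if s in self.healthy]
        if not pool:
            raise RuntimeError("no healthy servers")
        return pool[next(self._counter) % len(pool)]


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
before = [lb.pick() for _ in range(3)]   # all three servers take a turn
lb.mark_down("app-2")                    # no promotion step needed
after = [lb.pick() for _ in range(4)]    # only app-1 and app-3 are used
print(before, after)
```

Note that there is no promotion here: the survivors were already serving traffic, so failover is just a smaller pool.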
Redundancy at Every Layer
A chain is only as strong as its weakest link. We need redundancy at every layer of our stack:
| Layer | Redundancy Strategy |
|---|---|
| DNS | Multiple DNS providers, DNS failover |
| Load Balancer | Pair of LBs in active-passive |
| Application Servers | Multiple instances behind LB |
| Database | Primary + read replicas + standby |
| Cache | Redis Sentinel or Redis Cluster |
| Storage | S3 (11 nines durability, built-in replication) |
| Data Center | Multi-AZ or multi-region deployment |
The most overlooked one? The load balancer itself. If we put all our servers behind a single load balancer and that LB dies, everything is down. Always have redundant LBs.
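A back-of-the-envelope calculation shows why the single load balancer hurts so much. This sketch assumes component failures are independent (rarely fully true in practice) and uses made-up availability figures: 99.9% for each load balancer and 99% for each application server.

```python
from math import prod


def parallel(availability: float, copies: int) -> float:
    """Redundant copies of one component: down only if every copy is down."""
    return 1 - (1 - availability) ** copies


def series(*availabilities: float) -> float:
    """A chain of layers: up only if every layer is up."""
    return prod(availabilities)


# Three redundant app servers behind one LB vs. behind a redundant LB pair.
single_lb = series(0.999, parallel(0.99, 3))
paired_lb = series(parallel(0.999, 2), parallel(0.99, 3))
print(f"single LB: {single_lb:.6f}")
print(f"paired LB: {paired_lb:.6f}")
```

Under these assumptions, the three app servers together are far more available than the lone load balancer in front of them, so the single LB caps the availability of the whole chain.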
The Nines of Availability
When we talk about uptime, we use “nines”:
| Availability | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% (three nines) | 8.77 hours | 43.8 minutes | 10.1 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds |
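Each extra nine cuts the allowed downtime by a factor of ten, so going from 99.9% to 99.99% is far harder and more expensive than the small change in the number suggests. Most web apps aim for three nines (99.9%). Banks and critical infra aim for four or five nines.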
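The table values are simple arithmetic: multiply the length of the period by the fraction of time we are allowed to be down. A small sketch, assuming a year of 365.25 days (which is what the figures above appear to use); the function name is our own.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
SECONDS_PER_WEEK = 7 * 24 * 3600


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Seconds of downtime permitted in a period at the given availability."""
    return period_seconds * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    per_year_min = allowed_downtime_seconds(target) / 60
    per_week_min = allowed_downtime_seconds(target, SECONDS_PER_WEEK) / 60
    print(f"{target}%: {per_year_min:.1f} min/year, {per_week_min:.2f} min/week")
```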
SLA vs SLO vs SLI
These terms come up in interviews:
- SLI (Service Level Indicator) — The actual metric. “Our p99 latency is 200ms.”
- SLO (Service Level Objective) — Our internal target. “We aim for 99.9% uptime.”
- SLA (Service Level Agreement) — A legal promise to customers. “If we drop below 99.9%, we give you credits.”
The SLA is typically looser than the SLO, which leaves us a safety margin. For example, we target 99.95% internally (SLO) so we don’t breach our 99.9% customer promise (SLA).
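One way to see how the three terms relate is to compute an SLI and compare it against both thresholds. This is a sketch under assumptions: it measures availability as the share of successful requests (one common choice, not the only one), and the function names and request counts are invented for illustration.

```python
def availability_sli(successful: int, total: int) -> float:
    """Measured availability: percentage of requests that succeeded."""
    if total == 0:
        return 100.0  # no traffic means nothing failed
    return 100.0 * successful / total


def evaluate(sli: float, slo: float = 99.95, sla: float = 99.9) -> str:
    """Compare the measured SLI with the internal SLO and the customer SLA."""
    if sli < sla:
        return "SLA breached"
    if sli < slo:
        return "SLO missed, SLA still met"
    return "within SLO"


sli = availability_sli(successful=999_300, total=1_000_000)
status = evaluate(sli)
print(f"{sli:.2f}% -> {status}")
```

Here the measured value falls between the two thresholds: we missed our internal target, but the gap between SLO and SLA means customers are not yet owed credits.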
Health Checks and Heartbeats
How does the system know something is dead? Two main approaches:
Health checks (pull-based): A monitor pings each server periodically. “Are you alive? Return 200 OK.” If a server doesn’t respond after several retries, it’s considered dead.
Heartbeats (push-based): Each server periodically sends an “I’m alive” signal. If the monitor doesn’t hear from a server within a timeout, it’s considered dead.
Most load balancers use health checks. Most cluster managers (like Kubernetes) use a combination of both.
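The heartbeat approach can be sketched as a monitor that records the last time it heard from each server. The `HeartbeatMonitor` class and the 10-second timeout are illustrative assumptions; the optional `now` argument exists only so the example can use fixed timestamps instead of waiting on a real clock.

```python
import time
from typing import Optional


class HeartbeatMonitor:
    """Push-based detection: servers report in, and silence means dead."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # server name -> time of last heartbeat

    def record_heartbeat(self, server: str, now: Optional[float] = None) -> None:
        self.last_seen[server] = time.monotonic() if now is None else now

    def dead_servers(self, now: Optional[float] = None) -> list:
        current = time.monotonic() if now is None else now
        return sorted(
            server
            for server, seen in self.last_seen.items()
            if current - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.record_heartbeat("app-1", now=0.0)
monitor.record_heartbeat("app-2", now=0.0)
monitor.record_heartbeat("app-1", now=8.0)   # app-2 stays silent
dead = monitor.dead_servers(now=12.0)        # app-2 last seen 12 seconds ago
print(dead)
```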
Health check sequence:
Monitor → "GET /health" → Server → "200 OK"
Monitor → "GET /health" → Server → (no response)
Monitor → "GET /health" → Server → (no response)
Monitor: "Server is dead, removing from pool"
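The sequence above can be sketched as a checker that removes a server after a number of consecutive failed probes. The `HealthChecker` class, the threshold of three failures, and the `fake_probe` stand-in are all assumptions for illustration; a real monitor would issue an HTTP request to `/health` with a timeout instead.

```python
from typing import Callable


class HealthChecker:
    """Pull-based detection: drop a server after N consecutive failed probes."""

    def __init__(self, servers: list, probe: Callable[[str], bool],
                 max_failures: int = 3):
        self.probe = probe
        self.max_failures = max_failures
        self.failures = {server: 0 for server in servers}
        self.pool = set(servers)

    def run_once(self) -> None:
        for server in self.failures:
            if self.probe(server):
                self.failures[server] = 0   # any success resets the count
                self.pool.add(server)       # recovered servers rejoin the pool
            else:
                self.failures[server] += 1
                if self.failures[server] >= self.max_failures:
                    self.pool.discard(server)


def fake_probe(server: str) -> bool:
    # Stand-in for an HTTP GET /health that expects a 200 OK.
    return server != "app-2"


checker = HealthChecker(["app-1", "app-2"], fake_probe, max_failures=3)
for _ in range(3):
    checker.run_once()
print(sorted(checker.pool))
```

Requiring several failures in a row is a trade-off: it avoids removing a server over one dropped packet, but it also delays detection by a few probe intervals.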
Failover Strategies for Databases
Database failover is the trickiest because data is involved:
- Cold standby — Standby is off, brought online from backups. Slowest recovery (minutes to hours).
- Warm standby — Standby receives replicated data but doesn’t serve traffic. Moderate recovery (seconds to minutes).
- Hot standby — Standby is fully synced and ready to take over immediately. Fastest recovery (seconds).
Most cloud-managed databases (RDS, Cloud SQL) use warm or hot standby with automatic failover.
Key Takeaway
In simple language, redundancy is having backups of everything, and failover is the process of switching to those backups when something breaks. Active-passive is simpler (one server works, one waits). Active-active is more efficient (all servers work, survivors absorb the load). Build redundancy at every layer, understand our uptime target in nines, and make sure our health checks actually work. The goal is making our system boring — it keeps running no matter what breaks.