High Availability and Disaster Recovery - DevOps

Everything fails eventually. Servers crash. Disks die. Entire data centers go offline (it happened to AWS us-east-1 multiple times). High Availability (HA) is about designing systems that keep running even when individual components fail.

In simple language, HA means our users don’t notice when something breaks because something else picks up the slack immediately.

Single Points of Failure

The first step to HA is finding our single points of failure (SPOF) — components where, if they go down, the entire system goes down. Common ones:

One database server with no replica
One load balancer
One data center / availability zone
One DNS provider
One person who knows how the system works (the “bus factor”)

For every SPOF we find, the fix is the same: redundancy.

Redundancy Patterns

Active-Active — multiple instances all serve traffic simultaneously. If one dies, the others keep going. This is what we do with multiple web servers behind a load balancer. More capacity, more resilient, but we need to handle shared state carefully.

Active-Passive — one instance handles all traffic. A standby instance waits and takes over if the primary fails. Simpler to reason about, but the standby is sitting idle. Common with databases (primary-replica setup).
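To make the difference concrete, here is a minimal sketch of how routing might look under each pattern. The `Instance` class, both function names, and the instance names are made up for illustration; a real load balancer or database failover manager does far more (health probing, connection draining, fencing).

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool = True


def route_active_active(instances: list[Instance], request_id: int) -> str:
    """Spread requests round-robin across every instance that is currently healthy."""
    healthy = [inst for inst in instances if inst.healthy]
    if not healthy:
        raise RuntimeError("no healthy instances left")
    return healthy[request_id % len(healthy)].name


def route_active_passive(primary: Instance, standby: Instance) -> str:
    """Send everything to the primary; fall back to the standby only if it is down."""
    if primary.healthy:
        return primary.name
    if standby.healthy:
        return standby.name
    raise RuntimeError("primary and standby are both down")


# Active-active: three web servers share the load.
web = [Instance("web-1"), Instance("web-2"), Instance("web-3")]
print([route_active_active(web, r) for r in range(4)])
# ['web-1', 'web-2', 'web-3', 'web-1']

web[1].healthy = False  # web-2 dies; the survivors absorb its share
print([route_active_active(web, r) for r in range(4)])
# ['web-1', 'web-3', 'web-1', 'web-3']

# Active-passive: the standby is idle until the primary fails.
db_primary, db_standby = Instance("db-primary"), Instance("db-standby")
print(route_active_passive(db_primary, db_standby))  # db-primary
db_primary.healthy = False
print(route_active_passive(db_primary, db_standby))  # db-standby
```

Notice that in the active-active case the two surviving servers each take half the traffic instead of a third, so they need spare capacity for the pattern to actually help.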

RPO and RTO

These two metrics define our disaster recovery requirements. They’re the first questions any interviewer will ask about DR.

RPO and RTO on a Timeline

Last Backup |← RPO: how much data we lose →| DISASTER |← RTO: how long until we’re back →| System Restored

RPO (Recovery Point Objective): how much data can we afford to lose? If our RPO is 1 hour, we need backups at least every hour. RPO = 0 means we can’t lose any data, which requires synchronous replication (every write is confirmed on a second copy before it is acknowledged).

RTO (Recovery Time Objective): how fast must we recover? If our RTO is 30 minutes, we need to be back online within 30 minutes of a failure, and that window includes detecting the failure, not just running the restore.

Lower RPO and RTO = more expensive. A bank might need RPO = 0 and RTO = 5 minutes. A personal blog? RPO = 24 hours and RTO = “whenever we get around to it” is fine.
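As a quick worked example, the sketch below takes the three points on the timeline (last backup, disaster, system restored) and checks them against our targets. The function name `recovery_report` and the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta


def recovery_report(last_backup: datetime, disaster: datetime, restored: datetime,
                    rpo: timedelta, rto: timedelta) -> dict:
    """Compare what actually happened in an incident against the RPO/RTO targets."""
    data_loss = disaster - last_backup  # everything written in this window is gone
    downtime = restored - disaster      # how long users were affected
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


report = recovery_report(
    last_backup=datetime(2024, 3, 15, 9, 0),
    disaster=datetime(2024, 3, 15, 9, 45),
    restored=datetime(2024, 3, 15, 10, 30),
    rpo=timedelta(hours=1),
    rto=timedelta(minutes=30),
)
print(report["rpo_met"], report["rto_met"])  # True False
```

Here we lost 45 minutes of data, which is inside the 1-hour RPO, but we were down for 45 minutes, which misses the 30-minute RTO. Hourly backups were enough; the recovery process was too slow.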

Backup Strategies

Full backup: copy everything. Simple but slow and storage-heavy. Often done weekly.
Incremental backup: only copy what changed since the last backup of any kind. Fast and small, but recovery requires the last full backup plus every increment since then, replayed in order.
Differential backup: copy what changed since the last full backup. Bigger than incremental but simpler to restore, since we only need the last full backup plus the latest differential.
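The restore-time difference between the three is easier to see in code. This is a small sketch, not a real backup tool: `restore_chain` and the backup names are hypothetical, and it assumes backups are listed oldest first.

```python
def restore_chain(backups: list[tuple[str, str]]) -> list[str]:
    """Return the backups needed, in order, to restore to the most recent state.

    `backups` is a chronological list of (name, kind) pairs, where kind is
    "full", "incremental", or "differential".
    """
    full_positions = [i for i, (_, kind) in enumerate(backups) if kind == "full"]
    if not full_positions:
        raise ValueError("cannot restore without a full backup")
    start = full_positions[-1]
    chain = [backups[start][0]]
    for name, kind in backups[start + 1:]:
        if kind == "differential":
            # A differential holds everything since the full backup,
            # so it replaces whatever was queued up after the full.
            chain = [chain[0], name]
        elif kind == "incremental":
            # An incremental only holds changes since the previous backup,
            # so every one of them is needed, in order.
            chain.append(name)
        else:
            raise ValueError(f"unexpected backup kind after the last full: {kind}")
    return chain


incremental_week = [("sun-full", "full"), ("mon", "incremental"),
                    ("tue", "incremental"), ("wed", "incremental")]
differential_week = [("sun-full", "full"), ("mon", "differential"),
                     ("tue", "differential"), ("wed", "differential")]

print(restore_chain(incremental_week))   # ['sun-full', 'mon', 'tue', 'wed']
print(restore_chain(differential_week))  # ['sun-full', 'wed']
```

With incrementals, one missing or corrupt file in the middle of the chain breaks the restore; with differentials, only two files have to be good.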

The golden rule: test our backups regularly. An untested backup is not a backup. We should be doing restore drills, not just hoping the backup works when disaster strikes.

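# PostgreSQL backup example (logical dumps with pg_dump)
# Full backup of one database
pg_dump -h localhost -U admin mydb > backup_full_2024-03-15.sql

# Daily backup with compression (run it from cron or another scheduler to automate it)
pg_dump -h localhost -U admin mydb | gzip > "backup_$(date +%Y%m%d).sql.gz"

# Restore from backup (the target database must already exist).
# ON_ERROR_STOP makes psql abort on the first error instead of carrying on.
gunzip -c backup_20240315.sql.gz | psql -v ON_ERROR_STOP=1 -h localhost -U admin mydb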
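A restore drill means actually running that restore somewhere safe and checking the result. The sketch below shows the shape of such a drill, but note the swap: it uses Python's built-in SQLite module as a stand-in so the example is self-contained, not PostgreSQL. The helper names (`table_counts`, `restore_drill`) and the `orders` table are made up for illustration.

```python
import sqlite3


def table_counts(conn: sqlite3.Connection) -> dict:
    """Return {table_name: row_count} for every table in the database."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )]
    return {t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0] for t in tables}


def restore_drill(source: sqlite3.Connection) -> bool:
    """Back up `source`, restore into a scratch database, and compare row counts."""
    dump_sql = "\n".join(source.iterdump())  # backup step (the role pg_dump plays)
    scratch = sqlite3.connect(":memory:")    # never restore over the live database
    try:
        scratch.executescript(dump_sql)      # restore step (the role psql plays)
        return table_counts(scratch) == table_counts(source)
    finally:
        scratch.close()


live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
live.executemany("INSERT INTO orders (item) VALUES (?)", [("book",), ("pen",), ("lamp",)])
live.commit()

print(restore_drill(live))  # True
```

Row counts are a deliberately shallow check. A real drill would usually also run a few application queries against the restored copy and record how long the restore took, since that duration feeds directly into our RTO.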

Multi-Region Architecture

For serious HA, we run our system across multiple regions (e.g., Mumbai + Singapore). If an entire region goes down, the other takes over. This involves:

Database replication across regions (asynchronous is common; synchronous adds a cross-region round trip to every write)
DNS failover — Route 53 health checks or similar, redirecting traffic to the healthy region
Data consistency trade-offs — with async replication, a failover might lose the last few seconds of writes (this ties back to RPO)
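Here is a rough sketch of the failover decision and the data-loss estimate that goes with it. Everything in it is an assumption for illustration: the `Region` class, the function names, the 3-second lag, and the 200 writes per second. Real DNS failover also has to wait for cached DNS records to expire before all clients move over.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    healthy: bool
    replication_lag_s: float = 0.0  # how far this region's replica trails the primary


def pick_region(regions: list[Region]) -> Region:
    """Return the first healthy region; the list order is the failover priority."""
    for region in regions:
        if region.healthy:
            return region
    raise RuntimeError("no healthy region available")


def writes_at_risk(replication_lag_s: float, writes_per_s: float) -> float:
    """Rough estimate of writes that had not reached the replica when the primary died."""
    return replication_lag_s * writes_per_s


regions = [
    Region("mumbai", healthy=False),                            # primary region is down
    Region("singapore", healthy=True, replication_lag_s=3.0),   # async replica, 3 s behind
]
target = pick_region(regions)
print(target.name, writes_at_risk(target.replication_lag_s, writes_per_s=200))
# singapore 600.0
```

Three seconds of lag sounds tiny, but at 200 writes per second that is about 600 writes gone. Whether that is acceptable is exactly the RPO question.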

Health Checks and Failover

Load balancers and orchestrators use health checks to detect failures:

Liveness check — “is the process alive?” If not, restart it.
Readiness check — “can it handle traffic?” A server might be alive but still warming up its cache.

When a health check fails, traffic is automatically routed away from the unhealthy instance. In Kubernetes, a failed liveness probe makes the kubelet restart the container, while a failed readiness probe removes the pod from the Service’s endpoints so it stops receiving traffic until the probe passes again.
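On the application side, the two checks usually end up as two separate HTTP endpoints. Below is a minimal sketch using only Python's standard library. The paths `/healthz` and `/readyz` are a common convention rather than a requirement, and the `ready` flag, `check`, and `serve` are names invented for this example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once warm-up work (loading caches, opening connections) has finished.
ready = threading.Event()


def check(path: str) -> tuple[int, str]:
    """Map a probe path to an HTTP status code and a short body."""
    if path == "/healthz":  # liveness: the process is up and able to answer
        return 200, "alive"
    if path == "/readyz":   # readiness: safe to send real traffic?
        return (200, "ready") if ready.is_set() else (503, "warming up")
    return 404, "not found"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = check(self.path)
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs


def serve(port: int = 8080) -> None:
    """Run the health endpoints until interrupted (blocks the calling thread)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoints, and calling `ready.set()` after warm-up flips `/readyz` from 503 to 200. Keeping the two checks separate matters: a server that is still warming up should be kept out of rotation, not restarted.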

Chaos Engineering

How do we know our HA setup actually works? We intentionally break things in a controlled way and see if the system recovers.

Netflix pioneered this with Chaos Monkey — a tool that randomly kills production instances during business hours. If the system handles it gracefully, great. If not, we found a weakness before our users did.

Other chaos experiments:

Kill a database primary and see if failover works
Inject network latency between services
Fill up a disk to 100%
Simulate an entire availability zone failure

The point isn’t to cause outages — it’s to build confidence that our systems can handle real failures. Start small (kill one pod), build up to bigger experiments (simulate a region failure) as confidence grows.
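The smallest possible chaos experiment has three parts: a steady-state hypothesis, an injected failure, and a check afterwards. Here is a toy simulation of that loop. It only flips flags in a dictionary, so it is safe to run anywhere; the function names and pod names are invented, and a real experiment would terminate actual instances and measure real traffic.

```python
import random


def kill_random_instance(instances: dict, rng: random.Random) -> str:
    """Mark one randomly chosen healthy instance as down and return its name."""
    healthy = sorted(name for name, up in instances.items() if up)
    if not healthy:
        raise RuntimeError("nothing left to kill")
    victim = rng.choice(healthy)
    instances[victim] = False
    return victim


def steady_state_ok(instances: dict, min_healthy: int = 1) -> bool:
    """Hypothesis under test: at least `min_healthy` instances can still serve traffic."""
    return sum(instances.values()) >= min_healthy


def run_experiment(instances: dict, rng: random.Random, min_healthy: int = 1):
    """Inject one failure, then report the victim and whether the hypothesis held."""
    victim = kill_random_instance(instances, rng)
    return victim, steady_state_ok(instances, min_healthy)


rng = random.Random(42)  # seeded so a run can be repeated

cluster = {"pod-a": True, "pod-b": True, "pod-c": True}
victim, survived = run_experiment(cluster, rng)
print(f"killed {victim}, still serving: {survived}")  # still serving: True

database = {"db-primary": True}  # no replica: a single point of failure
victim, survived = run_experiment(database, rng)
print(f"killed {victim}, still serving: {survived}")  # killed db-primary, still serving: False
```

The second run is the useful one. The experiment failed, which means we found a single point of failure on our own schedule instead of during an outage.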

In simple language, HA is about redundancy (so something is always ready to take over), RPO and RTO define how much data loss and downtime we can tolerate, backups save us from data loss (but only if we test them), and chaos engineering shows that it all actually works before a real disaster puts it to the test.