Load Balancing Basics


One server can only handle so much. At some point, our app gets enough traffic that a single machine maxes out its CPU, memory, or network bandwidth. Load balancing solves this by distributing incoming requests across multiple servers, so no single server gets overwhelmed.

Why We Need Load Balancing

Without a load balancer, we have a single point of failure. If that one server crashes, our entire app goes down. With a load balancer and multiple backend servers:

  • Reliability — if one server dies, traffic routes to the healthy ones
  • Performance — requests spread across machines, each handling a fraction of the load
  • Scalability — need more capacity? Add another server behind the load balancer

        User A   User B   User C   User D
           ↓        ↓        ↓        ↓
                  Load Balancer
               (distributes traffic)
              ↙         ↓         ↘
       Server 1     Server 2     Server 3
        (A, D)        (B)          (C)

Load Balancing Algorithms

The algorithm determines how the load balancer picks which server gets the next request.

Round Robin — each server gets a turn in order: 1, 2, 3, 1, 2, 3, … Simple and works well when all servers are identical.
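In Nginx, round robin is the default for an upstream group, so no extra directive is needed. A minimal sketch, assuming three identical backends at placeholder addresses and a hypothetical group name `app_servers`; the block belongs inside the `http` context:

```nginx
# Round robin is the default: requests rotate 1, 2, 3, 1, 2, 3, ...
upstream app_servers {
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
    server 10.0.0.3:3000;
}
```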

Least Connections — sends the request to whichever server currently has the fewest active connections. Better when requests take varying amounts of time to process.

IP Hash — hashes the client’s IP address to always route that client to the same server. Useful when we need sticky sessions (more on that below).
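Nginx exposes this as the `ip_hash` directive. A minimal sketch with placeholder backends and a hypothetical group name `app_servers`. Note that for IPv4 clients Nginx hashes only the first three octets of the address, so all clients in the same /24 network land on the same server:

```nginx
upstream app_servers {
    ip_hash;                 # choose the server from a hash of the client IP
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
    server 10.0.0.3:3000;
}
```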

Weighted Round Robin — like round robin, but some servers get more traffic. If server 1 has 16 GB RAM and server 2 has 8 GB, we give server 1 double the weight.
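The 16 GB / 8 GB example could be written in Nginx roughly as follows. This is a sketch with placeholder addresses and a hypothetical group name; in practice weights are usually tuned from measured capacity rather than RAM alone:

```nginx
upstream app_servers {
    server 10.0.0.1:3000 weight=2;   # 16 GB machine: two of every three requests
    server 10.0.0.2:3000 weight=1;   # 8 GB machine: one of every three (1 is the default)
}
```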

Algorithm            Best For                        Downside
Round Robin          Equal servers, stateless apps   Ignores server load
Least Connections    Varying request times           Slightly more overhead to track
IP Hash              Session stickiness              Uneven distribution if many users share IPs
Weighted             Mixed server specs              Manual config, doesn’t adapt dynamically

Layer 4 vs Layer 7 Load Balancing

Load balancers operate at different network layers.

Layer 4 (Transport — TCP/UDP) — looks at IP addresses and ports only. It doesn’t inspect the actual HTTP content. Very fast because it just shuffles TCP packets. Think of it as a mail sorter that only reads the address on the envelope.
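As a rough illustration, Nginx can act as a Layer 4 balancer through its `stream` module, which forwards TCP connections without parsing what is inside them. A sketch, assuming Nginx was built with stream support and using a hypothetical pool of read-only database replicas on port 5432. The `stream` block sits at the top level of the config, next to `http`, not inside it:

```nginx
stream {
    upstream db_replicas {
        server 10.0.1.1:5432;
        server 10.0.1.2:5432;
    }

    server {
        listen 5432;               # accept raw TCP connections
        proxy_pass db_replicas;    # no http:// prefix: the payload is never inspected
    }
}
```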

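Layer 7 (Application — HTTP) — inspects the full HTTP request: URL path, headers, cookies, and optionally the body. It can make content-aware routing decisions like “send /api/* to the API servers and /static/* to the static file servers.” It costs more per request than Layer 4 because the balancer must parse HTTP (and terminate TLS for HTTPS traffic), but it is much more flexible.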
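Path-based routing of this kind might look like the following in Nginx. The pool names `api_pool` and `static_pool` and all addresses are assumptions made for illustration:

```nginx
upstream api_pool {
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
}

upstream static_pool {
    server 10.0.0.10:8080;
}

server {
    listen 80;

    location /api/ {
        proxy_pass http://api_pool;       # URLs starting with /api/
    }

    location /static/ {
        proxy_pass http://static_pool;    # URLs starting with /static/
    }
}
```

The routing decision depends on the URL path, which a Layer 4 balancer never sees.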

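Most web applications use Layer 7 load balancing for its content-aware routing. Layer 4 remains common for non-HTTP traffic, such as database connections, and where raw throughput matters most.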

Health Checks

The load balancer regularly pings each backend server to make sure it’s still alive. If a server stops responding, the load balancer removes it from the pool until it recovers.

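upstream backend {
    # Passive health checking: failures are counted on real client requests.
    # After 3 failed attempts within 30s, Nginx marks the server unavailable
    # and skips it for the next 30s before trying it again.
    # (Active probes via the health_check directive require NGINX Plus.)
    server 10.0.0.1:3000 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:3000 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:3000 max_fails=3 fail_timeout=30s;
}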

Without health checks, the load balancer would keep sending requests to a dead server, and users would get errors.
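In Nginx, what counts as a failed attempt for `max_fails` is governed by `proxy_next_upstream`, which also lets a failed request be retried on another server. A sketch, assuming the `backend` group defined above; the timeout value and the choice of status codes are illustrative:

```nginx
server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_connect_timeout 2s;    # give up quickly on an unreachable server
        # Treat connection errors, timeouts, and 502/503 responses as failures
        # and retry the request on the next server in the group
        proxy_next_upstream error timeout http_502 http_503;
    }
}
```

By default Nginx does not retry non-idempotent requests such as POST once they have been sent to a backend, so some errors can still reach the user.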

Sticky Sessions

Normally a load balancer doesn’t care which server handled our previous request. But some apps store session data in memory on the server (instead of in a database or Redis). If our next request goes to a different server, our session is lost.

Sticky sessions (session affinity) fix this by always routing the same user to the same server. IP hash is one way. Cookie-based stickiness is another — the load balancer sets a cookie that identifies which backend to use.
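Cookie-based stickiness as described here is available in Nginx only through the commercial NGINX Plus `sticky cookie` directive. Open-source Nginx will reject this config. A sketch under that assumption, with a hypothetical group name and an arbitrary cookie name `srv_id`:

```nginx
upstream app_servers {
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
    server 10.0.0.3:3000;
    # NGINX Plus only: the balancer sets a cookie named srv_id that pins
    # the client to the server that handled its first request
    sticky cookie srv_id expires=1h path=/;
}
```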

The better solution is to make our app stateless — store sessions in Redis or a database, so any server can handle any request. Then we don’t need sticky sessions at all.

Nginx as a Load Balancer

# Define our group of backend servers
upstream api_servers {
    least_conn;                       # pick the server with the fewest active connections
    server 10.0.0.1:3000 weight=3;   # weights are factored in: about 3x the share of each other server
    server 10.0.0.2:3000 weight=1;
    server 10.0.0.3:3000 weight=1;
}

server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://api_servers;    # forward to the upstream group
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Horizontal vs Vertical Scaling

These are the two ways to handle more traffic.

Vertical scaling (scale up) — get a bigger server. More CPU, more RAM, faster disk. Simple, but there’s a ceiling — we can’t infinitely upgrade one machine. And it’s still a single point of failure.

Horizontal scaling (scale out) — add more servers behind a load balancer. No theoretical ceiling. If we need more capacity, spin up another identical server. This is what load balancers enable.

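Most production systems use horizontal scaling. It’s more resilient (no single app server is a point of failure, although the load balancer itself then needs a redundant pair or a managed service) and more cost-effective at scale (many cheap servers vs one expensive one).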

Real-World Tools

  • Nginx / HAProxy — self-managed, run on our own servers
  • AWS ALB (Application Load Balancer) — managed Layer 7 load balancer, integrates with ECS/EKS
  • AWS NLB (Network Load Balancer) — managed Layer 4, ultra-low latency
  • Cloudflare Load Balancing — DNS-level load balancing with global health checks

In simple language, a load balancer is a traffic cop that spreads requests across multiple servers — it keeps our app fast, available, and resilient even when one server goes down.