Rate Limiting - High-Level Design

Rate limiting controls how many requests a client can make to our system in a given time period. Think of it like a bouncer at a club: only so many people can get in per hour, no matter how eager they are.

Every production system needs it. Without rate limiting, one misbehaving client (or an attacker) can overwhelm our servers and ruin the experience for everyone else.

Why We Need It

Prevent abuse — Stop DDoS attacks and brute-force login attempts
Protect resources — Our servers, databases, and third-party APIs have finite capacity
Fair usage — Make sure one noisy client doesn’t hog everything
Cost control — If we’re paying per API call to a downstream service, we don’t want runaway costs
Stability — Graceful degradation is better than a total crash

Rate Limiting Algorithms

There are five main algorithms. Each has tradeoffs.

1. Token Bucket

One of the most widely used algorithms. AWS API Gateway and Stripe both describe their throttling in terms of token buckets, and many other API gateways offer it.

Imagine a bucket that holds tokens. A refiller adds tokens at a fixed rate (say 10 per second). Each request takes one token. If the bucket is empty, the request is rejected. The bucket has a max capacity, so tokens don’t accumulate forever.

Token Bucket Algorithm

Refiller (10 tokens/sec)
        ↓
Bucket (max: 20): T T T T T T (6 tokens available)

Request arrives
        ↓
Has token? → Allow
Empty? → Reject (429)

Why it’s popular: It allows bursts (up to the bucket size) while maintaining a steady average rate. Simple to implement and memory efficient.
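The description above can be sketched in a few lines of Python. This is a minimal, single-process sketch, not a reference implementation: instead of a separate refiller thread it tops up the bucket lazily on each request (which gives the same token count), and the class name, the `clock` parameter, and the choice to start with a full bucket are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Token bucket that refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the sketch is testable
        self.tokens = float(capacity)   # assumption: the bucket starts full
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Lazy refill: add what accrued since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller would answer with 429
```

With `rate=10` and `capacity=20`, a client can burst 20 requests at once and then sustain 10 per second, which matches the numbers in the diagram.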

2. Leaky Bucket

Similar to token bucket, but requests go INTO a bucket (a FIFO queue) and leak out at a fixed rate. If the bucket overflows, new requests are dropped.

The key difference from token bucket: leaky bucket enforces a constant outflow rate. Token bucket lets bursts through to the backend. Leaky bucket queues them and drains them at a steady pace.

Good for: When we need a steady processing rate (for example, calling a third-party API that enforces a strict rate limit).
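A rough sketch of the queue-based variant, assuming a caller that periodically invokes `leak` with the time elapsed since the previous call. The class and method names are hypothetical, and fractional progress between calls is discarded to keep the example short.

```python
from collections import deque


class LeakyBucket:
    """FIFO queue with bounded size that drains at a fixed rate."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity    # max number of queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.queue = deque()

    def offer(self, request):
        """Enqueue a request; return False (dropped) if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def leak(self, elapsed_seconds):
        """Return the requests that drain during `elapsed_seconds`, oldest first."""
        # Simplification: any fractional request is rounded down and lost.
        n = min(len(self.queue), int(elapsed_seconds * self.leak_rate))
        return [self.queue.popleft() for _ in range(n)]
```

However bursty the arrivals are, whatever consumes the output of `leak` never sees more than `leak_rate` requests per second.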

3. Fixed Window Counter

Divide time into fixed windows (say 1-minute windows). Count requests in each window. If the count exceeds the limit, reject.

Window: 10:00-10:01 → 95 requests (limit: 100) ✓
Window: 10:01-10:02 → 45 requests (limit: 100) ✓

Problem: Burst at the boundary. If 90 requests come at 10:00:55 and 90 more at 10:01:05, that’s 180 requests in 10 seconds even though the limit is 100/minute. Both windows pass individually, but the actual rate is way over.
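The boundary problem is easy to reproduce. Below is a minimal in-memory sketch (the class name is hypothetical, timestamps are plain seconds, and old windows are never pruned) that replays the scenario above with t=55 and t=65 standing in for 10:00:55 and 10:01:05.

```python
from collections import defaultdict


class FixedWindowCounter:
    """Counts requests per fixed window; rejects once a window hits the limit."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # window index -> request count

    def allow(self, now):
        idx = int(now // self.window)   # which window `now` falls into
        if self.counts[idx] >= self.limit:
            return False
        self.counts[idx] += 1
        return True


fw = FixedWindowCounter(limit=100, window_seconds=60)
burst = sum(fw.allow(55) for _ in range(90)) + sum(fw.allow(65) for _ in range(90))
# All 180 requests are accepted, because each window only ever sees 90.
```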

4. Sliding Window Log

Keep a timestamped log of every request. When a new request arrives, remove entries older than the window size, then count what’s left.

Fixes the boundary burst problem. But it’s memory-hungry — we’re storing a timestamp for every single request.
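A minimal sketch of the log approach, using a deque of timestamps. One assumption to flag: this version only logs accepted requests, so rejected attempts do not count against the client. Some implementations log every attempt instead.

```python
from collections import deque


class SlidingWindowLog:
    """Keeps one timestamp per accepted request within the last window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # timestamps, oldest first

    def allow(self, now):
        # Evict timestamps that have fallen out of the window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True
```

Replaying the boundary scenario from the fixed window section (90 requests at t=55, 90 more at t=65, limit 100 per 60 seconds), this version accepts only 10 of the second batch. The memory cost is visible too: the deque holds up to `limit` timestamps per client.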

5. Sliding Window Counter

A hybrid of fixed window and sliding window log. We keep counts for the current and previous windows, then calculate a weighted count based on where we are in the current window.

Previous window: 84 requests
Current window (40% through): 36 requests
Weighted count: 84 × 0.6 + 36 = 86.4 → Under 100, allow!

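Best of both worlds: Smooth like sliding window log, memory-efficient like fixed window. The weighted count is an approximation, since it assumes the previous window’s requests were spread evenly, but it is close enough that many production rate limiters use it.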
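The weighted count in the example is simple enough to write down directly. The function names below are hypothetical; `elapsed_fraction` is how far we are into the current window (0.4 in the example).

```python
def weighted_count(prev_count, curr_count, elapsed_fraction):
    """Estimate how many requests fall inside the sliding window ending now.

    The previous window contributes only the share that still overlaps
    the sliding window, i.e. (1 - elapsed_fraction) of its count.
    """
    return prev_count * (1.0 - elapsed_fraction) + curr_count


def allow(prev_count, curr_count, elapsed_fraction, limit):
    """Allow the request while the estimate stays under the limit."""
    return weighted_count(prev_count, curr_count, elapsed_fraction) < limit


estimate = weighted_count(84, 36, 0.4)  # 84 * 0.6 + 36, about 86.4
```

Only two integers per client need to be stored, which is where the memory savings over the log approach come from.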

HTTP Headers

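When we rate limit, we should tell clients what’s going on. Commonly used headers (the X-RateLimit-* names are a widespread convention rather than a formal standard, while Retry-After is defined by the HTTP specification):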

Header	Meaning
`X-RateLimit-Limit`	Max requests allowed in the window
`X-RateLimit-Remaining`	How many requests the client has left
`X-RateLimit-Reset`	When the window resets (often a Unix timestamp; some APIs send the seconds remaining instead)
`Retry-After`	How long to wait before trying again, in seconds or as an HTTP date (sent with 429)

The HTTP status code for “you’ve been rate limited” is 429 Too Many Requests.
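As an illustration, here is one way a server might assemble these headers. The function name and its parameters are assumptions for this sketch, and it assumes `X-RateLimit-Reset` carries a Unix timestamp and `Retry-After` carries seconds.

```python
import math


def rate_limit_response(allowed, limit, remaining, reset_epoch, now_epoch):
    """Return (status_code, headers) for a request the limiter has judged."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if allowed:
        return 200, headers
    # Rejected: tell the client how many whole seconds to wait (rounded up).
    headers["Retry-After"] = str(max(0, math.ceil(reset_epoch - now_epoch)))
    return 429, headers


status, headers = rate_limit_response(False, 100, 0, 1700000060, 1700000012.5)
```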

Where to Implement

API Gateway level — The most common place. Rate limit before requests even hit our services. Tools like Kong, NGINX, and AWS API Gateway have this built in.

Middleware level — Inside the application. More flexible — we can rate limit based on user tier, endpoint, or any custom logic.

Per-service level — Each microservice protects itself. Useful when different services have different capacity limits.

In practice, a common setup is rate limiting at the API gateway, with additional per-service limits for extra protection.
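To make the middleware option concrete, here is a small sketch of how an application might pick a limit from the user tier and endpoint. Every tier name, path, and number below is made up for illustration.

```python
# Hypothetical limits, in requests per minute.
TIER_LIMITS = {"free": 60, "pro": 600}
ENDPOINT_OVERRIDES = {("free", "/search"): 10}  # expensive endpoint, tighter cap
DEFAULT_LIMIT = 30


def resolve_limit(tier, endpoint):
    """Most specific rule wins: (tier, endpoint) override, then tier, then default."""
    if (tier, endpoint) in ENDPOINT_OVERRIDES:
        return ENDPOINT_OVERRIDES[(tier, endpoint)]
    return TIER_LIMITS.get(tier, DEFAULT_LIMIT)
```

The resolved number would then be handed to whichever algorithm the middleware uses to count requests.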

Client-Side vs Server-Side

Server-side rate limiting is what we’ve been discussing — the server decides when to reject requests. This is the important one. We never trust the client.

Client-side rate limiting is when the client self-throttles (e.g., backing off after receiving a 429). It’s a nice optimization — reduces wasted requests — but it’s not a security mechanism. A malicious client can just ignore it.

Good API clients implement exponential backoff: wait 1s, then 2s, then 4s, then 8s between retries. Adding a little random jitter to each wait keeps many clients from retrying in lockstep, which helps prevent a stampede when the server recovers.
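The retry schedule can be computed with a short helper. This is a sketch under assumptions: the function name, the `cap` on the maximum delay, and the optional `jitter` flag (which randomizes each delay so that many clients do not retry at the same instant) are choices made for this example.

```python
import random


def backoff_delays(retries, base=1.0, cap=60.0, jitter=False, rng=random.random):
    """Return the wait (in seconds) before each retry: base, 2*base, 4*base, ..."""
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * 2 ** attempt)  # double each time, up to the cap
        if jitter:
            delay = rng() * delay              # pick uniformly in [0, delay)
        delays.append(delay)
    return delays


schedule = backoff_delays(4)  # [1.0, 2.0, 4.0, 8.0]
```

If the server sends `Retry-After`, a client would normally honor that value rather than its own schedule.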

Rate Limiting in Distributed Systems

When we have multiple servers, we need a shared counter. Each server can’t just keep its own count — a client could hit different servers and bypass the limit.

The solution: use a centralized store like Redis. Redis is fast enough for this (single-digit millisecond lookups), and commands like INCR and EXPIRE make implementing rate limiters straightforward.

INCR user:123:rate_limit       → 1 (first request in this window)
EXPIRE user:123:rate_limit 60  → run only when INCR returned 1, so the key expires 60s after the window began
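One way to wire the two commands together is a fixed window counter per user. The sketch below is written against any client object that exposes `incr(key)` and `expire(key, seconds)`, which is how the redis-py client names these commands. So that it runs without a server, it includes `InMemoryStore`, a hypothetical stand-in that mimics only those two calls. The key format follows the example above, and `allow_request` is a name chosen for illustration.

```python
class InMemoryStore:
    """Hypothetical stand-in offering incr/expire, for running without Redis."""

    def __init__(self, clock):
        self.clock = clock
        self.data = {}  # key -> [count, expires_at or None]

    def incr(self, key):
        entry = self.data.get(key)
        expired = entry is not None and entry[1] is not None and entry[1] <= self.clock()
        if entry is None or expired:
            entry = [0, None]          # missing or expired keys start from zero
            self.data[key] = entry
        entry[0] += 1
        return entry[0]

    def expire(self, key, seconds):
        if key not in self.data:
            return False
        self.data[key][1] = self.clock() + seconds
        return True


def allow_request(store, user_id, limit, window_seconds):
    """Fixed window counter: INCR on every request, EXPIRE only on the first."""
    key = f"user:{user_id}:rate_limit"
    count = store.incr(key)
    if count == 1:
        # Start the countdown once; calling EXPIRE on every request would
        # keep pushing the expiry back and the window would never reset.
        store.expire(key, window_seconds)
    return count <= limit
```

Against a real Redis, the INCR and EXPIRE pair is not atomic, so a crash between the two calls could leave a counter with no expiry. Wrapping both in a Lua script or a transaction is a common remedy.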

In short, rate limiting is putting a speed limit on our API. Without it, one bad actor can crash the party for everyone. Token bucket is the go-to algorithm: it’s simple, allows bursts, and is easy to implement with Redis. Always return proper HTTP headers so good clients know where they stand.