Design a Rate Limiter - High-Level Design

A rate limiter controls how many requests a client can make in a given time window. Think of it like a bouncer at a club: only so many people get in per hour. If a client exceeds the limit, its request gets rejected with a 429 Too Many Requests.

Every major API has one. Without it, a single misbehaving client can take down our entire service.

Step 1: Requirements

Functional Requirements

Limit the number of requests a client can make within a time window
Support different rate limiting rules (e.g., 100 req/min for free tier, 1000 req/min for paid)
Return clear error responses when rate limited (429 status + Retry-After header)
Support multiple limiting strategies (by user ID, IP address, API key)

Non-Functional Requirements

Low latency — the rate limiter sits in the request path, so it must be fast (< 1ms overhead)
Distributed — must work across multiple servers (not just in-memory on one box)
Highly available — if the rate limiter goes down, we’d rather let all requests through than block everything
Accurate — in a distributed setup, we shouldn’t let significantly more requests through than the limit

Step 2: Estimation

Assumptions:

We’re protecting an API that handles 10,000 QPS across all clients
1 million unique clients (user IDs or API keys)
Each rate limit check should add < 1ms of latency

Storage:

Each client needs a counter + timestamp ≈ 20 bytes
1M clients × 20 bytes = 20 MB

That’s nothing. The entire state fits easily in Redis.

QPS on the rate limiter:

Every incoming request triggers a rate limit check.
10,000 QPS → 10,000 Redis operations/sec

A single Redis instance handles 100K+ operations/sec. We’re well within limits. For higher scale, we shard by client ID.
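
The arithmetic above can be double-checked with a few lines of Python. This is only a back-of-envelope sketch: the 20-byte record and the 100K ops/sec ceiling are the rough figures from this section, not measurements, and real Redis keys carry extra per-key overhead, so the storage number is a lower bound.

```python
# Back-of-envelope numbers taken from the assumptions in this section.
CLIENTS = 1_000_000          # unique clients
BYTES_PER_CLIENT = 20        # counter + timestamp (ignores Redis per-key overhead)
API_QPS = 10_000             # requests/sec across all clients
REDIS_OPS_CEILING = 100_000  # rough single-instance throughput assumed above

storage_mb = CLIENTS * BYTES_PER_CLIENT / 1_000_000
redis_ops_per_sec = API_QPS * 1          # one Redis operation per request
headroom = REDIS_OPS_CEILING / redis_ops_per_sec

print(f"storage: {storage_mb:.0f} MB")
print(f"Redis load: {redis_ops_per_sec} ops/sec, {headroom:.0f}x headroom")
```

Even if the real per-client footprint were several times larger, the state would still fit comfortably in memory on a single instance.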

Step 3: High-Level Design

Rate Limiter: Request Flow

Client Request → API Gateway / Load Balancer → Rate Limiter Middleware
    Allowed  → App Servers
    Rejected → 429 Too Many Requests

The Rate Limiter Middleware reads from: Redis (Counters), Rules Config (DB/YAML)

Rate limiter checks Redis on every request. Rules define limits per client/endpoint.

Where does the rate limiter live?

We have a few options:

API Gateway: Managed and self-hosted gateways (AWS API Gateway, Kong) have built-in rate limiting. Easiest to set up.
Middleware: A thin layer in our application code, before the request hits the business logic.
Sidecar: A separate process alongside our app (like Envoy proxy).

For most systems, putting it at the API Gateway level is the right call. It’s centralized, it rejects excess traffic before the request even reaches our servers, and most gateways support it out of the box.

Step 4: API Design

The rate limiter doesn’t have its own API per se — it’s middleware. But the response headers tell clients about their rate limit status:

# On every response (allowed or rejected):
X-RateLimit-Limit: 100        # max requests allowed in the window
X-RateLimit-Remaining: 42     # requests left in current window
X-RateLimit-Reset: 1735689600 # Unix timestamp when the window resets

# On rejection:
HTTP/1.1 429 Too Many Requests
Retry-After: 30               # seconds until client should retry
Content-Type: application/json

{ "error": "Rate limit exceeded. Try again in 30 seconds." }
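
To make the header contract concrete, here is a minimal sketch of how a middleware might build these headers and the rejection response. The function names and signatures are hypothetical, and the sketch assumes Retry-After is simply the time left until the window resets.

```python
import json

def rate_limit_headers(limit, remaining, reset_epoch):
    """Headers attached to every response (hypothetical helper)."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(remaining, 0)),
        "X-RateLimit-Reset": str(reset_epoch),
    }

def rejection_response(limit, reset_epoch, now_epoch):
    """Build (status, headers, body) for a rate-limited request."""
    # Assumption: the client should retry once the current window resets.
    retry_after = max(reset_epoch - now_epoch, 0)
    headers = rate_limit_headers(limit, 0, reset_epoch)
    headers["Retry-After"] = str(retry_after)
    headers["Content-Type"] = "application/json"
    body = json.dumps(
        {"error": f"Rate limit exceeded. Try again in {retry_after} seconds."}
    )
    return 429, headers, body

status, headers, body = rejection_response(100, 1735689600, 1735689570)
print(status, headers["Retry-After"], body)
```

Keeping the header logic in one place means allowed and rejected responses always report the same limit and reset time.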

For configuring rate limit rules, we’d have an internal admin API:

POST /admin/rate-rules
Body: {
  "client_type": "free_tier",
  "endpoint": "/api/v1/search",
  "max_requests": 100,
  "window_seconds": 60
}

Step 5: Data Model

Rate limiters usually don’t use a traditional database. Redis is the go-to because it’s in-memory and supports atomic operations.

# Redis key pattern for counters
rate_limit:{client_id}:{endpoint}:{window}

# Example: user 42, search endpoint, per-minute window
rate_limit:user_42:/api/search:202603301430

# Value: current request count (integer)
# TTL: set to expire when the window ends
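
As a rough illustration of the key pattern, the sketch below builds the per-minute key and the TTL that makes it expire when the window ends. The helper name is hypothetical, and it assumes windows are aligned to UTC clock minutes.

```python
from datetime import datetime, timezone

def counter_key(client_id, endpoint, now):
    """Return (key, ttl_seconds) for the per-minute window containing `now`."""
    window = now.strftime("%Y%m%d%H%M")   # minute-resolution window id
    ttl = 60 - now.second                  # seconds left until the minute ends
    return f"rate_limit:{client_id}:{endpoint}:{window}", ttl

now = datetime(2026, 3, 30, 14, 30, 15, tzinfo=timezone.utc)
key, ttl = counter_key("user_42", "/api/search", now)
print(key, ttl)
```

Because the window id is part of the key, a new minute automatically starts a fresh counter, and the TTL only exists to clean up old keys.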

For the rules configuration:

# Rate limit rules (stored in DB or config file)
rules:
  - client_type: "free_tier"
    limits:
      - endpoint: "/api/v1/*"
        max_requests: 100
        window: 60        # seconds

  - client_type: "paid_tier"
    limits:
      - endpoint: "/api/v1/*"
        max_requests: 1000
        window: 60

  - client_type: "default"
    limits:
      - endpoint: "*"
        max_requests: 50
        window: 60
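
One way these rules could be resolved at request time is sketched below. The matching semantics are assumptions, since the config above does not define them: endpoints are treated as glob patterns, and a client type with no matching limit falls back to the "default" rule. The function name is hypothetical.

```python
from fnmatch import fnmatchcase

# The same rules as above, as they might look after loading the config.
RULES = [
    {"client_type": "free_tier",
     "limits": [{"endpoint": "/api/v1/*", "max_requests": 100, "window": 60}]},
    {"client_type": "paid_tier",
     "limits": [{"endpoint": "/api/v1/*", "max_requests": 1000, "window": 60}]},
    {"client_type": "default",
     "limits": [{"endpoint": "*", "max_requests": 50, "window": 60}]},
]

def find_limit(client_type, endpoint, rules=RULES):
    """Return the first matching limit, trying the client's tier, then 'default'."""
    for wanted in (client_type, "default"):
        for rule in rules:
            if rule["client_type"] != wanted:
                continue
            for limit in rule["limits"]:
                if fnmatchcase(endpoint, limit["endpoint"]):
                    return limit
    return None

print(find_limit("free_tier", "/api/v1/search"))
print(find_limit("free_tier", "/health"))
```

The second lookup falls through to the default rule because "/health" does not match the free tier's "/api/v1/*" pattern.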

Step 6: Deep Dives

Deep Dive 1: Rate Limiting Algorithms

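There are four main algorithms. Each has different tradeoffs. We walk through three of them here; the fourth and simplest, Fixed Window Counter, shows up in Deep Dive 2.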

1. Token Bucket

Think of a bucket that holds tokens. It starts full. Each request takes one token. The bucket refills at a steady rate. If the bucket is empty, the request is rejected.

Bucket: max 10 tokens, refill 1 token/second

Time 0:  [10 tokens] → Request → [9 tokens] ✓
Time 0:  [9 tokens]  → Request → [8 tokens] ✓
...
Time 0:  [1 token]   → Request → [0 tokens] ✓
Time 0:  [0 tokens]  → Request → REJECTED    ✗
Time 1:  [1 token]   → Request → [0 tokens] ✓  (refilled)

Pros: Allows short bursts (up to bucket size). Smooth over time. Amazon and Stripe use this.
Cons: Two parameters to tune (bucket size + refill rate).
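
The trace above can be reproduced with a small in-memory sketch. This is a single-process illustration, not a distributed implementation; the class name is hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """In-memory token bucket; `now` is passed in as seconds."""

    def __init__(self, capacity, refill_per_sec, now=0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)    # the bucket starts full
        self.last_refill = now

    def allow(self, now):
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=10, refill_per_sec=1)
burst = [bucket.allow(now=0) for _ in range(11)]   # 11 requests at time 0
after_refill = bucket.allow(now=1)                  # one token has refilled
print(burst, after_refill)
```

Refilling lazily on each call, instead of with a timer, is a common way to avoid a background job per bucket.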

2. Sliding Window Log

We store the timestamp of every request. To check the limit, we count how many timestamps fall within the last N seconds.

Window: 3 requests per 60 seconds

Timestamps: [1:00:15, 1:00:30, 1:00:45]
New request at 1:00:50 → 3 requests in window → REJECTED
New request at 1:01:20 → remove 1:00:15 (expired) → 2 in window → ALLOWED

Pros: Very accurate, with no boundary issues (a client cannot squeeze extra requests through at the edge of a fixed window).
Cons: Stores every timestamp, so memory grows with the limit. Not great for high limits.
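
Here is a minimal in-memory sketch of the log approach that replays the example above, with timestamps expressed as seconds after 1:00:00. The class name is hypothetical, and this version does not record rejected requests, which is one of several reasonable choices.

```python
from collections import deque

class SlidingWindowLog:
    """Keeps one timestamp per accepted request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow(self, now):
        # Drop timestamps that are no longer inside (now - window, now].
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True

log = SlidingWindowLog(limit=3, window_seconds=60)
# 1:00:15, 1:00:30, 1:00:45, 1:00:50, 1:01:20 as seconds after 1:00:00
results = [log.allow(t) for t in (15, 30, 45, 50, 80)]
print(results)
```

In Redis, the same idea is usually built on a sorted set keyed by client, with the timestamp as the score.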

3. Sliding Window Counter

A clever hybrid. We keep counters for the current and previous windows, then calculate a weighted count based on where we are in the current window.

Previous window (1:00-1:01): 84 requests
Current window  (1:01-1:02): 36 requests
We're 25% into the current window.

Weighted count = 84 × 0.75 + 36 = 99
Limit = 100 → ALLOWED (barely!)

Pros: Memory-efficient (just two counters). Smooth enough for most use cases.
Cons: Not perfectly accurate. It’s an approximation that assumes requests in the previous window were spread evenly, but it’s close enough.
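
The weighted-count formula is small enough to write out directly. The function names below are hypothetical; `elapsed_fraction` is how far we are into the current window (0.25 in the example above).

```python
def weighted_count(prev_count, curr_count, elapsed_fraction):
    """Estimate requests in the sliding window from two fixed-window counters."""
    # The previous window contributes only the part that still overlaps.
    return prev_count * (1 - elapsed_fraction) + curr_count

def allow(prev_count, curr_count, elapsed_fraction, limit):
    """Allow the new request only if the estimate is still below the limit."""
    return weighted_count(prev_count, curr_count, elapsed_fraction) < limit

estimate = weighted_count(84, 36, 0.25)   # 84 * 0.75 + 36
print(estimate, allow(84, 36, 0.25, limit=100))
```

One more request in the current window (37 instead of 36) would bring the estimate to the limit and the next request would be rejected.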

The winner for most systems: Token Bucket or Sliding Window Counter. Token Bucket if we want to allow bursts. Sliding Window Counter if we want simplicity.

Deep Dive 2: Distributed Rate Limiting with Redis

Here’s the tricky part. Our API has multiple servers behind a load balancer. If we just count requests in memory on each server, a client can send 100 requests to Server A and 100 to Server B, effectively getting 200 when the limit is 100.

We need a centralized counter, and Redis is perfect for this.

The basic approach with the Fixed Window Counter algorithm:

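# Fixed-window check for each incoming request (Python, redis-py client):
import time
import redis

r = redis.Redis()                  # assumes Redis on localhost:6379

def is_allowed(client_id, limit):
    current_minute = int(time.time() // 60)
    key = f"rate:{client_id}:{current_minute}"
    count = r.incr(key)            # INCR: atomic increment in Redis
    if count == 1:
        r.expire(key, 60)          # EXPIRE: set TTL on first request
    return count <= limit          # False means reject with 429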

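But there’s a subtle atomicity problem. The INCR and EXPIRE are two separate commands. If our server crashes between them, the key never gets a TTL. With the minute baked into the key, the stale counter simply lingers in Redis and leaks memory. With a key that omits the window (e.g., rate:{client_id}), it’s worse: the counter never resets and the client is blocked forever once it passes the limit.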

The fix: use a Lua script to make it atomic.

-- Atomic rate limit check in Redis (Lua script)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('INCR', key)
if current == 1 then
    redis.call('EXPIRE', key, window)
end

if current > limit then
    return 0  -- rejected
else
    return 1  -- allowed
end

Lua scripts in Redis execute atomically — no other command can run in between. Problem solved.

For Token Bucket in Redis, we store the last refill timestamp and current token count. Each request runs a Lua script that calculates how many tokens to add since the last refill, subtracts one, and returns allow/deny.

Deep Dive 3: Client Identification and Rules

How do we identify who’s making the request? We have several options, and the right choice depends on our use case:

API Key — Most common for authenticated APIs. Each key maps to a tier with specific limits.
User ID — For logged-in users. Tied to their account regardless of IP changes.
IP Address — For unauthenticated requests. Problem: many users behind the same NAT/VPN share one IP.
Combination — Use API key when available, fall back to IP for unauthenticated requests.
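
The combination strategy might look like the sketch below. The function name and the prefixes are hypothetical, and placing user ID between API key and IP in the fallback order is an assumption; the text above only specifies API key first and IP as the fallback.

```python
def client_identifier(api_key=None, user_id=None, ip=None):
    """Pick the most specific identifier available for rate limiting."""
    # Prefixes keep an API key and an IP with the same text from colliding.
    if api_key:
        return f"key:{api_key}"
    if user_id:
        return f"user:{user_id}"
    if ip:
        return f"ip:{ip}"
    raise ValueError("request has no usable identifier")

print(client_identifier(api_key="abc123", ip="203.0.113.7"))
print(client_identifier(ip="203.0.113.7"))
```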

Granularity matters too. We can rate limit at different levels:

Global:     1M requests/min across all clients
Per-client: 100 requests/min per API key
Per-endpoint: 10 requests/min on POST /api/v1/upload per client

Layering these gives us defense in depth. A client might be within their per-endpoint limit but hitting the global limit, and we’d still throttle them.

Rules engine: We load rules from a config file or database and cache them in memory on each server. A background job refreshes the cache every few seconds so we can update limits without redeploying.
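
A rough sketch of the in-memory rules cache follows. To keep the example single-threaded, it swaps the background job for a lazy refresh that reloads on read once the cached copy is older than a TTL. The class name, the `loader` callable, and the 5-second TTL are all hypothetical.

```python
class RulesCache:
    """Caches rules in memory and reloads them once they are `ttl` seconds old."""

    def __init__(self, loader, ttl=5.0):
        self.loader = loader      # hypothetical callable that reads the DB or config file
        self.ttl = ttl
        self.rules = None
        self.loaded_at = None

    def get(self, now):
        # Reload on first use, or when the cached copy has gone stale.
        if self.loaded_at is None or now - self.loaded_at >= self.ttl:
            self.rules = self.loader()
            self.loaded_at = now
        return self.rules

# Stand-in loader that returns a new version on each reload.
versions = iter([{"version": 1}, {"version": 2}])
cache = RulesCache(loader=lambda: next(versions), ttl=5.0)
first = cache.get(now=0.0)
still_cached = cache.get(now=3.0)
refreshed = cache.get(now=5.0)
print(first, still_cached, refreshed)
```

Either refresh style keeps rule lookups off the request's critical path, since requests only ever read the in-memory copy.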

Step 7: Scaling

Redis scaling:

A single Redis instance handles our 10K QPS easily
For higher scale, shard by client ID using Redis Cluster
Use Redis replicas for high availability. If the primary goes down, a replica is promoted to take over (automatic failover needs Redis Sentinel or Redis Cluster)
If Redis is completely down, fail open (let all requests through). Briefly no rate limiting is better than blocking all requests.
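
Failing open can be sketched as a thin wrapper around whatever check we run. The names below are hypothetical, and the sketch assumes the check raises Python's built-in ConnectionError or TimeoutError when the store is unreachable; with a real Redis client we would catch that client's own exception types instead.

```python
def check_fail_open(check, *args):
    """Run a rate limit check; if the store is unreachable, allow the request."""
    try:
        return check(*args)
    except (ConnectionError, TimeoutError):
        return True    # fail open: no rate limiting beats blocking everyone

# Stand-in checks for illustration only.
def healthy_check(client_id):
    return False       # store reachable, client is over its limit

def broken_check(client_id):
    raise ConnectionError("store unreachable")

print(check_fail_open(healthy_check, "user_42"))
print(check_fail_open(broken_check, "user_42"))
```

In practice we would also log or count every fail-open decision so that a Redis outage shows up in monitoring.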

Multi-region:

If our API is global, we need rate limiters in each region
Option 1: Each region has its own Redis → limits are per-region (simpler, but a client could get N times the limit across N regions)
Option 2: Sync counters across regions → accurate global limits but adds latency
Option 3: Use a global Redis with cross-region replication → eventual consistency means small over-counting

For most systems, per-region limiting is good enough. If we need global limits, we either accept the slight inaccuracy of eventual consistency (Option 3) or pay the extra latency of syncing counters across regions (Option 2).

Handling bursty traffic:

Token Bucket naturally handles bursts (that’s its superpower)
Set the burst size to 2-3x the per-second rate for reasonable burst allowance
Add a request queue for soft rate limiting — instead of rejecting immediately, hold the request for a short time and retry

Monitoring and alerting:

Track how many requests get rate limited (by client, endpoint, rule)
Alert if a large percentage of requests are being rejected — might mean our limits are too low
Dashboard showing top rate-limited clients — helps identify abuse patterns

In simple language, a rate limiter is a counter that says “nope, too many” when a client goes too fast. The hard parts are making it distributed (Redis + Lua scripts for atomicity) and choosing the right algorithm (Token Bucket for burst tolerance, Sliding Window Counter for simplicity). Keep it fast, keep it centralized, and always fail open.