High-Level Design

All 45 notes on one page

Foundations & Approach

1

What Is System Design?

beginner 0-2 YOE system-design interviews fundamentals

System design is the process of defining the architecture, components, and data flow of a system to meet specific requirements. Think of it like being an architect for software — we decide what pieces we need, how they talk to each other, and how the whole thing holds up when millions of people use it.

Why Do Companies Ask System Design Questions?

Coding interviews test if we can write correct code. System design interviews test if we can think big.

At 3+ years of experience, companies expect us to do more than just write functions. They want to know:

  • Can we break a vague problem into concrete pieces?
  • Can we think about trade-offs (there’s no perfect answer)?
  • Can we communicate our thought process clearly?
  • Do we understand how real-world systems actually work?

The only difference between a coding interview and a system design interview is this: coding has a “right answer,” system design does not. It’s a conversation, not a test.

What Interviewers Actually Look For

Here’s what most people get wrong — interviewers are NOT looking for the “perfect architecture.” They’re evaluating:

  1. Communication — Can we explain our thinking clearly?
  2. Problem breakdown — Can we go from vague to specific?
  3. Trade-off awareness — Do we know why we picked X over Y?
  4. Breadth of knowledge — Do we know what tools exist?
  5. Depth when needed — Can we dive deep into one area?

A candidate who designs a simpler system but explains trade-offs well will beat someone who draws a complex diagram but can’t explain why.

The Journey: Single Server to Distributed System

From Simple to Scaled
Stage 1: Single server does everything
  └── App + DB + Files all on one machine
        ↓ traffic grows...
Stage 2: Separate the database
  └── App Server ←→ Database Server
        ↓ more traffic...
Stage 3: Add a load balancer + multiple app servers
  └── LB → [Server 1, Server 2, Server 3] → DB
        ↓ even more traffic...
Stage 4: Add caching, CDN, message queues, DB replicas
  └── The full distributed system

This is basically what system design is about — understanding when and why we move from one stage to the next. We don’t need a distributed system for 100 users. But we do need to know how to build one for 100 million.

When Does System Design Matter?

  • Junior roles (0-2 YOE): Rarely asked. Focus on coding.
  • Mid-level (3-5 YOE): Common. Expected to know the building blocks.
  • Senior+ (5+ YOE): Critical. Expected to lead design discussions and make architectural decisions.

Even if we’re not interviewing, understanding system design makes us better engineers. We start seeing why things are built the way they are.

Key Takeaway

System design isn’t about memorizing architectures. It’s about understanding building blocks (load balancers, caches, queues, databases) and knowing when to use what. The rest of this collection covers exactly that — one building block at a time.


2

How to Approach System Design

beginner 0-2 YOE system-design interviews framework approach

The biggest mistake in system design interviews? Jumping straight to drawing boxes. We need a framework — a repeatable process that works for any problem, from “Design a URL Shortener” to “Design Netflix.”

The 5-Step Framework

45-Minute System Design Framework
Step 1 (5 min): Requirements & Scope
  └── Ask questions. Narrow the problem down.
Step 2 (5 min): Back-of-the-Envelope Estimation
  └── QPS, storage, bandwidth. Ballpark numbers.
Step 3 (15 min): High-Level Design
  └── Draw the core components. APIs. Data flow.
Step 4 (15 min): Deep Dive
  └── Pick 2-3 areas. Go deep. Discuss trade-offs.
Step 5 (5 min): Wrap Up & Improvements
  └── Bottlenecks, scaling, monitoring, future work.

Let’s walk through each step.

Step 1: Requirements & Scope (5 min)

Never start designing without asking questions first. The interviewer gives us a vague prompt on purpose — they want to see if we can narrow it down.

Ask about:

  • Users: How many? What kind?
  • Features: Which ones are core? Which can we skip?
  • Scale: Is this for 1,000 users or 1 billion?
  • Constraints: Latency requirements? Availability needs?

For “Design Twitter,” we might say: “Let’s focus on the tweet feed, posting tweets, and following users. We’ll target 200M daily active users.”

Step 2: Back-of-the-Envelope Estimation (5 min)

Quick math to understand the scale. We’re not aiming for exact numbers — just the right order of magnitude.

  • QPS (Queries Per Second): How many requests per second?
  • Storage: How much data do we store per day/year?
  • Bandwidth: How much data flows in and out?

This tells us whether we need one server or a thousand. We cover this in detail in a separate note.

Step 3: High-Level Design (15 min)

This is where we draw the architecture. Start simple and add complexity only when needed.

  • Draw the core components (clients, servers, databases, caches)
  • Define the API endpoints (what calls what)
  • Show the data flow (how a request travels through the system)
  • Pick the database type (SQL vs NoSQL, and why)

Keep it simple. We can always add more later.

Step 4: Deep Dive (15 min)

The interviewer will say something like “Let’s dig into X” or we can proactively pick 2-3 areas. This is where we show depth.

Common deep-dive areas:

  • Database schema and indexing
  • Caching strategy
  • How we handle failures
  • Data partitioning / sharding
  • Consistency vs availability trade-offs

This is the most important step. It’s where trade-off discussions happen.

Step 5: Wrap Up (5 min)

Summarize and show we’re thinking about the future:

  • Bottlenecks: What could break first?
  • Monitoring: How would we know something is wrong?
  • Scaling: What changes at 10x or 100x current scale?
  • Nice-to-haves: Features we scoped out but could add later

Common Mistakes

  1. Jumping to the solution — Not asking questions first. This is the #1 red flag.
  2. Over-engineering — Designing for Google scale when the problem says “small startup.”
  3. Silence — Not talking while thinking. The interviewer can’t read our mind.
  4. No trade-offs — Every decision has pros and cons. Always mention them.
  5. Ignoring the interviewer — If they’re nudging us somewhere, follow the hint.

The Golden Rule

In simple language, system design interviews are a conversation. We talk, the interviewer responds, we adjust. It’s collaborative, not adversarial. The framework gives us structure, but we should stay flexible and follow where the discussion goes.


3

Requirements Gathering

beginner 0-2 YOE system-design requirements functional non-functional scalability

Every system design problem starts with the same question: what exactly are we building? The interviewer gives us a vague prompt like “Design a URL shortener” — it’s our job to turn that into a clear list of requirements before drawing a single box.

Two Types of Requirements

Functional vs Non-Functional Requirements
Functional (WHAT)
What the system does
─────────────────
• Shorten a long URL
• Redirect short URL to original
• Custom short links
• Analytics / click tracking
• Link expiration
Non-Functional (HOW WELL)
How well the system does it
─────────────────
• Low latency (<100ms redirect)
• High availability (99.9%)
• Scalable (billions of URLs)
• Durable (links don't disappear)
• Eventual consistency is OK

Functional requirements describe features — what the user can do. Non-functional requirements describe quality — how well it works under the hood.

Key Non-Functional Requirements

These come up in almost every system design interview. We need to know them cold:

Scalability — Can the system handle growth? From 1,000 to 1,000,000 users without a rewrite.

Availability — Is the system up when users need it? Measured in “nines” — 99.9% means ~8.7 hours of downtime per year, 99.99% means ~52 minutes.

Consistency — Does every user see the same data at the same time? Strong consistency means yes. Eventual consistency means “give it a second.”

Latency — How fast does the system respond? A search engine needs <200ms. A batch report can take minutes.

Durability — Once we store data, does it stay stored? We can never lose a user’s photo or a bank transaction.

How to Ask the Right Questions

Don’t just sit there and guess. Ask questions like:

  • “How many users are we designing for?”
  • “What’s the read-to-write ratio?”
  • “Do we need real-time or is a few seconds of delay OK?”
  • “What’s more important — consistency or availability?”
  • “Should we handle this feature, or can we scope it out?”

The interviewer expects us to ask. Not asking is a red flag.

Example: URL Shortener Requirements

Let’s say the interviewer says “Design a URL shortener like bit.ly.” Here’s how we’d gather requirements:

Questions we’d ask:

  • How many URLs shortened per day? → “100 million”
  • Read-heavy or write-heavy? → “100:1 read-to-write ratio”
  • Do short URLs expire? → “Optional, default no expiry”
  • Custom short URLs? → “Nice to have, not core”
  • Analytics? → “Not in scope for now”

Functional requirements (after discussion):

  1. Given a long URL, generate a short URL
  2. Given a short URL, redirect to the original
  3. Short URLs should be as short as possible

Non-functional requirements:

  1. Very low latency for redirects (<100ms)
  2. High availability — links must always work
  3. Shortened URLs should not be predictable (security)

Now we have a clear scope. We know what to build and what quality bar to hit. Everything after this — estimation, design, deep dives — flows from these requirements.

The Trade-Off Question

Here’s a pro tip: non-functional requirements often conflict with each other. We can’t have perfect consistency AND perfect availability (that’s the CAP theorem, which we’ll cover later). When we state our requirements, we should also state which one wins when they clash.

For a URL shortener: “Availability is more important than consistency. If a link takes a few seconds to propagate globally, that’s fine. But a link should never be down.”

In simple language, requirements gathering is about turning a vague question into a concrete plan. It takes 5 minutes in an interview but sets the direction for everything that follows. Skip it, and we’re building blind.


4

Back-of-the-Envelope Estimation

beginner 2-4 YOE system-design estimation QPS storage bandwidth

Back-of-the-envelope estimation is quick, rough math to figure out the scale of our system. We’re not trying to be exact — we just need to know if we’re dealing with thousands or billions, megabytes or petabytes. That changes everything about our design.

Why Estimation Matters

If our system gets 10 requests per second, a single server is fine. If it gets 100,000 requests per second, we need load balancers, caching, sharding, and a whole different architecture. Estimation tells us which world we’re in.

Numbers Every Engineer Should Know

WhatHow Much
L1 cache reference0.5 ns
L2 cache reference7 ns
RAM reference100 ns
SSD random read150 μs
HDD seek10 ms
Round trip within same datacenter0.5 ms
Round trip CA to Netherlands150 ms

And for storage:

UnitBytesPractical Example
1 KB1,000A short email
1 MB1,000,000A high-res photo
1 GB1,000,000,000A movie
1 TB10^121,000 movies
1 PB10^151 million movies

Handy shortcut: There are about 86,400 seconds in a day. For quick math, round to ~100,000 (10^5). A month is about 2.5 million seconds.

The Four Key Calculations

1. QPS (Queries Per Second)

QPS = Daily Active Users × Queries per User / 86,400

Peak QPS = QPS × 2  (or ×3 for spiky traffic)

Example: 10 million DAU, each user makes 5 requests/day.

QPS = 10M × 5 / 86,400 ≈ 580 QPS
Peak QPS ≈ 1,160 QPS

2. Storage

Storage = Daily New Records × Record Size × Retention Period

Example: 100M new URLs per day, each URL record is 500 bytes, keep for 5 years.

Daily  = 100M × 500 bytes = 50 GB/day
Yearly = 50 GB × 365 ≈ 18 TB/year
5 years = ~90 TB total

3. Bandwidth

Incoming = QPS × Request Size
Outgoing = QPS × Response Size

4. Memory for Cache

We usually cache the hot data — the most frequently accessed items. A common rule of thumb is to cache 20% of daily requests (the 80/20 rule: 20% of data handles 80% of traffic).

Cache Memory = Daily Requests × 0.2 × Average Response Size

Full Example: Estimate Twitter’s Storage

Let’s say Twitter has:

  • 300M monthly active users, 50% are daily → 150M DAU
  • Each user posts 2 tweets/day on average
  • Each tweet: 140 chars (~280 bytes) + metadata (~200 bytes) = ~500 bytes
  • 10% of tweets have a photo (~500 KB average)

Tweet storage per day:

Text:  150M × 2 × 500 bytes = 150 GB/day
Photos: 150M × 2 × 0.10 × 500 KB = 15 TB/day

Per year:

Text:   ~55 TB/year
Photos: ~5.5 PB/year

That tells us we need a serious storage strategy — object storage (like S3) for media, and sharded databases for tweet metadata.

Powers of Two — Quick Reference

PowerValueSize
2^101,024~1 Thousand (1 KB)
2^20~1 Million~1 MB
2^30~1 Billion~1 GB
2^40~1 Trillion~1 TB

Tips for the Interview

  1. State assumptions clearly. “I’m assuming 100M DAU” — the interviewer can correct us.
  2. Round aggressively. Use 10^5 instead of 86,400. Nobody expects exact math.
  3. Focus on order of magnitude. The difference between 50 TB and 90 TB doesn’t change our design. The difference between 50 GB and 50 TB does.
  4. Don’t spend more than 5 minutes. Estimation supports the design, it’s not the main event.

In simple language, back-of-the-envelope estimation is about getting a feel for the scale. Are we building a bicycle or a spaceship? The math takes 5 minutes but saves us from designing something wildly over- or under-engineered.


5

Design Principles and Trade-offs

beginner 0-2 YOE system-design trade-offs scalability availability CAP

System design is all about trade-offs. There’s no perfect system — every choice we make comes with a cost. The key is understanding what we’re gaining and what we’re giving up with each decision.

Key Design Principles

Scalability

The ability to handle more load by adding resources. Two flavors:

  • Vertical scaling (scale up): Bigger machine. More RAM, faster CPU. Simple but has a ceiling.
  • Horizontal scaling (scale out): More machines. Harder to implement but virtually unlimited.

Think of it like a restaurant. Vertical scaling = get a bigger kitchen. Horizontal scaling = open more locations.

Availability

The system is up and working when users need it. Measured in “nines”:

AvailabilityDowntime/YearDowntime/Month
99% (two 9s)3.65 days7.3 hours
99.9% (three 9s)8.76 hours43.8 min
99.99% (four 9s)52.6 min4.38 min
99.999% (five 9s)5.26 min26.3 sec

Reliability

The system does what it’s supposed to do correctly. A system can be available (it’s responding) but unreliable (it’s giving wrong answers). We need both.

Performance

How fast the system responds. Two key metrics:

  • Latency — Time to handle a single request (usually p50, p95, p99)
  • Throughput — How many requests we handle per second

Maintainability

Can other engineers (or future us) understand and modify this system? Simple designs beat clever ones.

The Big Trade-offs

Common Trade-off Spectrums
Consistency ◄━━━━━━━━━━━► Availability
  Banking, inventory               Social media, DNS
Low Latency ◄━━━━━━━━━━━► High Throughput
  Gaming, trading                Batch processing, analytics
Simplicity ◄━━━━━━━━━━━► Performance
  Monolith, single DB            Microservices, sharding
Cost ◄━━━━━━━━━━━━━━━► Performance
  Single region                  Multi-region, redundancy

Consistency vs Availability

This is the famous CAP theorem in disguise. In simple language: when a network issue happens, we have to choose — do we give users potentially stale data (availability) or do we tell them “try again later” (consistency)?

  • Bank account: Must be consistent. We can’t show the wrong balance.
  • Social media feed: Availability wins. If a like takes 2 seconds to show up globally, nobody cares.

Latency vs Throughput

We can make one request super fast (low latency) or handle a massive number of requests (high throughput), but optimizing for one often hurts the other. Batching requests improves throughput but increases latency for individual requests.

Single Point of Failure (SPOF)

A SPOF is any component whose failure takes down the entire system. Every part of our design should have a backup plan:

  • One server? Add another behind a load balancer.
  • One database? Add a replica.
  • One data center? Deploy across multiple regions.
  • One load balancer? Use active-passive failover.

The rule is simple: if it can fail, assume it will. Then plan for it.

Stateless vs Stateful Services

Stateful services store information about the current session (like “this user is logged in”). If that server goes down, the state is lost.

Stateless services don’t remember anything between requests. Every request carries all the information needed (like a JWT token). Any server can handle any request.

Stateless services are much easier to scale — we just add more servers behind a load balancer and it just works. That’s why we push state to external stores (databases, Redis, sessions stores) and keep our application servers stateless.

Key Takeaway

In simple language, there’s no “best” architecture. There’s only the right architecture for our specific requirements. A system design interview is our chance to show we understand these trade-offs and can make informed decisions — not just draw boxes on a whiteboard.


Core Building Blocks

6

DNS and How the Internet Works

beginner 0-2 YOE system-design DNS networking internet

Before we can design any system, we need to understand how a request gets from a user’s browser to our server. It all starts with DNS — the phone book of the internet.

What Happens When We Type a URL?

From URL to Response
1. Browser checks its DNS cache
   ↓ not found?
2. OS checks its DNS cache
   ↓ not found?
3. Query goes to DNS Resolver (ISP)
   ↓ not found?
4. Resolver asks Root → TLD (.com) → Authoritative DNS
   ↓ got the IP!
5. Browser opens TCP connection (+ TLS handshake for HTTPS)
   ↓
6. Browser sends HTTP request → Server responds
   ↓
7. Browser renders the page

In simple language, DNS translates human-readable names (like google.com) into IP addresses (like 142.250.80.46) that computers understand. Without DNS, we’d have to memorize IP addresses for every website.

How DNS Resolution Works

DNS has a hierarchy, like asking directions from more and more knowledgeable people:

  1. DNS Resolver (our ISP or something like 8.8.8.8) — The starting point. It does the legwork.
  2. Root Name Server — Knows which servers handle .com, .org, .io, etc. There are only 13 root server clusters worldwide.
  3. TLD Name Server — Handles a specific top-level domain (like all .com domains). Points to the authoritative server.
  4. Authoritative Name Server — The final answer. This server actually knows what IP google.com maps to.

The result gets cached at every level (browser, OS, resolver) with a TTL (Time To Live). That’s why DNS changes take time to propagate — old cached entries have to expire first.

DNS Record Types We Should Know

Record TypeWhat It DoesExample
AMaps domain to IPv4 addresspman47.cc → 144.24.126.230
AAAAMaps domain to IPv6 addresspman47.cc → 2001:0db8::1
CNAMEAlias for another domainwww.pman47.cc → pman47.cc
NSDelegates to a name serverpman47.cc → ns1.hostinger.com
MXMail server for the domainpman47.cc → mail.pman47.cc

Why DNS Matters in System Design

DNS isn’t just “it resolves names.” In system design, DNS is a powerful tool:

Load distribution — DNS can return different IPs for the same domain, spreading traffic across multiple servers (DNS round-robin).

Geo-routing — DNS can return the IP of the server closest to the user. A user in India gets routed to the Mumbai server, while a user in the US hits the Virginia server.

Failover — If a server goes down, DNS can stop returning its IP. Health checks detect the failure, and DNS automatically routes traffic to healthy servers.

CDN routing — Services like CloudFront and Cloudflare use DNS to route users to the nearest edge server.

DNS in Our System Designs

When we’re drawing system design diagrams, DNS is usually the very first step:

User → DNS → Load Balancer → Application Servers → Database

We don’t usually deep-dive into DNS in interviews unless asked, but we should always mention it. It shows we understand the full picture — not just the backend.

Quick Gotchas

  • DNS propagation delay: Changing DNS records can take up to 48 hours because of caching at various levels. In practice, it’s usually faster.
  • DNS is a SPOF (sort of): If our DNS provider goes down, nobody can reach us. That’s why companies like Netflix use multiple DNS providers.
  • TTL trade-off: Short TTL = faster failover but more DNS queries. Long TTL = fewer queries but slower to react to changes.

In simple language, DNS is the first thing that happens in any web request. It’s simple in concept but powerful in practice — it can do load balancing, geo-routing, and failover, all before a single HTTP request is made.


7

Load Balancers

beginner 0-2 YOE system-design load-balancer scaling availability

A load balancer is a device (or software) that sits in front of our servers and distributes incoming traffic across multiple servers. Think of it like a host at a restaurant — they don’t let everyone pile into one table; they spread diners evenly across the floor.

Why We Need Load Balancers

Without a load balancer, all traffic goes to one server. That’s bad for two reasons:

  1. Single point of failure — If that server crashes, the whole system is down.
  2. Limited capacity — One server can only handle so much traffic.

A load balancer solves both problems. If one server dies, traffic goes to the others. If traffic grows, we add more servers.

Where Load Balancers Sit

Load Balancer in Action
Users
↓ ↓ ↓ ↓ ↓
Load Balancer
Server 1 Server 2 Server 3
Database

We can place load balancers at multiple points:

  • Between users and web servers (most common)
  • Between web servers and application servers
  • Between application servers and databases

L4 vs L7 Load Balancing

L4 (Transport Layer) — Routes based on IP address and port. Fast but dumb. It doesn’t look at the content of the request. Think of it like sorting mail by zip code.

L7 (Application Layer) — Routes based on the actual content: URL path, HTTP headers, cookies. Slower but smarter. We can send /api/* requests to API servers and /static/* to file servers.

Most modern load balancers (NGINX, AWS ALB) operate at L7.

Load Balancing Algorithms

AlgorithmHow It WorksBest For
Round RobinRequests go to servers in order: 1, 2, 3, 1, 2, 3…Servers with equal capacity
Weighted Round RobinSame but some servers get more traffic (proportional to weight)Servers with different specs
Least ConnectionsSend to the server with fewest active connectionsVarying request durations
IP HashHash the client IP to pick a server (same client always goes to same server)When we need sticky sessions
RandomPick a server randomlySurprisingly effective at scale

Round Robin is the simplest and works great when all servers are identical and requests take roughly the same time.

Least Connections is better when some requests take much longer than others (like file uploads vs simple GETs).

IP Hash ensures the same client always hits the same server. Useful when we have session data on the server (though we should prefer stateless servers + external session stores).

Health Checks

Load balancers periodically ping each server to check if it’s alive. If a server stops responding, the load balancer removes it from the rotation. When it comes back, it gets added again.

LB → GET /health → Server 1: 200 OK  ✓ (keep in rotation)
LB → GET /health → Server 2: timeout ✗ (remove from rotation)
LB → GET /health → Server 3: 200 OK  ✓ (keep in rotation)

Load Balancer Redundancy

Wait — if the load balancer is a single point of failure, what do we do? We use two load balancers in an active-passive setup:

  • Active LB handles all traffic
  • Passive LB monitors the active one
  • If active goes down, passive takes over (using a floating IP or DNS failover)

Some setups use active-active, where both LBs handle traffic simultaneously.

  • NGINX — Software LB, very popular, can also be a reverse proxy and web server
  • HAProxy — High-performance software LB, used by GitHub and Stack Overflow
  • AWS ALB/NLB — Managed cloud LBs (ALB = L7, NLB = L4)
  • Caddy — Modern LB with automatic HTTPS

In simple language, a load balancer is like a traffic cop for our servers. It spreads the work evenly, routes around failures, and lets us add more servers whenever we need more capacity. It’s one of the first things we add when scaling beyond a single server.


8

Caching

beginner 0-2 YOE system-design caching Redis performance CDN

Caching is storing a copy of data in a faster location so we don’t have to fetch it from the slower source every time. Think of it like keeping a sticky note on our desk instead of walking to the filing cabinet every time we need a phone number.

It’s probably the single most impactful thing we can do for performance.

Where to Cache

Caching Layers (fastest → slowest)
Layer 1: Browser Cache
  └── Static assets (JS, CSS, images). No network call at all.
Layer 2: CDN Cache
  └── Cached at edge servers closest to the user.
Layer 3: Application Cache (Redis / Memcached)
  └── Hot data in memory. Way faster than hitting the DB.
Layer 4: Database Cache
  └── Query cache, buffer pool. The DB's own internal caching.

The closer the cache is to the user, the faster the response. A browser cache hit is instant. A CDN hit avoids the trip to our data center. A Redis hit avoids the slow database query.

Cache Hit vs Cache Miss

  • Cache hit — The data is in the cache. We return it immediately. Fast.
  • Cache miss — The data is NOT in the cache. We fetch from the source, store it in the cache, then return it. Slow (but next time it’ll be a hit).

The hit ratio tells us how effective our cache is. If 95% of requests are cache hits, our cache is doing great. Below 80%, we should rethink our strategy.

Cache Eviction Policies

The cache has limited memory. When it’s full and new data comes in, we have to kick something out. The question is: what do we evict?

PolicyHow It WorksUse When
LRU (Least Recently Used)Evict the item not accessed for the longest timeGeneral purpose — most common choice
LFU (Least Frequently Used)Evict the item accessed the fewest timesHot items should stay (e.g., trending content)
FIFO (First In, First Out)Evict the oldest itemSimple, order-based access patterns
TTL (Time To Live)Items expire after a fixed timeData that goes stale (API responses, sessions)

LRU is the default choice in most system design interviews. It’s simple and works well for most access patterns.

Cache Invalidation Strategies

The hardest problem with caching: keeping the cache in sync with the database. If the database changes but the cache still has the old value, users see stale data.

Cache-Aside (Lazy Loading)

The most common pattern. The application manages the cache directly:

  1. Read: Check cache first → miss → read from DB → write to cache → return
  2. Write: Write to DB → delete from cache (next read will re-populate it)

Pros: Only caches what’s actually requested. Cache failure doesn’t break the system. Cons: First request after a miss is slow. Potential for stale data between DB write and cache delete.

Write-Through

Every write goes to the cache AND the database at the same time.

Pros: Cache is always up to date. No stale data. Cons: Higher write latency (two writes per operation). Cache may fill with data that’s never read.

Write-Behind (Write-Back)

Write to the cache first, then asynchronously write to the database later.

Pros: Super fast writes. Great for write-heavy workloads. Cons: Risk of data loss if the cache crashes before persisting to DB.

When NOT to Cache

Caching isn’t always the answer:

  • Frequently changing data — If data changes every second, the cache is constantly stale.
  • Low-traffic data — If it’s rarely accessed, the cache miss rate is high and we’re wasting memory.
  • Write-heavy workloads — More writes than reads means the cache is constantly being invalidated.
  • Data that must be perfectly consistent — Like account balances. Stale cache = real problems.
  • Redis — In-memory key-value store. Supports data structures (lists, sets, sorted sets). Most popular choice.
  • Memcached — Simpler than Redis, pure key-value. Slightly faster for simple caching.
  • Varnish — HTTP reverse proxy cache. Great for caching entire HTTP responses.

Cache in System Design Interviews

When an interviewer asks “how would you improve performance?”, caching is almost always part of the answer. Common cache use cases:

  • Cache database query results to reduce DB load
  • Cache user session data for fast authentication
  • Cache API responses from third-party services
  • Cache computed results (like feed generation or recommendations)

In simple language, caching trades memory for speed. We store a copy of frequently accessed data in a fast place so we don’t keep hammering the slow place. It’s the difference between a 2ms response (Redis) and a 200ms response (database query with joins).


9

Content Delivery Networks (CDNs)

beginner 2-4 YOE system-design CDN caching performance latency

A CDN (Content Delivery Network) is a geographically distributed network of servers that delivers content from the location closest to the user. Think of it like having copies of a book in libraries across the country instead of one central warehouse — people get the book faster because they go to the nearest library.

How a CDN Works

CDN Serving Content from Nearest Edge
Origin Server (US-East)
Edge: Mumbai Edge: London Edge: Tokyo
User in India User in UK User in Japan

Here’s the flow:

  1. User in India requests image.png from our site
  2. DNS routes the request to the nearest CDN edge server (Mumbai)
  3. If the edge has it cached → return immediately (cache hit)
  4. If not → edge fetches from origin server, caches it, returns to user (cache miss)
  5. Next user in India requesting the same file gets it from Mumbai instantly

The first request is slow (has to go to origin). Every subsequent request from that region is fast.

Push CDN vs Pull CDN

Pull CDN — The CDN fetches content from our origin server on demand (when someone requests it). Content is cached after the first request and expires based on TTL.

  • We don’t manage what’s on the CDN
  • Works great for high-traffic sites (content gets cached quickly)
  • Simpler to set up

Push CDN — We explicitly upload content to the CDN. We decide what’s cached and when it’s updated.

  • Full control over what’s on the CDN
  • Better for content that changes infrequently (firmware files, large media)
  • More work to manage

Most websites use Pull CDN. It’s simpler and handles most use cases well.

What to Put on a CDN

Great for CDN:

  • Images, videos, audio files
  • CSS, JavaScript bundles
  • Fonts
  • Static HTML pages
  • Large downloadable files (PDFs, installers)

Not great for CDN:

  • Dynamic API responses (changes per user/request)
  • Private/personalized content
  • Real-time data (stock prices, live scores)
  • Very rarely accessed content (wastes edge storage)

CDN Invalidation

When we update content on our origin server, the CDN still has the old version cached. We need to invalidate it:

  1. TTL expiration — Set a Time To Live. After it expires, the edge fetches fresh content. Simple but has a delay.
  2. Explicit invalidation — Tell the CDN to purge specific files. Instant but costs money per request on some providers.
  3. Versioned URLs — Use style.v2.css instead of style.css. The new URL is a cache miss, so fresh content is fetched. This is the most reliable approach.
CDNKnown For
CloudflareFree tier, DDoS protection, global network
AWS CloudFrontDeep AWS integration, Lambda@Edge
AkamaiOldest CDN, massive enterprise network
FastlyReal-time purging, edge compute
Google Cloud CDNTight GCP integration

CDN in System Design

In interviews, mention CDN whenever the system serves static content to a global audience. It’s a quick win that dramatically reduces latency and server load.

A typical architecture mention:

User → CDN (static assets) → Load Balancer → App Server → DB

   Serves images, JS, CSS
   without hitting our servers

The performance impact is massive. Instead of a user in Tokyo making a round trip to our US server (150ms+), they hit a Tokyo edge server (5ms). For media-heavy sites like Instagram or Netflix, CDNs handle the vast majority of traffic — our origin servers barely break a sweat.

In simple language, a CDN puts copies of our content closer to users around the world. Less distance = less latency = happier users. For any system that serves files or media globally, a CDN is a no-brainer.


10

Message Queues

intermediate 2-4 YOE system-design message-queue Kafka RabbitMQ async

A message queue is a component that sits between two services and holds messages until the receiver is ready to process them. Think of it like a mailbox — the sender drops a letter in, and the receiver picks it up when they’re available. The sender doesn’t have to wait for the receiver to be home.

This is the foundation of asynchronous processing — doing work later instead of right now.

Why We Need Message Queues

Without a queue, services talk to each other synchronously. Service A calls Service B and waits. If B is slow or down, A is stuck.

With a queue:

  • A drops a message in the queue and moves on immediately
  • B picks it up whenever it’s ready
  • If B crashes, the message stays in the queue — nothing is lost

This gives us decoupling, resilience, and scalability.

The Core Pattern

Producer → Queue → Consumer
Producer [ msg3 | msg2 | msg1 ] Consumer
Producer sends messages at its own pace
Consumer processes messages at its own pace
Queue holds messages in between
  • Producer — The service that creates and sends messages
  • Queue — The buffer that holds messages
  • Consumer — The service that reads and processes messages

Point-to-Point vs Pub/Sub

Point-to-Point (Queue) — Each message is consumed by exactly one consumer. Once processed, the message is removed. Like a task queue where each task is done once.

Pub/Sub (Topic) — Each message can be consumed by multiple subscribers. The message stays available for all subscribers. Like a broadcast — everyone who’s listening gets the message.

PatternDeliveryUse Case
Point-to-PointOne consumer per messageTask queues, job processing
Pub/SubAll subscribers get every messageNotifications, event streaming, analytics

When to Use Message Queues

Decoupling services — The order service doesn’t need to know about the email service. It just publishes “order placed” and moves on. The email service subscribes and sends the confirmation.

Handling traffic spikes — During a flash sale, we get 100x the normal orders. The queue absorbs the spike. Workers process orders at a steady rate.

Retry and error handling — If processing fails, the message goes back to the queue. It’ll be retried instead of lost. We can even have a dead letter queue (DLQ) for messages that fail repeatedly.

Heavy async work — Sending emails, generating reports, processing images, encoding videos — none of these need to happen during the user’s request. Drop a message in the queue and respond to the user immediately.

Kafka

  • Distributed event streaming platform
  • Extremely high throughput (millions of messages/sec)
  • Messages are persisted to disk and retained for days/weeks
  • Consumers can replay messages from any point in time
  • Great for: event sourcing, log aggregation, real-time analytics

RabbitMQ

  • Traditional message broker
  • Supports complex routing (exchanges, bindings)
  • Messages are removed after consumption
  • Lower throughput than Kafka but more flexible routing
  • Great for: task queues, RPC patterns, complex routing

Amazon SQS

  • Fully managed queue service from AWS
  • No infrastructure to manage
  • Two flavors: Standard (at-least-once, unordered) and FIFO (exactly-once, ordered)
  • Great for: AWS-native apps, simple queuing needs

Quick Comparison

FeatureKafkaRabbitMQSQS
ThroughputVery highMediumMedium
Message retentionDays/weeksUntil consumedUp to 14 days
OrderingPer partitionPer queueFIFO variant only
ReplayYesNoNo
Managed optionConfluent CloudCloudAMQPAWS native

Message Queues in System Design

In interviews, bring up message queues whenever we have:

  • Work that doesn’t need to happen immediately
  • Services that should be independent of each other
  • Traffic that’s bursty or unpredictable
  • Operations that might fail and need retries

A common pattern in system design interviews:

User uploads video → API Server → Queue → Video Processing Workers → Store in S3

                   Return "processing..." to user immediately

In simple language, a message queue lets us say “I’ll deal with this later” instead of doing everything right now. It makes our systems more resilient, more scalable, and better at handling the unpredictable real world.


11

Proxies and Reverse Proxies

beginner 0-2 YOE system-design proxy reverse-proxy Nginx Caddy

A proxy is a server that sits between a client and another server, acting as an intermediary. There are two types, and they sit on opposite sides of the connection. Let’s clear up the difference once and for all.

Forward Proxy vs Reverse Proxy

Forward Proxy vs Reverse Proxy
Forward Proxy (sits in front of clients)
Client A Forward Proxy Internet Server
Server doesn't know about the client. Proxy hides the client.
Reverse Proxy (sits in front of servers)
Client Internet Reverse Proxy Server A
Client doesn't know about the server. Proxy hides the server.

The only difference is which side they protect:

  • Forward proxy sits in front of clients. The server doesn’t know who the real client is. Example: a VPN or corporate proxy.
  • Reverse proxy sits in front of servers. The client doesn’t know which server actually handled the request. Example: Nginx routing to backend servers.

In system design, we almost always talk about reverse proxies.

What a Reverse Proxy Does

A reverse proxy is incredibly useful. It handles a bunch of cross-cutting concerns so our application servers don’t have to:

SSL/TLS Termination

The reverse proxy handles HTTPS encryption and decryption. Traffic between the proxy and our internal servers can be plain HTTP (since it’s within our network). This offloads CPU-intensive crypto work from our app servers.

Load Balancing

Distribute requests across multiple backend servers. Most reverse proxies have load balancing built in (round robin, least connections, etc.).

Compression

Compress responses (gzip, brotli) before sending them to clients. Reduces bandwidth and speeds up page loads.

Caching

Cache static content (or even dynamic responses) and serve them directly without hitting the backend. Huge performance boost.

Request Routing

Route different URL paths to different backend services:

  • /api/* → API servers
  • /static/* → File server
  • / → Frontend server

This is how we can run multiple services behind a single domain.

Rate Limiting & Security

Block malicious traffic, limit requests per IP, add security headers — all at the proxy level before requests reach our application.

ToolKnown For
NginxIndustry standard. Extremely fast and widely used. Powers ~30% of the web.
CaddyModern, automatic HTTPS (auto-renews certificates). Great DX.
HAProxyHigh-performance, used by high-traffic sites like GitHub.
TraefikDocker/Kubernetes native. Auto-discovers services.
ApacheOld school but still widely used. mod_proxy module.

Forward Proxy Use Cases

While less common in system design interviews, forward proxies are still important:

  • Corporate networks — Control and monitor employee internet access
  • Bypassing restrictions — VPNs are forward proxies that let us access geo-blocked content
  • Anonymity — Hide our IP address from websites we visit
  • Caching for clients — A school might cache frequently accessed educational sites to save bandwidth

Reverse Proxy in System Design

In every system design interview, there’s a reverse proxy somewhere. It’s usually sitting right behind the DNS/load balancer:

User → DNS → Reverse Proxy / LB → App Servers → Database

Sometimes the reverse proxy IS the load balancer (Nginx and Caddy can do both). In cloud setups, managed load balancers (AWS ALB) take over this role.

When discussing our architecture, we can mention the reverse proxy handles:

  • SSL termination (so internal traffic is faster)
  • Static file serving (so app servers focus on business logic)
  • Request routing (so we can run multiple services behind one domain)

Proxy vs Load Balancer vs API Gateway

These terms overlap a lot. Here’s the quick distinction:

  • Reverse Proxy — General purpose: SSL, caching, compression, routing
  • Load Balancer — Specifically distributes traffic across servers
  • API Gateway — Like a reverse proxy but for APIs: auth, rate limiting, request transformation, analytics

Many tools (like Nginx, Caddy, Kong) can act as all three. In system design, we don’t need to be pedantic about the labels — just explain what the component does.

In simple language, a reverse proxy is the bouncer at the club. It stands at the door, handles the crowd, checks IDs (SSL), and directs people to the right area inside. Our application servers just focus on doing their job without worrying about all that.


Database Deep Dive

12

SQL vs NoSQL

beginner 0-2 YOE sql nosql database acid base postgresql mongodb redis

This is one of the most common system design questions. “Should we use SQL or NoSQL?” The answer is always “it depends” — but let’s learn what it depends on.

SQL (Relational Databases)

SQL databases store data in tables with rows and columns. Every row has the same structure. Think of it like a spreadsheet with strict rules — every column has a type, and we define relationships between tables.

Examples: PostgreSQL, MySQL, SQLite, SQL Server, Oracle

Key traits:

  • Fixed schema — we define the structure upfront
  • Relationships via foreign keys (joins)
  • ACID transactions (more on this later)
  • Great for structured, predictable data

NoSQL (Non-Relational Databases)

NoSQL means “Not Only SQL.” These databases break free from the rigid table structure. There are actually four types, and they’re quite different from each other.

SQL vs NoSQL at a Glance
SQL (Relational)
Tables with rows & columns
Fixed schema
ACID transactions
Vertical scaling (bigger machine)
Great for: complex queries, joins
NoSQL (Non-Relational)
Flexible data models
Dynamic schema
BASE (eventual consistency)
Horizontal scaling (more machines)
Great for: scale, unstructured data
Four Types of NoSQL
Key-Value
Redis, DynamoDB
Document
MongoDB, CouchDB
Columnar
Cassandra, HBase
Graph
Neo4j, ArangoDB

Key-Value Stores (Redis, DynamoDB)

The simplest NoSQL type. It’s literally a giant hash map — a key maps to a value. Super fast for lookups, but we can’t query by value. Think of it like a locker room: we know the locker number (key), we get the contents (value).

Use case: Caching, sessions, rate limiting, leaderboards.

Document Stores (MongoDB, CouchDB)

Data is stored as JSON-like documents. Each document can have a different structure. Think of it like filing cabinets where each folder can hold different papers.

Use case: Product catalogs, content management, user profiles — anything with varied structure.

Columnar Stores (Cassandra, HBase)

Data is stored by columns instead of rows. This is great when we need to read specific columns across millions of rows. Think of it like reading one column of a spreadsheet without loading all other columns.

Use case: Analytics, time-series data, write-heavy workloads.

Graph Databases (Neo4j, ArangoDB)

Data is stored as nodes and edges (relationships). When the relationships between data are the most important thing, graph databases shine. Think of it like a social network map.

Use case: Social networks, recommendation engines, fraud detection.

ACID vs BASE

SQL databases follow ACID — strong guarantees. NoSQL often follows BASE — looser guarantees in exchange for performance and availability.

  • ACID: Atomicity, Consistency, Isolation, Durability — everything is correct and reliable
  • BASE: Basically Available, Soft-state, Eventually consistent — the system stays available, and data will be consistent eventually (not immediately)

In simple language, ACID says “every transaction is perfect or it doesn’t happen.” BASE says “the system keeps running and catches up later.”

Decision Framework

Pick SQL when:

  • Data is structured and relationships matter (e-commerce orders, banking)
  • We need complex queries with joins
  • Data integrity is critical (financial transactions)
  • Schema is mostly known upfront

Pick NoSQL when:

  • Schema changes frequently or data is unstructured
  • We need massive horizontal scaling (millions of reads/writes per second)
  • Data is naturally hierarchical or document-shaped (JSON APIs)
  • Availability matters more than immediate consistency

Most real systems use both. PostgreSQL for core business data, Redis for caching, maybe Elasticsearch for search. The question isn’t SQL vs NoSQL — it’s which tool fits which part of our system.


13

Database Indexing

intermediate 2-4 YOE database indexing b-tree performance sql

Imagine we have a book with 1,000 pages and we need to find every mention of “caching.” We could read every single page (slow), or we could check the index at the back of the book (fast). Database indexes work exactly the same way.

What Is an Index?

An index is a separate data structure that the database maintains alongside our table. It stores a sorted reference to rows based on specific columns, so the database can jump directly to the right rows instead of scanning the entire table.

Without an index, a query like SELECT * FROM users WHERE email = 'a@b.com' has to check every row in the table. This is called a full table scan. With an index on email, the database finds the row almost instantly.

How Indexes Work

B-Tree Index (The Default)

Most databases use B-trees (balanced trees) as the default index type. A B-tree keeps data sorted and allows searches, insertions, and deletions in O(log n) time.

B-Tree Index Lookup
Root: [M]
↓          ↓
[D, H] [R, W]
↓ ↓ ↓     ↓ ↓ ↓
A,B,C E,F,G I,J,K N,O,P S,T,U X,Y,Z
Finding "F" = 3 steps (Root → [D,H] → leaf) instead of scanning 26 entries

Think of it like a phone book. We don’t read every name — we jump to the right letter section, then narrow down. That’s a B-tree.

Hash Index

A hash index uses a hash function to map keys to locations. It gives us O(1) lookups for exact matches, but it can’t handle range queries (WHERE age > 25). Only useful for equality checks.

Types of Indexes

Single-Column Index

The most basic kind. We index one column.

CREATE INDEX idx_users_email ON users(email);

Now WHERE email = 'a@b.com' is fast.

Composite Index (Multi-Column)

We can index multiple columns together. The order matters a lot.

CREATE INDEX idx_orders_user_date ON orders(user_id, created_at);

This index helps queries that filter by user_id, or by user_id AND created_at. But it does not help queries that filter only by created_at. Think of it like a phone book sorted by last name, then first name — we can’t look up just by first name.

This is called the leftmost prefix rule.

Covering Index

A covering index includes all the columns a query needs, so the database doesn’t even have to look at the actual table. It gets everything from the index itself.

-- If we frequently run this query:
SELECT email, name FROM users WHERE email = 'a@b.com';

-- This covering index handles it entirely:
CREATE INDEX idx_users_email_name ON users(email) INCLUDE (name);

Unique Index

Enforces uniqueness on a column. A primary key automatically creates a unique index.

CREATE UNIQUE INDEX idx_users_email_unique ON users(email);

The Trade-Off

Here’s the thing nobody tells us about indexes: they aren’t free.

Indexes speed up reads but slow down writes. Every time we INSERT, UPDATE, or DELETE a row, the database also has to update every index on that table. More indexes = more work on writes.

Indexes also consume disk space. A heavily indexed table might have indexes that are larger than the table itself.

When NOT to Index

  • Small tables — if the table has a few hundred rows, a full scan is fine
  • Write-heavy tables — if we’re inserting thousands of rows per second, indexes add overhead
  • Low-cardinality columns — a boolean column (is_active) has only two values; an index barely helps
  • Columns we never query on — if we never filter or sort by a column, don’t index it

Practical Tips

-- See how the database plans to run our query
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'a@b.com';

The EXPLAIN output tells us if the database is using an index (Index Scan) or scanning the whole table (Seq Scan). If we see a sequential scan on a large table with a WHERE clause, we probably need an index.

In simple language, an index is a cheat sheet the database keeps on the side. It makes finding data fast, but we pay for it on every write. The trick is indexing the right columns — the ones we actually search and filter on.


14

Database Replication

intermediate 2-4 YOE database replication master-slave availability distributed-systems

If our database server dies, all our data is gone. If traffic spikes, one server can’t handle all the reads. Replication solves both problems by keeping copies of our data on multiple servers.

What Is Replication?

Replication means keeping the same data on multiple database servers (called replicas or nodes). When we write data to one server, that change gets copied to the others.

The goals are simple:

  • High availability — if one server crashes, others take over
  • Read scaling — spread read queries across multiple servers
  • Low latency — place replicas closer to users in different regions

Single-Leader Replication (Master-Slave)

This is the most common setup. One node is the leader (master) and the rest are followers (replicas/slaves).

  • All writes go to the leader
  • The leader sends changes to all followers
  • Reads can go to any follower (or the leader)
Single-Leader Replication
Client
→ writes → ← reads ←
Leader
→ replicates →
Follower 1 (reads)
Follower 2 (reads)
Follower 3 (reads)

This is what PostgreSQL, MySQL, and MongoDB use by default. It’s simple and works well when reads far outnumber writes (which is true for most applications).

The downside? The leader is a single point of failure for writes. If it goes down, we need to promote a follower to become the new leader (failover).

Multi-Leader Replication

Instead of one leader, we have multiple leaders that all accept writes. Each leader replicates its changes to all others.

Use case: Multi-region deployments. We put a leader in each data center so users write to the nearest one.

The big problem is write conflicts. If two leaders update the same row at the same time, which one wins? We need conflict resolution strategies:

  • Last write wins — the latest timestamp wins (simple but can lose data)
  • Merge values — combine both changes (complex but preserves data)
  • Let the application decide — store both versions and let the app resolve it

In simple language, multi-leader replication is like having two people editing the same Google Doc at the same time. It works, but we need a plan for when their edits conflict.

Leaderless Replication (Dynamo-Style)

No leader at all. Any node can accept reads and writes. The client sends writes to multiple nodes at once. This is the approach Amazon DynamoDB and Apache Cassandra use.

It uses a concept called quorum. If we have n replicas:

  • Write to w nodes
  • Read from r nodes
  • As long as w + r > n, we’re guaranteed to read the latest value

For example, with 3 replicas: write to 2, read from 2. At least one of the nodes we read from will have the latest data.

Synchronous vs Asynchronous Replication

Synchronous: The leader waits until the follower confirms it saved the data before telling the client “write successful.” Guarantees the follower is up to date, but makes writes slower.

Asynchronous: The leader tells the client “done” immediately and replicates in the background. Faster writes, but the follower might be slightly behind.

Most systems use semi-synchronous — one follower is synchronous (so at least two nodes always have the latest data), and the rest are asynchronous.

Replication Lag

With asynchronous replication, followers might be seconds (or even minutes) behind the leader. This causes some weird problems:

Read-after-write inconsistency: A user writes a comment, then immediately reads the page — but the read goes to a follower that hasn’t received the write yet. The user’s own comment is missing.

Fix: Route reads for “own data” to the leader, or track the latest write timestamp and don’t read from a follower that’s too far behind.

Monotonic read inconsistency: A user reads from Follower 1 (which is up to date), then reads from Follower 2 (which is behind). It looks like data disappeared.

Fix: Ensure each user always reads from the same replica (sticky sessions).

Causality violations: User A writes “How’s the weather?” then User B replies “Sunny!” — but a follower receives B’s message before A’s, so the reply appears before the question.

In simple language, replication lag is the gap between what the leader knows and what the followers know. Most of the time it’s milliseconds, but under heavy load it can grow and cause confusing bugs.

When to Use What

StrategyBest ForWatch Out For
Single-leaderMost apps, read-heavy workloadsLeader is a write bottleneck
Multi-leaderMulti-region deploymentsWrite conflicts are hard
LeaderlessHigh availability, no single point of failureQuorum math, stale reads

15

Database Sharding

intermediate 4-7 YOE database sharding partitioning scaling distributed-systems

There comes a point when a single database server just can’t handle it all. The data is too big to fit on one disk, or the queries are too many for one CPU. That’s when we need sharding.

What Is Sharding?

Sharding is splitting our data across multiple database servers. Each server holds a subset of the data (called a shard or partition). Together, all shards hold the complete dataset.

Think of it like splitting a phone book into volumes: A-F goes to Server 1, G-M goes to Server 2, N-Z goes to Server 3. Each server only deals with its portion.

Horizontal vs Vertical Partitioning

Horizontal sharding (row-based): We split rows across servers. All servers have the same columns, but different rows. This is what people usually mean when they say “sharding.”

Vertical partitioning (column-based): We split columns across servers. One server holds user_id, name, email and another holds user_id, profile_pic, bio. This is more like splitting a big table into smaller, focused tables.

Most of the time, we’re talking about horizontal sharding.

Horizontal Sharding — Users Table
Application Layer
shard_key = user_id % 3
Shard 0
user_id: 3, 6, 9...
~33% of users
Shard 1
user_id: 1, 4, 7...
~33% of users
Shard 2
user_id: 2, 5, 8...
~33% of users

Sharding Strategies

Hash-Based Sharding

We run the shard key through a hash function: shard = hash(user_id) % number_of_shards. This distributes data evenly across shards.

Pros: Even distribution, simple. Cons: Range queries become expensive (we have to query all shards). Adding a new shard means rehashing everything. This is why consistent hashing exists (next topic).

Range-Based Sharding

We assign ranges to each shard. Users with IDs 1-1M go to Shard 1, 1M-2M go to Shard 2, and so on. Or we could shard by date: January data on one shard, February on another.

Pros: Range queries are efficient (all data in a range lives on one shard). Cons: Can create hotspots. If most new users sign up this month, the “current month” shard gets hammered while older shards sit idle.

Directory-Based Sharding

We maintain a lookup table that maps each key to its shard. A separate service says “user 42 lives on Shard 3.”

Pros: Flexible — we can move data between shards without changing the sharding logic. Cons: The lookup service becomes a single point of failure and a bottleneck.

Problems with Sharding

Sharding introduces real complexity. Here’s what we have to deal with:

Cross-Shard Joins

If a user is on Shard 1 and their orders are on Shard 2, doing a JOIN across shards is a nightmare. We either have to query both shards and merge in the application, or denormalize our data so related data lives on the same shard.

Resharding

When we add or remove shards, we have to redistribute data. With hash-based sharding, hash(key) % 3 gives completely different results than hash(key) % 4. So adding one shard means moving most of our data. This is why consistent hashing was invented — it minimizes data movement.

Hotspots

Even with good distribution, some keys get way more traffic. A celebrity’s profile on a social network gets millions of reads while a regular user gets a few. That one shard handling the celebrity becomes a hotspot.

Mitigation: Add a random suffix to hot keys to spread them across shards, or handle hot keys specially in the application.

Referential Integrity

Foreign key constraints don’t work across shards. We lose the database’s help in enforcing relationships and have to handle it in application code.

When Do We Actually Need Sharding?

Sharding is a last resort, not a first step. Before sharding, we should try:

  1. Vertical scaling — get a bigger machine (more RAM, faster SSD)
  2. Read replicas — offload reads to follower databases
  3. Caching — put Redis in front of the database
  4. Query optimization — add indexes, rewrite slow queries

If all of that isn’t enough and our database is still struggling, then it’s time to shard.

In simple language, sharding is like splitting a library into multiple buildings. Each building has a portion of the books. It lets us serve way more readers, but finding a book now requires knowing which building it’s in. And moving books between buildings is a headache.


16

Consistent Hashing

intermediate 4-7 YOE consistent-hashing distributed-systems hash-ring scaling

We just learned about hash-based sharding: shard = hash(key) % N. The problem? When we add or remove a server (N changes), almost every key maps to a different server. We have to move nearly all the data. Consistent hashing fixes this.

The Problem with Regular Hashing

Say we have 3 servers and we use hash(key) % 3 to decide where data goes.

hash("user:1") % 3 = 1  → Server 1
hash("user:2") % 3 = 0  → Server 0
hash("user:3") % 3 = 2  → Server 2

Now we add a 4th server. Everything becomes hash(key) % 4:

hash("user:1") % 4 = 1  → Server 1 (same)
hash("user:2") % 4 = 2  → Server 2 (MOVED!)
hash("user:3") % 4 = 3  → Server 3 (MOVED!)

Most keys end up on different servers. If these are cache servers, we just invalidated almost our entire cache. If these are database shards, we need to migrate a huge amount of data. With a million keys, roughly (N-1)/N of them move. That’s brutal.

How Consistent Hashing Works

The core idea: instead of hash % N, we arrange everything on a ring (a circle from 0 to some max value).

  1. Hash each server onto the ring (using the server’s name or IP)
  2. Hash each key onto the same ring
  3. Walk clockwise from the key’s position — the first server we hit owns that key
Hash Ring
A B C k1 k2 k3 clockwise ↻
k1 → clockwise → B
k2 → clockwise → C
k3 → clockwise → A

The magic: when we add a new server D to the ring, only the keys between D and the next server counter-clockwise need to move to D. Everything else stays put. Instead of moving ~75% of keys (with modular hashing), we move roughly 1/N of keys. Same thing when a server is removed — only its keys redistribute to the next server clockwise.

Virtual Nodes

There’s a catch with basic consistent hashing. If servers aren’t evenly spaced on the ring, some servers end up with way more keys than others. And with just 3-4 servers, the distribution is often terrible.

The fix is virtual nodes (vnodes). Instead of placing each server once on the ring, we place it many times at different positions.

Virtual Nodes on the Ring
Without virtual nodes (uneven):
A (50%)
B (15%)
C (35%)
With virtual nodes (balanced):
A1
C1
B1
A2
B2
C2
Each server gets ~33% of keys with virtual nodes

Server A gets virtual nodes A-1, A-2, A-3, ... A-100 spread across the ring. More virtual nodes = more even distribution. In practice, systems use 100-200 virtual nodes per physical server.

Another cool benefit: if a server is more powerful, we give it more virtual nodes so it handles more keys. Simple and elegant.

Where Consistent Hashing Is Used

It’s everywhere in distributed systems:

  • Distributed caches (Memcached, Redis Cluster) — deciding which cache server holds which key
  • CDNs (Akamai) — routing content to the nearest/appropriate edge server
  • Load balancers — sticky sessions where the same user always hits the same backend
  • Distributed databases (Cassandra, DynamoDB) — partitioning data across nodes
  • Distributed file systems — deciding which node stores which chunk of a file

The Key Insight

With regular hashing, adding or removing a server reshuffles almost everything. With consistent hashing, only K/N keys move on average (where K is total keys and N is total servers). That’s the difference between a catastrophe and a minor adjustment.

In simple language, consistent hashing is like arranging servers in a circle and letting each key find its server by walking clockwise. When a server joins or leaves, only its immediate neighbors are affected. Everyone else doesn’t even notice.


17

ACID and Transactions

beginner 0-2 YOE acid transactions database isolation-levels consistency

A transaction is a group of database operations that should succeed or fail as a unit. Transfer $100 from Account A to Account B? That’s two operations (debit A, credit B) — and we never want only one of them to happen.

ACID is the set of guarantees that make transactions reliable. Let’s break each one down.

The ACID Properties

ACID Properties
A — Atomicity
All or nothing. No partial changes.
C — Consistency
Data stays valid. Rules are never broken.
I — Isolation
Transactions don't interfere with each other.
D — Durability
Once committed, it survives crashes.

Atomicity — All or Nothing

Think of it like a light switch: it’s either ON or OFF. There’s no half-ON. If any operation in a transaction fails, all operations are rolled back. The database looks like the transaction never started.

Real-world analogy: We’re transferring money. Debit Account A by $100, credit Account B by $100. If the credit fails, the debit is undone. We never lose $100 into thin air.

BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
  UPDATE accounts SET balance = balance + 100 WHERE id = 'B';
COMMIT;
-- If anything fails between BEGIN and COMMIT, everything rolls back

Consistency — Rules Are Always Respected

After every transaction, the database must be in a valid state. All constraints, foreign keys, and rules we defined must hold.

Real-world analogy: If we have a rule that account balance can never go negative, the database will reject any transaction that violates it. We can’t transfer $500 from an account with $300.

The only difference from the other properties: consistency is partly the application’s responsibility. The database enforces constraints we define, but our application logic also needs to make sense.

Isolation — Transactions Don’t Peek at Each Other

Even when multiple transactions run at the same time, each one behaves as if it’s the only one running. One transaction’s uncommitted changes are invisible to other transactions.

Real-world analogy: Two people at different ATMs both checking the same account balance. Neither should see the other’s half-finished transaction.

This is the trickiest property. Full isolation is expensive, so databases offer different isolation levels (more on this below).

Durability — Committed Means Permanent

Once the database says “committed,” the data survives power outages, crashes, and panics. It’s written to non-volatile storage (disk), not just sitting in memory.

Real-world analogy: Once we get the receipt, the transaction is done. Even if the bank’s computers crash right after, our money is safe.

Databases achieve this by writing to a write-ahead log (WAL) before making changes. Even if the system crashes mid-write, the WAL can replay and recover.

Transaction Isolation Levels

Full isolation (Serializable) is safe but slow. So databases give us four levels to choose from, each trading safety for speed.

Read Uncommitted (Weakest)

A transaction can see another transaction’s uncommitted changes. This is almost never used because it allows dirty reads — we might read data that gets rolled back.

Read Committed

We can only see data that has been committed. No dirty reads. This is the default in PostgreSQL and most databases.

But we can still have a non-repeatable read: we read a row, another transaction modifies and commits it, and when we read it again within the same transaction, we get a different value.

Repeatable Read

Once we read a row, we’ll always see the same value for the rest of our transaction, even if another transaction changes it. No dirty reads, no non-repeatable reads.

But we can still get phantom reads: we run a query and get 10 rows. Another transaction inserts a new row that matches our query. If we run the same query again, we get 11 rows. The new row is a “phantom.”

Serializable (Strongest)

Transactions behave as if they ran one after another, in sequence. No dirty reads, no non-repeatable reads, no phantom reads. Complete safety.

The trade-off? It’s the slowest level. The database may need to lock rows, detect conflicts, or abort and retry transactions.

Read Anomalies Cheat Sheet

AnomalyWhat happensPrevented by
Dirty readRead uncommitted data that gets rolled backRead Committed+
Non-repeatable readSame row returns different values in same transactionRepeatable Read+
Phantom readNew rows appear in a repeated querySerializable

When ACID Matters vs When We Can Relax

ACID is critical for:

  • Financial systems (bank transfers, payments)
  • Inventory management (don’t oversell)
  • Booking systems (don’t double-book a seat)
  • Anything where wrong data = real money lost

We can relax for:

  • Social media feeds (seeing a post 2 seconds late is fine)
  • Analytics and reporting (slightly stale data is acceptable)
  • Caching layers (eventual consistency is OK)
  • Logging and metrics (losing one log line isn’t the end of the world)

This is the ACID vs BASE trade-off we mentioned in the SQL vs NoSQL topic. Many NoSQL databases choose availability and partition tolerance (BASE) over strict ACID, because for their use cases, it’s the right call.

In simple language, ACID is a promise from the database: “Your transaction will either happen completely and correctly, or not at all. And once it’s done, it’s done forever.” The isolation levels let us tune how strict that promise is, depending on how much speed we need.


Scalability Patterns

18

Horizontal vs Vertical Scaling

beginner 0-2 YOE scaling horizontal-scaling vertical-scaling scalability

Scaling is what we do when our system can’t handle the load anymore. Users are growing, requests are piling up, and the server starts sweating. We have exactly two options: make the machine bigger, or add more machines.

What Is Scaling?

Scaling means increasing our system’s capacity to handle more traffic, more data, or more users. Every system hits a ceiling at some point. The question is — how do we raise that ceiling?

Vertical Scaling (Scale Up)

Vertical scaling means upgrading our existing machine. More CPU, more RAM, bigger disk. We take our one server and make it beefier.

Think of it like replacing a small car with a truck. Same number of vehicles, just a more powerful one.

Pros:

  • Dead simple — no code changes needed
  • No distributed system complexity
  • Data consistency is easy (one machine = one source of truth)
  • Lower latency between components (everything is local)

Cons:

  • There’s a hard ceiling — even the biggest machine on AWS has limits
  • Single point of failure — if that one machine dies, everything dies
  • Expensive — high-end hardware gets disproportionately pricier
  • Downtime during upgrades (usually need to restart)

Horizontal Scaling (Scale Out)

Horizontal scaling means adding more machines to share the load. Instead of one beefy server, we run 10 smaller ones behind a load balancer.

Think of it like adding more cars to a delivery fleet instead of buying one mega-truck.

Pros:

  • No theoretical limit — just keep adding machines
  • Better fault tolerance — one machine dies, others keep running
  • Cost-effective — commodity hardware is cheap
  • Can scale on demand (add machines during peak, remove after)

Cons:

  • Distributed system complexity (network failures, data consistency)
  • Need a load balancer to distribute traffic
  • Session management gets tricky (which server has the user’s session?)
  • Data synchronization across machines is hard

Visual Comparison

Vertical vs Horizontal Scaling
Vertical (Scale Up)
Before:
Server
4 CPU / 8 GB RAM
After:
BIG Server
64 CPU / 256 GB RAM
Horizontal (Scale Out)
Before:
Server
4 CPU / 8 GB RAM
After:
S1
4C/8G
S2
4C/8G
S3
4C/8G

When to Use What?

ScenarioGo With
Small app, few usersVertical — keep it simple
Database serverOften vertical first (consistency matters)
Stateless web serversHorizontal — easy to add more
Sudden traffic spikesHorizontal — auto-scale with cloud
Need 99.99% uptimeHorizontal — redundancy is built-in

Why Most Large Systems Go Horizontal

Here’s the reality: vertical scaling buys us time, but horizontal scaling is what the big players use.

Netflix, Google, Amazon — they all run thousands of small machines, not one supercomputer. The reasons:

  1. No single point of failure — a server dying is expected, not catastrophic
  2. Linear cost scaling — 10 small machines cost less than 1 giant one
  3. Geographic distribution — we can place machines closer to users worldwide
  4. Cloud-native — modern cloud platforms are built for horizontal scaling

Real-World Examples

  • Instagram started on a single server. As they grew, they moved to horizontally scaled web servers + vertically scaled database servers (before eventually sharding the DB too).
  • Databases often scale vertically first because distributing data is harder than distributing stateless logic.
  • Kubernetes is essentially a tool for managing horizontal scaling — run more pods when traffic increases.

Key Takeaway

In simple language, vertical scaling is buying a bigger box, horizontal scaling is buying more boxes. Start vertical for simplicity, but design our code to be stateless so we can go horizontal when the time comes. Most production systems end up using a mix of both.


19

Microservices vs Monolith

intermediate 2-4 YOE microservices monolith architecture scalability

When we build a backend, we have a fundamental choice: put everything in one application (monolith) or split it into many small, independent services (microservices). Both are valid. The right choice depends on where we are, not where we want to be.

What Is a Monolith?

A monolith is a single application where all the code lives together. User auth, payments, notifications, search — everything is in one codebase, one deployment, one process.

Think of it like a restaurant where one chef does everything: takes orders, cooks, plates, and serves. Simple when it’s a small restaurant.

Advantages:

  • Easy to develop and debug — everything is in one place
  • Simple deployment — one build, one deploy
  • No network calls between components — just function calls
  • Easy to test end-to-end
  • Great for small teams (2-10 devs)

Disadvantages:

  • Gets messy as the codebase grows (the “big ball of mud”)
  • One bug can bring down the entire system
  • Scaling means scaling everything, even the parts that don’t need it
  • Slow CI/CD — small change requires full rebuild and redeploy
  • Hard for multiple teams to work on without stepping on each other

What Are Microservices?

Microservices split the application into small, independent services. Each service owns one piece of functionality, has its own database, and can be deployed independently.

Think of it like a food court — each stall specializes in one thing. The pizza stall doesn’t need to know how the sushi stall works.

Advantages:

  • Independent deployment — update one service without touching others
  • Scale individual services based on demand
  • Teams can own services end-to-end
  • Tech diversity — each service can use the best tool for its job
  • Fault isolation — one service crashing doesn’t kill everything

Disadvantages:

  • Distributed system complexity (network failures, latency)
  • Data consistency across services is genuinely hard
  • Debugging a request that spans 5 services is painful
  • More infrastructure to manage (service discovery, monitoring, logging)
  • Operational overhead — we need DevOps maturity

The Architecture Difference

Monolith vs Microservices
Monolith
Auth Module
Orders Module
Payments Module
Notifications
Single DB  |  Single Deploy
Microservices
Auth
own DB
Orders
own DB
Payments
own DB
Notifs
own DB
Independent DBs  |  Independent Deploys

Communication Between Services

When services are split, they need to talk to each other. Two main approaches:

Synchronous (HTTP/gRPC)

Service A calls Service B and waits for a response. Like a phone call.

Order Service --HTTP POST--> Payment Service
               <--200 OK---
  • Simple to understand and implement
  • But creates tight coupling — if Payment Service is down, Order Service is stuck
  • Cascading failures are a real risk

Asynchronous (Message Queues)

Service A puts a message on a queue and moves on. Service B picks it up whenever it’s ready. Like sending a text message.

Order Service --publish--> [Message Queue] --consume--> Payment Service
  • Services are decoupled — they don’t need to be online at the same time
  • Better fault tolerance — messages wait in the queue
  • But adds complexity (message ordering, duplicate handling, eventual consistency)

Most production systems use a mix of both. Synchronous for things that need an immediate response (login, checkout), asynchronous for things that can happen later (send email, generate report).

Service Discovery

In a monolith, calling another module is just a function call. In microservices, we need to know where the other service lives. That’s service discovery.

Two approaches:

  • Client-side discovery — the caller asks a registry (like Consul or Eureka) for the service address
  • Server-side discovery — a load balancer or API gateway handles routing (Kubernetes does this with its built-in DNS)

Kubernetes makes this almost free. Each service gets a DNS name like payment-service.default.svc.cluster.local, and K8s handles the rest.

The “Monolith First” Approach

Martin Fowler’s advice (and most experienced engineers agree): start with a monolith.

Why? Because:

  1. We don’t know our domain boundaries well enough at the start
  2. Splitting too early creates the wrong service boundaries
  3. Monoliths are faster to iterate on when we’re finding product-market fit
  4. We can always split later once we understand the pain points

The pattern looks like this:

  1. Start: Monolith with clean module boundaries
  2. Grow: Identify which modules cause the most scaling/deployment pain
  3. Extract: Pull those modules out into separate services one at a time
  4. Repeat: Keep extracting as needed

Amazon, Netflix, and Uber all started as monoliths before moving to microservices.

When to Go Microservices

Microservices make sense when:

  • Multiple teams need to deploy independently
  • Different parts of the system have wildly different scaling needs
  • We need fault isolation (one feature crashing can’t kill the whole app)
  • The codebase is so large that build times are painful
  • We need technology diversity (ML in Python, API in Go, etc.)

Key Takeaway

In simple language, a monolith is one big app, microservices are many small apps that talk to each other. Monoliths aren’t bad — they’re the right starting point. Microservices aren’t magic — they trade code complexity for operational complexity. Pick based on team size, scale needs, and how well we understand our domain.


20

API Gateway

intermediate 4-7 YOE api-gateway microservices routing rate-limiting

When we move from a monolith to microservices, our clients suddenly need to know about 10 different services with 10 different URLs. That’s a mess. An API Gateway solves this by giving clients one single door to walk through. Behind that door, it routes requests to the right service.

What Is an API Gateway?

An API Gateway is a server that sits between clients and our backend services. Every request goes through it first. It’s the receptionist of our microservices architecture — it knows where everything is and handles common tasks so individual services don’t have to.

What Does It Do?

An API Gateway handles a bunch of cross-cutting concerns:

  • Request Routing — Routes /api/users to the User Service, /api/orders to the Order Service
  • Authentication & Authorization — Validates JWT tokens, API keys, or OAuth before the request even reaches a service
  • Rate Limiting — Stops a single client from hammering our services (e.g., 100 requests/minute)
  • Load Balancing — Distributes requests across multiple instances of a service
  • Request/Response Transformation — Converts between protocols (REST to gRPC, XML to JSON)
  • Response Aggregation — Combines responses from multiple services into one response for the client
  • Caching — Caches frequent responses to reduce load on services
  • Logging & Monitoring — Centralized request logging, metrics, and tracing

Where It Sits in the Architecture

API Gateway in Microservices
Web App
Mobile App
3rd Party
↓     ↓     ↓
API Gateway
Auth | Routing | Rate Limiting | Logging
↓         ↓         ↓
User
Service
Order
Service
Payment
Service

Without a gateway, every client would need to know every service URL, handle auth separately, and manage its own retry logic. That’s a nightmare, especially for mobile apps.

API Gateway vs Load Balancer

This comes up a lot in interviews. They’re related but different:

FeatureLoad BalancerAPI Gateway
Primary jobDistribute traffic across instancesRoute requests to different services
LayerL4 (TCP) or L7 (HTTP)L7 (Application layer)
AuthNoYes
Rate limitingBasic (by IP)Advanced (by user, API key, plan)
Request transformationNoYes
Response aggregationNoYes
ScopeOne service, multiple instancesMultiple services

In simple language, a load balancer says “here are 5 copies of the same thing, pick one.” An API gateway says “what are we looking for? Let me take us to the right place.”

In practice, we often use both. The API gateway routes to the right service, and a load balancer in front of each service distributes traffic across its instances.

Response Aggregation — The Big Win

One of the most valuable things an API gateway does is aggregation. Say a mobile app’s home screen needs data from 3 services:

Without a gateway: the app makes 3 separate HTTP calls (slow on mobile, battery drain).

With a gateway: the app makes 1 call, the gateway fans out to 3 services internally, combines the results, and returns a single response. Way faster for the client.

Client → GET /api/home

Gateway internally:
  → GET /users/123      → User Service
  → GET /orders/recent  → Order Service
  → GET /recommendations → Rec Service

Gateway combines all three and returns one response.
  • Kong — Open-source, plugin-based, built on Nginx. Very popular.
  • AWS API Gateway — Fully managed, integrates with Lambda and other AWS services. Pay per request.
  • Nginx — Can act as a basic API gateway with reverse proxy config. Lightweight and fast.
  • Traefik — Cloud-native, auto-discovers services in Docker/Kubernetes.
  • Express Gateway — Node.js-based, good for teams already in the JS ecosystem.

For Kubernetes, an Ingress Controller (like Nginx Ingress or Traefik) often serves as the API gateway.

Potential Downsides

The API gateway is powerful, but it’s not free:

  • Single point of failure — If the gateway goes down, everything goes down. We need it highly available (multiple instances + health checks).
  • Added latency — Every request goes through an extra hop. Usually negligible, but it adds up.
  • Can become a bottleneck — All traffic funnels through it. We need to scale the gateway itself.
  • Complexity — More infrastructure to maintain, configure, and monitor.

Key Takeaway

An API Gateway is the front door to our microservices. It handles the boring-but-critical stuff (auth, routing, rate limiting, logging) so our services can focus on business logic. In a monolith we don’t need one. In a microservices architecture, we almost always do.


21

Denormalization and Read-Write Separation

intermediate 4-7 YOE denormalization CQRS read-replicas database scalability

In a normalized database, we avoid duplicate data. That’s great for writes and data integrity. But when our app grows and reads start dominating (most apps are 90% reads, 10% writes), all those JOINs across perfectly normalized tables start crushing performance. That’s when denormalization and read-write separation come in.

What Is Denormalization?

Denormalization is intentionally adding redundant data to our database to speed up reads. We’re trading storage space and write complexity for faster queries.

In a normalized database, we’d store a user’s name in the users table only. If we need it on a post, we JOIN. In a denormalized database, we store the user’s name directly on the post too.

Normalization vs Denormalization

AspectNormalizedDenormalized
Data duplicationNoneYes, by design
Read performanceSlower (JOINs)Faster (fewer/no JOINs)
Write performanceFaster (update one place)Slower (update multiple places)
Data consistencyEasy (single source of truth)Harder (must sync copies)
StorageLessMore
Best forWrite-heavy, small scaleRead-heavy, large scale

Example: Normalizing a User Feed

Say we’re building a social feed. Normalized approach:

-- Normalized: 3 tables, need JOINs to display a post
SELECT p.content, p.created_at, u.name, u.avatar_url,
       COUNT(l.id) as like_count
FROM posts p
JOIN users u ON p.user_id = u.id
LEFT JOIN likes l ON l.post_id = p.id
WHERE p.id = 123
GROUP BY p.id, u.id;

This works fine for 1,000 users. At 10 million users with millions of posts, this JOIN is painful. Denormalized approach:

-- Denormalized: everything we need is right on the posts table
SELECT content, created_at, author_name, author_avatar, like_count
FROM posts
WHERE id = 123;

One table, zero JOINs, blazing fast. The only difference is we duplicated author_name, author_avatar, and pre-computed like_count onto the posts table.

The trade-off? When a user changes their name, we need to update it on all their posts too. That’s the price of denormalization.

When to Denormalize

Denormalization makes sense when:

  • Read queries are way more frequent than writes
  • JOINs are becoming the bottleneck (check query plans)
  • We need very low latency reads (real-time feeds, dashboards)
  • The duplicated data changes infrequently (user names don’t change every minute)

Don’t denormalize when:

  • Our data changes frequently and consistency is critical (bank accounts)
  • We’re still small — premature optimization is the root of all evil
  • We haven’t profiled our queries to confirm JOINs are the actual bottleneck

Read-Write Separation (CQRS Lite)

The idea is simple: use separate database instances for reads and writes.

The primary (master) database handles all writes. One or more read replicas handle all reads. The primary replicates data to the replicas asynchronously.

Writes → Primary DB ──replication──→ Read Replica 1
                                  ──→ Read Replica 2
                                  ──→ Read Replica 3
Reads  → Load Balancer → [Replica 1, Replica 2, Replica 3]

This works because most apps are read-heavy. If 90% of our queries are reads, we just offload them to replicas and our primary DB breathes again.

Setting It Up

Most managed databases make this easy:

  • AWS RDS — Click “Create read replica” and we’re done
  • PostgreSQL — Streaming replication to standby servers
  • MySQL — Built-in master-slave replication

Our application code needs to know which database to talk to:

# Pseudocode
def get_user(user_id):
    db = read_replica()  # reads go to replica
    return db.query("SELECT * FROM users WHERE id = %s", user_id)

def update_user(user_id, data):
    db = primary()  # writes go to primary
    db.execute("UPDATE users SET name = %s WHERE id = %s", data.name, user_id)

The Replication Lag Problem

Here’s the catch: replication isn’t instant. There’s a small delay (usually milliseconds, sometimes seconds) between a write hitting the primary and showing up on replicas.

This means: a user updates their profile, refreshes the page, and sees the old data because the read hit a replica that hasn’t caught up yet.

Solutions:

  • Read-your-own-writes — After a write, route that user’s reads to the primary for a few seconds
  • Sticky sessions — Route a user to the same replica consistently
  • Accept it — For many use cases (feeds, dashboards), slight staleness is fine

Full CQRS (The Advanced Version)

CQRS (Command Query Responsibility Segregation) takes read-write separation further. Instead of just separate DB instances, we use entirely different data models for reads and writes.

  • Write model — Normalized, optimized for data integrity
  • Read model — Denormalized, optimized for query patterns (maybe even a different database like Elasticsearch)

Events sync data from the write model to the read model. This is powerful but complex — we only need full CQRS for systems with very different read and write patterns (think: analytics dashboards reading from a normalized transactional DB).

For most systems, simple read replicas are more than enough.

Key Takeaway

In simple language, denormalization is copying data so we read faster but write slower. Read-write separation is splitting our database so reads don’t compete with writes. Both are tools for scaling read-heavy systems. Start normalized, measure our bottlenecks, then denormalize strategically where it hurts most.


22

Blob Storage and Object Storage

intermediate 2-4 YOE object-storage blob-storage S3 CDN storage

Every app eventually needs to store files — profile pictures, uploaded documents, video content, log exports. We can’t just throw these into our database. That’s where object storage comes in. It’s purpose-built for storing massive amounts of unstructured data cheaply and reliably.

What Is Object Storage?

Object storage treats every file as a standalone object with three parts:

  1. Data — the actual file bytes (an image, a PDF, a video)
  2. Metadata — info about the file (content type, upload date, custom tags)
  3. Unique key — a flat identifier like users/123/avatar.jpg

There’s no folder hierarchy like a regular filesystem. The “folders” we see in S3 are just key prefixes — it’s all a flat namespace under the hood.

Three Types of Storage Compared

Storage Types
Block Storage
Raw disk volumes
Attached to one server
Low latency
Like a hard drive
AWS EBS, Azure Disk
File Storage
Shared filesystem
Multiple servers access
Hierarchical (folders)
Like a network drive
AWS EFS, NFS, GCP Filestore
Object Storage
Flat key-value blobs
Access via HTTP API
Virtually unlimited
Like a giant locker
AWS S3, GCS, Azure Blob

The key insight: block storage is for OS and databases (needs to be fast, attached to one machine). File storage is for shared access between servers. Object storage is for everything else — and that “everything else” is usually the bulk of our data.

When to Use Object Storage

Pretty much anytime we deal with files:

  • User uploads — profile pictures, documents, attachments
  • Media — images, videos, audio files
  • Static assets — CSS, JS, fonts for a website
  • Backups — database dumps, log archives
  • Data lakes — raw data for analytics pipelines
  • ML artifacts — model files, training datasets

The rule of thumb: if it’s a file and we don’t need to query inside it, object storage is the answer.

  • Amazon S3 — The OG. “S3” has basically become a generic term for object storage. 11 nines of durability (99.999999999%). Pretty wild.
  • Google Cloud Storage (GCS) — Google’s version. Very similar API. Tight integration with BigQuery and other GCP services.
  • Azure Blob Storage — Microsoft’s offering. Three tiers: Hot, Cool, and Archive based on access frequency.
  • MinIO — Open-source, S3-compatible. Great for self-hosting or local development.
  • Cloudflare R2 — No egress fees (a big deal — S3 egress costs add up fast).

Pre-Signed URLs — The Smart Pattern

Here’s a common mistake: uploading files through our backend server. The file goes Client —> Our Server —> S3. Our server becomes a bottleneck and wastes bandwidth.

The better approach is pre-signed URLs. Our server generates a temporary, signed URL that lets the client upload directly to S3.

1. Client asks our server: "I want to upload a file"
2. Server generates a pre-signed URL (valid for 15 minutes)
3. Client uploads directly to S3 using that URL
4. Client tells our server: "Upload done, here's the key"
5. Server saves the file reference in the database

Same pattern works for downloads. Instead of streaming the file through our server, we generate a pre-signed download URL and redirect the client to it.

# Generating a pre-signed upload URL (Python + boto3)
import boto3

s3 = boto3.client('s3')
url = s3.generate_presigned_url(
    'put_object',
    Params={'Bucket': 'my-bucket', 'Key': 'uploads/photo.jpg'},
    ExpiresIn=900  # 15 minutes
)
# Give this URL to the client — they can PUT directly to S3

Benefits:

  • Our server doesn’t touch the file data — no bandwidth or CPU wasted
  • Scales better — S3 handles the heavy lifting
  • Secure — URL expires, and we control who can generate them

CDN Integration

Object storage and CDNs are best friends. The pattern:

  1. Store files in S3 (or any object storage)
  2. Put a CDN (CloudFront, Cloudflare) in front of it
  3. Users download from the CDN edge server closest to them instead of hitting S3 directly
User in Tokyo → CloudFront Edge (Tokyo) → S3 (us-east-1)
                ↑ cached here after first request

This gives us:

  • Lower latency — files served from the nearest edge location
  • Lower cost — CDN egress is often cheaper than S3 egress, and caching reduces S3 requests
  • Less load on S3 — the CDN absorbs most of the traffic

For static websites, this combo (S3 + CloudFront) is the go-to architecture. Fast, cheap, and almost infinitely scalable.

Storage Classes and Costs

Most providers offer tiered storage for different access patterns:

  • Standard/Hot — Frequently accessed data. Highest storage cost, lowest retrieval cost.
  • Infrequent Access — Data accessed less than once a month. Cheaper storage, small retrieval fee.
  • Archive/Cold — Data rarely accessed (compliance, old backups). Very cheap storage, expensive and slow retrieval (hours).

Setting up lifecycle policies to automatically move old data to cheaper tiers is one of the easiest cost optimizations we can make.

Key Takeaway

In simple language, object storage is where we put files that our application needs but our database shouldn’t hold. Use pre-signed URLs so files go directly between the client and storage without touching our server. Pair it with a CDN for fast delivery worldwide. It’s one of the simplest and most impactful architectural decisions we’ll make.


Reliability & Consistency

23

CAP Theorem

intermediate 2-4 YOE CAP distributed-systems consistency availability partition-tolerance

The CAP theorem is one of the most important ideas in distributed systems. It says that when we build a system spread across multiple machines, we can only guarantee two out of three properties at the same time. Understanding this helps us make smart trade-offs when picking databases and designing architectures.

The Three Properties

Consistency (C) — Every read gets the most recent write. If we write “balance = $500,” every node in the system immediately returns $500. No stale data, ever.

Availability (A) — Every request gets a response (success or failure), even if some nodes are down. The system never says “sorry, I can’t respond right now.”

Partition Tolerance (P) — The system keeps working even when network communication between nodes breaks. Messages get dropped or delayed, and the system handles it.

Why We Can Only Pick Two

Think of it like this. We have two database nodes, Node A and Node B, that are supposed to stay in sync. Suddenly the network between them breaks (a partition).

Now a write comes in to Node A: “update balance to $500.”

We have two choices:

  1. Stay consistent (CP) — Node B rejects reads because it can’t confirm it has the latest data. We sacrifice availability.
  2. Stay available (AP) — Node B serves reads with its old data. We sacrifice consistency.

We can’t have both. If the network is broken, we either serve stale data (lose C) or refuse to respond (lose A). There’s no trick around this.

Partition Tolerance Is Not Optional

Here’s the thing most people miss: we always have to pick P. Network partitions happen in the real world. Cables get cut. Switches fail. Cloud regions have outages. If our system can’t handle a network partition, it’s not a real distributed system.

So the real choice is between CP and AP. It’s never “CA” in a distributed system.

CP vs AP Systems

CAP Triangle
Consistency (C)
╱         ╲
CP            CA
╱                 ╲
Availability (A) ──── Partition Tolerance (P)
AP
CP: MongoDB, HBase, Redis Cluster, Zookeeper
AP: Cassandra, DynamoDB, CouchDB, Riak
CA: Single-node PostgreSQL, MySQL (not distributed)

CP Systems (Consistency + Partition Tolerance)

These systems prioritize correctness. During a partition, they’d rather refuse some requests than serve stale data.

  • MongoDB — In a replica set, if the primary goes down, writes are refused until a new primary is elected
  • HBase — Strong consistency via a single region server per row
  • Redis (single node) — All reads/writes go to one node, so it’s always consistent
  • Zookeeper — Used for distributed coordination where correctness is everything

Best for: Banking, inventory systems, anything where serving wrong data is worse than being temporarily unavailable.

AP Systems (Availability + Partition Tolerance)

These systems prioritize uptime. During a partition, they keep serving requests even if the data might be slightly stale.

  • Cassandra — Writes succeed on any available node, syncs later
  • DynamoDB — Designed for “always available” at Amazon scale
  • CouchDB — Multi-master replication, resolves conflicts later

Best for: Social media feeds, shopping carts, analytics — where showing slightly old data is fine but downtime is not.

How to Decide: CP or AP?

Ask ourselves these questions:

  1. Is stale data dangerous? If a user sees an old bank balance and overdrafts, that’s bad → CP
  2. Is downtime unacceptable? If our e-commerce site goes down during Black Friday, we lose millions → AP
  3. Can we fix inconsistency later? If we can reconcile data after the partition heals → AP is fine

Most real-world systems aren’t purely CP or AP. They’re tunable. Cassandra, for example, lets us choose consistency level per query. We can read with QUORUM for important data (more consistent) and ONE for less critical data (more available).

Common Misconceptions

“CAP means we pick exactly two.” Not really. In normal operation (no partition), we get all three. CAP only forces a choice during a partition event.

“CA systems exist in distributed setups.” Nope. If we’re distributed, partitions will happen. CA only applies to single-node databases.

“CP means zero availability.” No. CP systems are available most of the time. They only sacrifice availability during actual partition events, which are (hopefully) rare.

Key Takeaway

In simple language, CAP theorem tells us that in a distributed system, when the network breaks, we have to choose: do we serve possibly-stale data (AP), or do we refuse requests until we’re sure the data is correct (CP)? Neither choice is wrong — it depends on what our system is doing. Banks pick consistency. Social media picks availability. The key is knowing which trade-off we’re making and why.


24

Consistency Models

intermediate 4-7 YOE consistency eventual-consistency strong-consistency distributed-systems

When we have data spread across multiple servers, the big question is: how “fresh” does the data need to be when someone reads it? The answer depends on the consistency model we choose. Different models give different guarantees, and there’s always a trade-off between correctness and performance.

The Consistency Spectrum

Consistency Spectrum
Strongest → weaker but faster → Weakest
Linearizable Sequential Causal Read-Your-Writes Eventual
← Higher latency, more coordination Lower latency, more available →

The further left we go, the more “correct” our reads are — but the slower and harder to scale. The further right, the faster and more available — but we might read stale data.

Strong Consistency (Linearizability)

Every read returns the most recent write. Period. If User A writes “balance = $500” and User B reads right after, User B sees $500. There’s no window where stale data is visible.

Think of it like a single database with one copy of the data. Even if there are multiple replicas behind the scenes, the system behaves as if there’s only one copy.

How it works: The system coordinates between all replicas before confirming a write. This takes time — every write waits for a majority of nodes to agree.

Examples:

  • Google Spanner (uses synchronized clocks across data centers)
  • CockroachDB
  • Single-node PostgreSQL (trivially strong — there’s only one copy)

Best for: Bank accounts, stock trading, inventory counts — anywhere wrong data causes real damage.

Trade-off: Higher latency on writes. More coordination overhead. Less availability during partitions.

Eventual Consistency

If we stop writing, eventually all replicas will converge to the same value. But right after a write, different replicas might return different values. There’s no guarantee on when they’ll catch up — could be milliseconds, could be seconds.

Think of it like updating a Google Doc. If two people edit at the same time, each person sees their own version briefly. Eventually it all syncs up.

Examples:

  • Amazon DynamoDB (default mode)
  • Cassandra (with consistency level ONE)
  • DNS propagation — we update a DNS record and it takes hours to propagate everywhere

Best for: Social media likes, view counts, recommendation feeds — where showing a count of 4,823 instead of 4,825 for a few seconds is perfectly fine.

Trade-off: Fast writes and high availability, but we might read stale data.

Causal Consistency

This one is clever. It preserves cause-and-effect ordering. If event A caused event B, everyone sees A before B. But unrelated events can appear in any order.

Example: I post a message “Hey, who wants pizza?” (event A), then someone replies “I do!” (event B). Causal consistency guarantees everyone sees my question before the reply. But two unrelated posts from different users can appear in any order.

Examples:

  • MongoDB (with causal sessions)
  • Some configurations of Cassandra

Best for: Comment threads, chat applications — where ordering within a conversation matters but global ordering doesn’t.

Read-Your-Writes Consistency

After we write something, we are guaranteed to see our own write. Other users might see stale data temporarily, but the person who made the change always sees it reflected.

This solves the classic problem: a user updates their profile name, refreshes the page, and sees the old name. Infuriating. With read-your-writes, that doesn’t happen.

How to implement it:

  • Route the user’s reads to the same replica that processed their write
  • After a write, read from the primary for a few seconds
  • Include a timestamp token — only accept reads from replicas that are at least that fresh

Best for: User profile updates, settings changes — anywhere the “I just changed this, why doesn’t it show?” experience is bad.

Monotonic Reads

Once we’ve read a value, we never see an older value on subsequent reads. Our reads only move forward in time, never backward.

Without this, a user might refresh a page and see their feed go backward — posts disappear and reappear. That happens when successive reads hit different replicas at different replication states.

How to implement: Use sticky sessions — route a user to the same replica consistently.

How Real Systems Handle This

Most databases let us tune consistency per operation. We don’t have to pick one model for everything.

SystemDefaultTunable?
PostgreSQLStrong (single node)N/A for single node
DynamoDBEventualYes — strongly consistent reads available
CassandraTunable per queryONE (eventual) to ALL (strong)
MongoDBStrong for primary readsRead preferences for replicas
CockroachDBSerializable (strongest)Can weaken for performance

Picking the Right Model

The decision comes down to two questions:

1. What happens if a user reads stale data?

  • Nothing bad → eventual consistency is fine
  • Confusion or bad UX → read-your-writes
  • Financial loss or data corruption → strong consistency

2. How much latency can we tolerate?

  • Milliseconds matter → eventual consistency (fastest)
  • Some latency is OK → causal or read-your-writes
  • Correctness over speed → strong consistency

A Practical Example

Imagine we’re building an e-commerce platform:

  • Product catalog (price, description) → Eventual consistency. If a price update takes a second to propagate, nobody gets hurt.
  • User sessions → Read-your-writes. A user logs in and should immediately see their cart.
  • Inventory count → Strong consistency. We can’t sell 5 items when only 3 are in stock.
  • Review counts → Eventual consistency. Showing “142 reviews” when it’s actually 143 for a few seconds is harmless.

Different parts of the same system can (and should) use different consistency models.

Key Takeaway

In simple language, consistency models are about how fresh our data needs to be when someone reads it. Strong consistency means always fresh — but slow. Eventual consistency means sometimes stale — but fast. Most real systems mix and match: strong where it matters (money, inventory), eventual where it doesn’t (likes, views). The skill is knowing which parts of our system need which guarantee.


25

Failover and Redundancy

intermediate 4-7 YOE failover redundancy high-availability SLA uptime

Things break. Servers crash, disks fail, networks go down, entire data centers lose power. The question isn’t if something will fail — it’s when. Failover and redundancy are how we design systems that keep working even when parts of them die.

What Is Redundancy?

Redundancy means having backup copies of everything critical. If one server dies, another takes over. If one database crashes, a replica is ready to go. We eliminate single points of failure.

Think of it like having a spare tire in our car. We don’t plan to get a flat, but when it happens, we’re not stuck on the side of the road.

What Is Failover?

Failover is the process of switching from a failed component to a healthy backup. It’s the mechanism that actually makes redundancy useful.

Without failover, having a backup database is like having a spare tire but no jack — the backup exists but we can’t switch to it.

Active-Passive Failover

The most common pattern. One server (the active/primary) handles all traffic. One or more backup servers (the passive/standby) sit idle, receiving replicated data but serving no traffic.

When the primary dies, a standby gets promoted to primary and starts handling traffic.

Active-Passive vs Active-Active
Active-Passive
Traffic → [Primary ✓]
replication ↓
[Standby 💤]
Primary fails ↓
Traffic → [Standby → New Primary ✓]
Active-Active
      ↗ [Server A ✓]
Traffic
      ↘ [Server B ✓]
Server A fails ↓
All Traffic → [Server B ✓]

Pros:

  • Simple to set up
  • No data conflict issues (only one node writes at a time)
  • Standby can also serve read queries (read replicas)

Cons:

  • The standby is wasted capacity — it sits idle most of the time
  • Failover isn’t instant — there’s a brief downtime during switchover
  • Risk of data loss if the primary dies before replicating its latest writes

Active-Active Failover

All servers handle traffic simultaneously. A load balancer splits requests across them. If one dies, the others absorb the extra load.

Pros:

  • No wasted capacity — every server is doing useful work
  • Better performance — load is distributed
  • Faster failover — no promotion step, traffic just routes around the dead node

Cons:

  • More complex — we need to handle data conflicts (what if two servers accept conflicting writes?)
  • Need a load balancer in front
  • Data synchronization is harder

Best for: Stateless application servers (easy), databases with multi-master replication (hard but possible with Cassandra, CockroachDB).

Redundancy at Every Layer

A chain is only as strong as its weakest link. We need redundancy at every layer of our stack:

LayerRedundancy Strategy
DNSMultiple DNS providers, DNS failover
Load BalancerPair of LBs in active-passive
Application ServersMultiple instances behind LB
DatabasePrimary + read replicas + standby
CacheRedis Sentinel or Redis Cluster
StorageS3 (11 nines durability, built-in replication)
Data CenterMulti-AZ or multi-region deployment

The most overlooked one? The load balancer itself. If we put all our servers behind a single load balancer and that LB dies, everything is down. Always have redundant LBs.

The Nines of Availability

When we talk about uptime, we use “nines”:

AvailabilityDowntime/YearDowntime/MonthDowntime/Week
99% (two nines)3.65 days7.31 hours1.68 hours
99.9% (three nines)8.77 hours43.8 minutes10.1 minutes
99.99% (four nines)52.6 minutes4.38 minutes1.01 minutes
99.999% (five nines)5.26 minutes26.3 seconds6.05 seconds

Going from 99.9% to 99.99% is exponentially harder and more expensive. Most web apps aim for three nines (99.9%). Banks and critical infra aim for four or five nines.

SLA vs SLO vs SLI

These terms come up in interviews:

  • SLI (Service Level Indicator) — The actual metric. “Our p99 latency is 200ms.”
  • SLO (Service Level Objective) — Our internal target. “We aim for 99.9% uptime.”
  • SLA (Service Level Agreement) — A legal promise to customers. “If we drop below 99.9%, we give you credits.”

The SLA is always looser than the SLO. We target 99.95% internally (SLO) so we don’t breach our 99.9% customer promise (SLA).

Health Checks and Heartbeats

How does the system know something is dead? Two main approaches:

Health checks (pull-based): A monitor pings each server periodically. “Are you alive? Return 200 OK.” If a server doesn’t respond after several retries, it’s considered dead.

Heartbeats (push-based): Each server periodically sends a “I’m alive” signal. If the monitor doesn’t hear from a server within a timeout, it’s considered dead.

Most load balancers use health checks. Most cluster managers (like Kubernetes) use a combination of both.

Health Check: Monitor → "GET /health" → Server → "200 OK"
             Monitor → "GET /health" → Server → (no response)
             Monitor → "GET /health" → Server → (no response)
             Monitor: "Server is dead, removing from pool"

Failover Strategies for Databases

Database failover is the trickiest because data is involved:

  1. Cold standby — Standby is off, brought online from backups. Slowest recovery (minutes to hours).
  2. Warm standby — Standby receives replicated data but doesn’t serve traffic. Moderate recovery (seconds to minutes).
  3. Hot standby — Standby is fully synced and ready to take over immediately. Fastest recovery (seconds).

Most cloud-managed databases (RDS, Cloud SQL) use warm or hot standby with automatic failover.

Key Takeaway

In simple language, redundancy is having backups of everything, and failover is the process of switching to those backups when something breaks. Active-passive is simpler (one server works, one waits). Active-active is more efficient (all servers work, survivors absorb the load). Build redundancy at every layer, understand our uptime target in nines, and make sure our health checks actually work. The goal is making our system boring — it keeps running no matter what breaks.


26

Circuit Breaker and Bulkhead Patterns

advanced 4-7 YOE circuit-breaker bulkhead resilience retry cascading-failure

In a distributed system, one failing service can take down everything else. Imagine our payment service is slow. Every request to it hangs for 30 seconds. Our order service calls payment and waits. Its threads get consumed. Now the order service can’t handle any requests either. Soon, the whole system is frozen. This is a cascading failure, and it’s one of the scariest things in distributed systems.

Circuit breakers and bulkheads are patterns that prevent this.

Circuit Breaker Pattern

Think of it like an electrical circuit breaker in our house. When there’s a power surge, the breaker trips and cuts the circuit — protecting everything from frying. Same idea in software.

A circuit breaker wraps calls to an external service and monitors for failures. When failures cross a threshold, the breaker “trips” and stops making calls entirely. Instead of waiting 30 seconds for a timeout, we fail instantly. This gives the downstream service time to recover without being hammered by requests.

The Three States

Circuit Breaker State Machine
CLOSED
Normal operation
Requests flow through
Counting failures
failure threshold exceeded →
OPEN
Requests fail fast
No calls to service
Waiting for timeout
timeout expires →
HALF-OPEN
Let one request through
If success → CLOSED
If fail → OPEN

Closed — Everything is normal. Requests pass through to the downstream service. The breaker counts failures. If failures stay below the threshold, life is good.

Open — Too many failures happened. The breaker trips. All requests immediately fail with a fallback response. No calls are made to the downstream service at all. This state lasts for a configured timeout (e.g., 30 seconds).

Half-Open — After the timeout, the breaker lets one request through as a test. If it succeeds, the breaker closes (back to normal). If it fails, the breaker opens again for another timeout period.

What Happens When the Circuit Is Open?

We don’t just show errors. We use fallbacks:

  • Return cached data (slightly stale but better than nothing)
  • Return a default value (“0 items in cart” instead of crashing)
  • Queue the request for later processing
  • Show a degraded experience (“Recommendations unavailable”)
# Pseudocode for a circuit breaker
def get_recommendations(user_id):
    if circuit_breaker.is_open():
        return cached_recommendations(user_id)  # fallback

    try:
        result = recommendation_service.get(user_id)
        circuit_breaker.record_success()
        return result
    except TimeoutError:
        circuit_breaker.record_failure()
        return cached_recommendations(user_id)  # fallback

Configuration

Typical circuit breaker settings:

  • Failure threshold: 5 failures in 60 seconds → trip
  • Open duration: 30 seconds before trying half-open
  • Half-open test: Allow 1 request through
  • Success threshold: 3 consecutive successes in half-open → close

These numbers aren’t magic. We tune them based on our service’s behavior.

Bulkhead Pattern

The bulkhead pattern isolates different parts of our system so a failure in one doesn’t sink the rest. The name comes from ships — a ship’s hull is divided into watertight compartments (bulkheads). If one compartment floods, the others stay dry and the ship stays afloat.

In software, we create isolation boundaries:

Thread pool bulkhead: Each downstream service gets its own thread pool. If the payment service is slow and consumes all its threads, the order-processing thread pool is untouched.

┌─────────────────────────────────────┐
│         Application Server          │
│                                     │
│  ┌──────────────┐ ┌──────────────┐  │
│  │ Payment Pool │ │  Order Pool  │  │
│  │  10 threads  │ │  20 threads  │  │
│  │  ⚠️ 10/10    │ │  ✅ 5/20    │  │
│  │  (saturated) │ │  (healthy)   │  │
│  └──────────────┘ └──────────────┘  │
│                                     │
│  Payment is broken but orders       │
│  keep processing normally.          │
└─────────────────────────────────────┘

Connection pool bulkhead: Same idea but with database connections. One slow query pattern doesn’t exhaust connections needed by other queries.

Process bulkhead: Run different features as separate services or containers. If the image processing service OOMs, the authentication service keeps running.

Retry with Exponential Backoff

When a request fails, our first instinct is to retry immediately. Bad idea. If the service is overloaded, 1,000 clients all retrying instantly makes it worse. That’s a retry storm.

Instead, we use exponential backoff:

Attempt 1: wait 1 second
Attempt 2: wait 2 seconds
Attempt 3: wait 4 seconds
Attempt 4: wait 8 seconds
(give up after max retries)

We also add jitter — a random offset so all clients don’t retry at the exact same time:

Attempt 1: wait 1s + random(0-500ms)
Attempt 2: wait 2s + random(0-500ms)
Attempt 3: wait 4s + random(0-500ms)

When NOT to Retry

  • 4xx errors (400, 401, 403, 404) — The request is bad. Retrying won’t fix it.
  • Non-idempotent operations — If we POST a payment and it timed out, retrying might charge twice. Only retry operations that are safe to repeat.

Combining All Three

In practice, we use circuit breakers, bulkheads, and retries together:

  1. Bulkhead isolates the failure so it doesn’t spread
  2. Retry with backoff handles transient errors
  3. Circuit breaker stops us from hammering a dead service
Request
  → Bulkhead (uses dedicated thread pool)
    → Circuit Breaker (checks if service is healthy)
      → Retry with backoff (handles transient failures)
        → Actual service call

Libraries like Resilience4j (Java), Polly (.NET), and Hystrix (Java, deprecated but influential) implement all three patterns.

Key Takeaway

In simple language, circuit breakers stop us from calling a broken service (fail fast instead of waiting forever). Bulkheads isolate failures so one broken thing doesn’t sink everything. Retries with exponential backoff handle temporary hiccups without overwhelming the system. Together, these three patterns are how we build systems that degrade gracefully instead of collapsing entirely when things go wrong.


27

Monitoring, Logging, and Alerting

intermediate 2-4 YOE monitoring logging alerting observability metrics tracing

We can design the most elegant system in the world, but if we can’t see what’s happening inside it, we’re flying blind. When something breaks at 3 AM, the question is: do we have enough information to figure out what went wrong, or are we guessing?

Monitoring, logging, and alerting are how we answer that question. In system design interviews, they’re often the final piece that shows we think about real production systems, not just architecture diagrams.

The Three Pillars of Observability

Three Pillars of Observability
Metrics
Numeric measurements
Aggregated over time
Cheap to store
"What is happening?"
Prometheus, Datadog, CloudWatch
Logs
Discrete events
Detailed context
Expensive at scale
"What happened?"
ELK Stack, Loki, Splunk
Traces
Request journey
Across services
Shows bottlenecks
"Where is it slow?"
Jaeger, Zipkin, Datadog APM

Each pillar answers a different question. We need all three for full observability.

Metrics — The Numbers

Metrics are numeric values measured over time. They tell us the health of our system at a glance.

The Four Golden Signals (from Google SRE)

  1. Latency — How long requests take. Not just the average — we care about percentiles.

    • p50 (median): Half the requests are faster than this
    • p95: 95% of requests are faster — this catches the slow tail
    • p99: 99% are faster — the really unlucky users
    • A p50 of 50ms with a p99 of 5 seconds means most users are happy but 1% are miserable
  2. Traffic — How many requests per second we’re handling. Helps us know when to scale.

  3. Error rate — Percentage of requests that fail (5xx errors, timeouts). A sudden spike means something broke.

  4. Saturation — How full our resources are. CPU at 90%? Memory almost maxed? Disk filling up? This tells us how close we are to falling over.

How Metrics Work

Most metrics systems use a pull model. Our application exposes metrics at an endpoint (like /metrics), and a collector (Prometheus) scrapes it periodically.

App Server → exposes /metrics
Prometheus → scrapes every 15 seconds
Grafana   → queries Prometheus, shows dashboards

Logs — The Details

Logs are records of discrete events. When something breaks and metrics show a spike, logs tell us why.

Structured Logging

Plain text logs are hard to search:

[2024-03-15 14:23:01] ERROR: Payment failed for user 12345

Structured logs (JSON) are much better:

{
  "timestamp": "2024-03-15T14:23:01Z",
  "level": "error",
  "service": "payment",
  "user_id": 12345,
  "error": "card_declined",
  "amount": 49.99,
  "request_id": "abc-123"
}

Structured logs let us filter and search: “show me all errors from the payment service where amount > $100 in the last hour.”

Log Aggregation

With 50 servers, we can’t SSH into each one to read logs. We ship all logs to a central system:

  • ELK Stack — Elasticsearch (store/search) + Logstash (process) + Kibana (visualize). Self-hosted, powerful, complex.
  • Loki + Grafana — Lightweight log aggregation. Pairs perfectly with Prometheus.
  • Splunk — Enterprise log management. Expensive but very powerful.
  • Datadog Logs — SaaS. Easy setup, good integration with metrics and traces.

Log Levels

Use them correctly:

  • DEBUG — Verbose detail for development. Never in production.
  • INFO — Normal operations. “User signed up,” “Order placed.”
  • WARN — Something unexpected but not broken. “Cache miss rate high.”
  • ERROR — Something failed. “Payment processing failed.”
  • FATAL — The system is going down. “Database connection lost.”

Distributed Tracing — The Journey

In a microservices architecture, a single user request might touch 10 different services. When that request is slow, which service is the bottleneck?

Distributed tracing follows a request across all services by propagating a trace ID:

User Request (trace-id: abc-123)
  → API Gateway     [12ms]
    → Auth Service   [5ms]
    → Order Service  [350ms]  ← bottleneck!
      → Inventory DB [300ms]  ← root cause!
      → Payment Svc  [45ms]
    → Notification   [8ms]

Each step is a span. All spans with the same trace ID form a complete picture of the request’s journey. We can see that the order service is slow because the inventory database query took 300ms.

Alerting — The Wake-Up Call

Monitoring is useless if nobody looks at the dashboards. Alerting bridges that gap — it tells us when something needs attention.

What to Alert On

Good alerts are actionable. Every alert should require a human to do something. If we get an alert and the response is “meh, it’ll fix itself,” that’s a bad alert.

Alert on symptoms, not causes:

  • Good: “Error rate exceeded 5% for 5 minutes” (symptom)
  • Bad: “CPU usage above 80%” (cause — might be normal during a deploy)

Alert on SLO breaches:

  • “p99 latency exceeded 500ms for the last 10 minutes”
  • “Availability dropped below 99.9% for this billing period”

Alert Fatigue

The worst thing we can do is alert on everything. When teams get 200 alerts a day, they start ignoring all of them. And then they miss the real one.

Rules for good alerting:

  • Every alert should have a clear owner
  • Every alert should have a runbook (what to do when it fires)
  • Review alerts monthly — if we keep dismissing one, delete it or fix the root cause
  • Use severity levels: critical (page someone at 3 AM) vs warning (check it tomorrow)

The Monitoring Stack

A typical production setup:

ToolPurpose
PrometheusMetrics collection and storage
GrafanaDashboards and visualization
ELK / LokiLog aggregation and search
Jaeger / ZipkinDistributed tracing
PagerDuty / OpsGenieAlert routing and on-call management
DatadogAll-in-one SaaS (metrics + logs + traces + alerts)

In System Design Interviews

When wrapping up a system design answer, mentioning monitoring shows maturity:

  • “We’d track p99 latency and error rate per service with Prometheus and Grafana”
  • “Structured JSON logs shipped to ELK for debugging”
  • “Distributed tracing with Jaeger so we can find bottlenecks across services”
  • “Alerting on SLO breaches — if p99 exceeds 500ms for 5 minutes, page the on-call”

We don’t need to go deep. Just showing we think about operational concerns sets us apart.

Key Takeaway

In simple language, metrics tell us what’s happening (numbers over time), logs tell us why it happened (detailed events), and traces show us where it happened (a request’s journey across services). Together, they give us observability — the ability to understand our system’s behavior from the outside. Without them, debugging production issues is just guessing.


Communication Protocols

28

REST API Design

beginner 0-2 YOE REST API HTTP API-design pagination

REST is the most common way to build web APIs. When we open a mobile app and it loads our feed, it’s almost certainly making REST API calls behind the scenes. Understanding REST API design is fundamental for system design interviews because every system we design will have an API layer.

What Is REST?

REST stands for Representational State Transfer. In simple language, it’s a set of conventions for building APIs over HTTP. We use URLs to identify resources, HTTP methods to define actions, and JSON to send data back and forth.

REST isn’t a protocol or a standard — it’s an architectural style. That means there’s no strict spec, just widely accepted conventions.

Core Principles

  1. Resource-based — Everything is a resource (user, post, order) with a URL
  2. Stateless — Each request contains everything the server needs. No session state stored between requests.
  3. Uniform interface — Consistent URL patterns and HTTP methods
  4. Client-server — Frontend and backend are separate and can evolve independently

HTTP Methods

Each method maps to a CRUD operation:

MethodPurposeIdempotent?Example
GETRead a resourceYesGET /users/123
POSTCreate a new resourceNoPOST /users
PUTReplace an entire resourceYesPUT /users/123
PATCHUpdate part of a resourceYesPATCH /users/123
DELETERemove a resourceYesDELETE /users/123

Idempotent means calling it multiple times has the same effect as calling it once. GET, PUT, PATCH, and DELETE are idempotent. POST is not — calling POST /orders three times creates three orders.

Status Codes

We don’t just return data — we return a status code that tells the client what happened:

Success (2xx)

  • 200 OK — Request succeeded. Used for GET, PUT, PATCH.
  • 201 Created — Resource created. Used for POST.
  • 204 No Content — Success but nothing to return. Used for DELETE.

Client Error (4xx)

  • 400 Bad Request — The request body is malformed or invalid.
  • 401 Unauthorized — Not authenticated. “Who are you?”
  • 403 Forbidden — Authenticated but not allowed. “I know who you are, but you can’t do this.”
  • 404 Not Found — Resource doesn’t exist.
  • 409 Conflict — Conflicts with current state (e.g., duplicate email).
  • 429 Too Many Requests — Rate limit exceeded. Slow down.

Server Error (5xx)

  • 500 Internal Server Error — Something broke on our end.
  • 502 Bad Gateway — Our server got a bad response from an upstream service.
  • 503 Service Unavailable — Server is overloaded or down for maintenance.

A good rule: 4xx means the client did something wrong. 5xx means we did something wrong.

URL Design

Good REST URLs are intuitive and consistent:

# Resources are nouns, not verbs
✅ GET  /users/123
❌ GET  /getUser?id=123

# Plural nouns for collections
✅ GET  /users
❌ GET  /user

# Nested resources for relationships
✅ GET  /users/123/posts        (posts by user 123)
✅ GET  /users/123/posts/456    (specific post by user 123)

# Query params for filtering, sorting, searching
✅ GET  /posts?status=published&sort=created_at&order=desc
✅ GET  /users?search=manish

# Actions that don't fit CRUD → use sub-resources
✅ POST /orders/123/cancel
✅ POST /users/123/verify-email

API Design Example: Blog Platform

# Posts
GET    /posts                    → List all posts
GET    /posts/123                → Get post 123
POST   /posts                    → Create a new post
PUT    /posts/123                → Replace post 123
PATCH  /posts/123                → Update post 123 partially
DELETE /posts/123                → Delete post 123

# Comments on a post
GET    /posts/123/comments       → List comments on post 123
POST   /posts/123/comments       → Add comment to post 123
DELETE /posts/123/comments/456   → Delete comment 456

# User's posts
GET    /users/789/posts          → All posts by user 789

A response might look like:

{
  "id": 123,
  "title": "REST API Design Guide",
  "content": "...",
  "author": {
    "id": 789,
    "name": "Manish"
  },
  "created_at": "2024-03-15T10:30:00Z",
  "tags": ["api", "design"]
}

Pagination

When a collection has thousands of items, we can’t return them all at once. Two main approaches:

Offset-Based Pagination

GET /posts?page=2&limit=20

Simple but has problems:

  • If items are added/deleted between pages, we might skip or duplicate items
  • OFFSET 10000 in SQL is slow — the DB still scans 10,000 rows to skip them

Cursor-Based Pagination

GET /posts?after=abc123&limit=20

The cursor (abc123) is an opaque token pointing to the last item we saw. The server fetches the next 20 items after that cursor.

  • No skipping or duplicates even if data changes
  • Fast — uses an index scan instead of OFFSET
  • Used by Twitter, Facebook, Slack — basically everyone at scale

Response includes the next cursor:

{
  "data": [...],
  "pagination": {
    "next_cursor": "def456",
    "has_more": true
  }
}

API Versioning

APIs evolve. We can’t break existing clients when we change things.

URL versioning (most common):

GET /v1/users/123
GET /v2/users/123

Header versioning:

GET /users/123
Accept: application/vnd.myapi.v2+json

Query parameter:

GET /users/123?version=2

URL versioning is the most explicit and easiest to understand. It’s what most APIs use.

Other Best Practices

  • Use JSON for request and response bodies (set Content-Type: application/json)
  • Use ISO 8601 for dates2024-03-15T10:30:00Z, not “March 15, 2024”
  • Return meaningful errors — Include an error code and message
  • Rate limiting — Protect our API from abuse with 429 Too Many Requests
  • HTTPS everywhere — Never serve an API over plain HTTP
  • Authentication — Use Bearer tokens (JWT) or API keys in the Authorization header
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Email is required",
    "field": "email"
  }
}

Key Takeaway

In simple language, REST is about using URLs to identify things (resources), HTTP methods to do things to them (GET, POST, PUT, DELETE), and status codes to say what happened. Keep URLs clean and consistent, use cursor-based pagination for large datasets, version our API from day one, and always return helpful error messages. REST isn’t perfect for every use case, but it’s the default for good reason — it’s simple, well-understood, and works great for most APIs.


29

GraphQL

intermediate 2-4 YOE GraphQL API query-language REST-vs-GraphQL

GraphQL was created by Facebook in 2012 to solve a very specific problem: their mobile app needed data from many different resources, and REST APIs were either sending too much data or requiring too many requests. Instead of the server deciding what data to return, GraphQL lets the client decide.

The Problem GraphQL Solves

Imagine we’re building a social media app. We want to show a user’s profile with their name, avatar, and last 3 posts (each with title and like count).

With REST (the problem)

Over-fetching: We call GET /users/123 and get back 30 fields — name, email, phone, address, bio, settings — when we only need name and avatar. We’re downloading data we don’t use.

Under-fetching: We need the user AND their posts. That’s two separate API calls: GET /users/123 then GET /users/123/posts. On a slow mobile connection, each round trip hurts.

The N+1 problem: We fetch 3 posts, and each post needs the author info. That’s 1 call for posts + 3 calls for authors = 4 requests total.

With GraphQL (the solution)

One request. We ask for exactly what we need:

query {
  user(id: 123) {
    name
    avatarUrl
    posts(last: 3) {
      title
      likeCount
    }
  }
}

Response — nothing more, nothing less:

{
  "data": {
    "user": {
      "name": "Manish",
      "avatarUrl": "https://...",
      "posts": [
        { "title": "System Design Basics", "likeCount": 42 },
        { "title": "CAP Theorem Explained", "likeCount": 28 },
        { "title": "REST vs GraphQL", "likeCount": 35 }
      ]
    }
  }
}

Core Concepts

Schema and Types

GraphQL has a strongly-typed schema. The schema defines what data exists and how it’s related:

type User {
  id: ID!
  name: String!
  email: String!
  avatarUrl: String
  posts: [Post!]!
}

type Post {
  id: ID!
  title: String!
  content: String!
  likeCount: Int!
  author: User!
}

type Query {
  user(id: ID!): User
  posts(limit: Int): [Post!]!
}

The ! means non-nullable. [Post!]! means a non-null list of non-null posts. The schema is the contract between frontend and backend.

Queries (Reading Data)

Queries are how we read data. The client specifies exactly which fields it wants:

# Just the name
query {
  user(id: 123) {
    name
  }
}

# Name + email + all posts with comments
query {
  user(id: 123) {
    name
    email
    posts {
      title
      comments {
        text
        author { name }
      }
    }
  }
}

Same endpoint, different shapes of data. The client controls it.

Mutations (Writing Data)

Mutations change data — create, update, delete:

mutation {
  createPost(input: {
    title: "New Post"
    content: "Hello world"
  }) {
    id
    title
    createdAt
  }
}

We can also ask for specific fields back in the response — so we don’t need a separate GET call after creating something.

Subscriptions (Real-Time Data)

Subscriptions let the client listen for real-time updates over WebSocket:

subscription {
  newMessage(chatId: "abc") {
    text
    sender { name }
    timestamp
  }
}

Whenever a new message is sent to that chat, the server pushes it to all subscribers.

REST vs GraphQL Comparison

AspectRESTGraphQL
EndpointsMultiple (/users, /posts)Single (/graphql)
Data fetchingServer decides what to returnClient decides what to return
Over-fetchingCommonImpossible (by design)
Under-fetchingCommon (need multiple calls)Solved (nested queries)
CachingEasy (HTTP caching by URL)Harder (single endpoint, POST requests)
File uploadsEasy (multipart)Awkward (needs workarounds)
Error handlingHTTP status codesAlways 200, errors in response body
Learning curveLowMedium
ToolingMature and everywhereGrowing fast

When to Use GraphQL

GraphQL shines when:

  • Our frontend needs data from multiple related resources in one call
  • Different clients (web, mobile, watch) need different data shapes
  • We’re tired of creating custom REST endpoints for each screen
  • Rapid frontend iteration — frontend can change data requirements without backend changes

Stick with REST when:

  • Our API is simple CRUD operations
  • We need HTTP caching heavily (GraphQL makes this harder)
  • We’re doing file uploads/downloads
  • Our team is small and the overhead of a GraphQL schema isn’t worth it
  • We’re building a public API (REST is more universally understood)

The Downsides

GraphQL isn’t perfect:

Complexity: We need a schema, resolvers, and a query engine. More moving parts than REST.

Caching is harder: REST uses URLs as cache keys. GraphQL uses a single endpoint with POST requests, so HTTP caching doesn’t work out of the box. We need client-side caching (Apollo Client, urql).

N+1 on the server: If a query asks for 100 posts with their authors, the resolver might make 100 separate database calls for authors. Solution: use DataLoader to batch and deduplicate.

Security: Clients can craft deeply nested queries that overload the server. We need query depth limiting and complexity analysis.

# A malicious query
query {
  user(id: 1) {
    posts {
      comments {
        author {
          posts {
            comments {
              author {
                # ...infinite nesting
              }
            }
          }
        }
      }
    }
  }
}

Key Takeaway

In simple language, GraphQL lets the client ask for exactly the data it needs in a single request — no over-fetching, no under-fetching. It’s great when we have complex, interconnected data and multiple client types. But it adds complexity compared to REST, especially around caching and security. For most simple APIs, REST is still the better choice. For data-rich applications with lots of relationships (social networks, dashboards, e-commerce), GraphQL really shines.


30

WebSockets

intermediate 2-4 YOE WebSocket real-time bidirectional HTTP scaling

HTTP was built for a simple pattern: the client asks, the server answers. But what if the server needs to push data to the client without being asked? Think chat messages, live notifications, stock price tickers, or multiplayer games. The server can’t wait for the client to ask “any new messages?” every second. We need a way for both sides to send data whenever they want. That’s what WebSockets are for.

HTTP vs WebSocket

HTTP vs WebSocket
HTTP (Request-Response)
Client → GET /messages → Server
Client ← 200 OK + data ← Server
connection closed
Client → GET /messages → Server
Client ← 200 OK + data ← Server
connection closed
WebSocket (Persistent)
Client → HTTP Upgrade → Server
═══ connection stays open ═══
Client → "hello!"
"hey there!" ← Server
"new message!" ← Server
Client → "typing..."
═══ still open ═══

With HTTP, the client always initiates. The server can only respond — it can’t proactively send data. Every exchange opens a new connection, sends data, and closes.

With WebSocket, we open a connection once and both sides can send data anytime. The connection stays open for as long as needed — minutes, hours, even days.

The WebSocket Handshake

WebSocket connections start as a regular HTTP request, then “upgrade” to a WebSocket connection:

Client → Server:
GET /chat HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

Server → Client:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

After the 101 Switching Protocols response, the connection is upgraded. From here on, both sides communicate using WebSocket frames — lightweight binary packets, not HTTP.

Common Use Cases

  • Chat applications — Slack, Discord, WhatsApp Web. Messages need to arrive instantly.
  • Live notifications — Facebook notifications, Gmail new email alerts.
  • Real-time dashboards — Stock tickers, monitoring dashboards, live sports scores.
  • Collaborative editing — Google Docs, Figma. Multiple users editing the same document.
  • Multiplayer gaming — Player positions, actions, game state updates.
  • Live streaming — Chat alongside a video stream (Twitch chat).

The common thread: all of these need the server to push data to the client, and latency matters.

A Simple WebSocket Example

Client-side (JavaScript):

const ws = new WebSocket('wss://chat.example.com/ws');

ws.onopen = () => {
  console.log('Connected!');
  ws.send(JSON.stringify({ type: 'join', room: 'general' }));
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  console.log('New message:', message);
};

ws.onclose = () => {
  console.log('Disconnected. Reconnecting...');
  // reconnect logic here
};

Server-side (Node.js):

const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (ws) => {
  ws.on('message', (data) => {
    const msg = JSON.parse(data);
    // broadcast to all connected clients
    wss.clients.forEach(client => {
      if (client.readyState === WebSocket.OPEN) {
        client.send(JSON.stringify(msg));
      }
    });
  });
});

Scaling WebSocket Connections

Here’s where it gets tricky. Each WebSocket connection is a persistent TCP connection held in memory. A single server might handle 10,000-100,000 connections. But what happens when we have multiple servers?

The Problem

User A connects to Server 1. User B connects to Server 2. User A sends a message to User B. But Server 1 doesn’t know about User B — that connection lives on Server 2.

Solution 1: Sticky Sessions

Route each user to the same server every time using a load balancer. Simple but limits our ability to distribute load evenly.

Solution 2: Pub/Sub Backbone

All servers subscribe to a shared message bus (Redis Pub/Sub, Kafka, NATS). When Server 1 receives a message, it publishes to the bus. Server 2 picks it up and delivers to User B.

User A → Server 1 → Redis Pub/Sub → Server 2 → User B
                   → Server 3 → User C
                   → Server 1 → User D

This is the standard pattern. Redis Pub/Sub is the most common choice for chat-scale applications.

Solution 3: Managed WebSocket Services

Skip the complexity and use a managed service:

  • AWS API Gateway WebSocket — Serverless WebSocket support
  • Pusher / Ably — Real-time messaging as a service
  • Socket.IO — Library that handles reconnection, fallbacks, and rooms

Connection Management

WebSocket connections are long-lived, so we need to handle:

Heartbeats (ping/pong): The server sends periodic pings. If the client doesn’t respond with a pong, the connection is dead — close it and free resources.

Reconnection: Connections drop all the time (network changes, phone going to sleep). Clients must implement auto-reconnect with exponential backoff.

Authentication: We can’t send auth headers on every message like HTTP. Authenticate during the handshake or in the first message after connecting.

Connection limits: Each connection uses memory. Set a max per server and reject new connections gracefully when full.

When NOT to Use WebSockets

WebSockets aren’t always the answer:

  • Occasional updates (new email every few minutes) → Server-Sent Events or long polling are simpler
  • One-directional server push → SSE is lighter weight
  • Standard CRUD operations → REST is simpler and cacheable
  • Firewalls and proxies — Some corporate networks block WebSocket connections

Key Takeaway

In simple language, WebSockets give us a persistent, two-way connection between client and server. Instead of the client constantly asking “anything new?”, the server can push data whenever it wants. They’re essential for real-time features like chat, live notifications, and collaborative editing. The main challenge is scaling — each connection lives on a specific server, so we need a pub/sub layer (like Redis) to bridge messages across servers.


31

gRPC and Protocol Buffers

advanced 4-7 YOE gRPC protobuf RPC microservices protocol-buffers

REST uses JSON over HTTP. It’s human-readable, widely supported, and great for public APIs. But when two microservices are talking to each other thousands of times per second inside our data center, we don’t need human-readability. We need speed. That’s where gRPC comes in.

What Is gRPC?

gRPC (gRPC Remote Procedure Calls) is an open-source framework created by Google. It lets services call methods on other services as if they were local function calls. Under the hood, it uses HTTP/2 for transport and Protocol Buffers (protobuf) for serialization.

In simple language, instead of crafting HTTP requests with URLs and JSON, we just call orderService.createOrder(orderData) and gRPC handles the networking.

Protocol Buffers (Protobuf)

Protobuf is a binary serialization format. We define our data structure in a .proto file, and protobuf generates code in any language.

// user.proto
syntax = "proto3";

package userservice;

message User {
  int32 id = 1;
  string name = 2;
  string email = 3;
  repeated string roles = 4;
}

message GetUserRequest {
  int32 id = 1;
}

message GetUserResponse {
  User user = 1;
}

service UserService {
  rpc GetUser(GetUserRequest) returns (GetUserResponse);
  rpc CreateUser(User) returns (User);
  rpc ListUsers(ListUsersRequest) returns (stream User);
}

The numbers (= 1, = 2) are field tags — they identify fields in the binary format and should never be changed once deployed.

Why Binary?

JSON is text-based. A number like 12345 is 5 characters (5 bytes). In protobuf, it’s a variable-length integer — just 2 bytes. At millions of requests per second, those savings add up.

FormatSize (typical)Parse SpeedHuman Readable
JSONLargerSlowerYes
Protobuf3-10x smaller5-100x fasterNo
XMLLargestSlowestYes

How gRPC Works

  1. We define our service and messages in a .proto file
  2. The protobuf compiler (protoc) generates client and server code in our language
  3. The server implements the service methods
  4. The client calls those methods like regular functions
  5. gRPC handles serialization, HTTP/2 transport, and error handling
Client                           Server
  │                                │
  │  orderService.CreateOrder()    │
  │  ──── protobuf bytes ────→    │
  │        over HTTP/2             │
  │                                │  → executes CreateOrder()
  │  ←──── protobuf bytes ────    │
  │       (response)               │

We never manually construct HTTP requests or parse JSON. The generated code handles everything.

Four Communication Patterns

1. Unary RPC (Request-Response)

Like a regular function call. Client sends one request, server sends one response.

rpc GetUser(GetUserRequest) returns (GetUserResponse);

This is the most common pattern. Same as a REST API call but faster.

2. Server Streaming

Client sends one request, server sends back a stream of responses.

rpc ListUsers(ListUsersRequest) returns (stream User);

Great for: fetching large datasets incrementally, real-time event feeds, live logs.

3. Client Streaming

Client sends a stream of requests, server sends one response when the client is done.

rpc UploadChunks(stream FileChunk) returns (UploadResult);

Great for: file uploads, sending sensor data, batch operations.

4. Bidirectional Streaming

Both sides send streams of messages simultaneously. Like a WebSocket conversation but with typed messages.

rpc Chat(stream ChatMessage) returns (stream ChatMessage);

Great for: chat systems, collaborative editing, real-time gaming.

gRPC vs REST

AspectRESTgRPC
ProtocolHTTP/1.1 (usually)HTTP/2 (always)
FormatJSON (text)Protobuf (binary)
SpeedGood2-10x faster
Payload sizeLarger3-10x smaller
StreamingLimited (SSE, WebSocket)Built-in (4 patterns)
Code generationOptional (OpenAPI)Required (proto files)
Browser supportNativeNeeds grpc-web proxy
SchemaOptional (OpenAPI/Swagger)Required (proto files)
CachingEasy (HTTP caching)Hard
Human debuggingEasy (read JSON)Hard (binary data)
Learning curveLowMedium-High

When to Use gRPC

Use gRPC for:

  • Microservice-to-microservice communication inside our infrastructure
  • High-throughput services where performance matters (10,000+ RPS)
  • Polyglot environments — proto files generate code for Go, Java, Python, C++, etc.
  • Streaming use cases — server streaming, bidirectional streaming
  • Mobile clients — smaller payloads = less bandwidth on cellular networks

Stick with REST for:

  • Public APIs — REST is universally understood, gRPC needs protobuf knowledge
  • Browser clients — browsers can’t make native gRPC calls (need grpc-web proxy)
  • Simple CRUD — the overhead of proto files isn’t worth it for basic operations
  • Debugging — we can’t just curl a gRPC endpoint and read the response

HTTP/2 Advantages

gRPC uses HTTP/2, which gives us:

  • Multiplexing — Multiple RPC calls over a single TCP connection. No head-of-line blocking.
  • Header compression — HTTP/2 compresses headers with HPACK, reducing overhead on repeated calls.
  • Binary framing — More efficient than HTTP/1.1 text parsing.

In REST over HTTP/1.1, each request needs its own TCP connection (or waits in a queue). In gRPC over HTTP/2, we multiplex hundreds of concurrent calls on one connection.

Error Handling

gRPC uses its own status codes (similar to HTTP but different):

gRPC CodeMeaningSimilar HTTP
OKSuccess200
NOT_FOUNDResource doesn’t exist404
ALREADY_EXISTSDuplicate409
PERMISSION_DENIEDNot authorized403
UNAVAILABLEService is down503
DEADLINE_EXCEEDEDTimeout408
INTERNALServer error500

gRPC also has built-in deadlines. Every call can specify a timeout, and if the server doesn’t respond in time, the call fails with DEADLINE_EXCEEDED. This prevents hanging calls.

Key Takeaway

In simple language, gRPC is a faster alternative to REST that uses binary data (protobuf) instead of JSON and HTTP/2 instead of HTTP/1.1. It’s ideal for communication between microservices where speed and efficiency matter more than human-readability. The proto file serves as both the API contract and the source for auto-generated client/server code. Use gRPC internally between services, REST externally for public APIs. Many large systems (Google, Netflix, Square) use both — gRPC inside, REST outside.


32

Polling, Long Polling, and Server-Sent Events

intermediate 2-4 YOE polling long-polling SSE real-time server-sent-events

Not every real-time feature needs WebSockets. Sometimes we just need the server to push updates to the client — new notifications, live scores, order status changes. Before reaching for WebSockets, we should consider simpler alternatives: polling, long polling, and Server-Sent Events. Each has its sweet spot.

Short Polling

The simplest approach. The client asks the server “any updates?” at regular intervals.

// Client polls every 5 seconds
setInterval(async () => {
  const response = await fetch('/api/notifications');
  const data = await response.json();
  if (data.length > 0) {
    showNotifications(data);
  }
}, 5000);

Think of it like checking our mailbox every 5 minutes. Most of the time, there’s nothing new. But we keep walking to the mailbox and back anyway.

Pros:

  • Dead simple to implement
  • Works everywhere (it’s just regular HTTP)
  • Stateless — server doesn’t track connections

Cons:

  • Wasteful — most requests return “nothing new”
  • Delay between events — if something happens right after we poll, we don’t know for 5 seconds
  • Scales poorly — 10,000 clients polling every 5 seconds = 2,000 requests per second for nothing

Long Polling

A smarter version of polling. The client sends a request, and the server holds it open until there’s new data. When data arrives, the server responds. The client immediately sends another request.

// Client long-polls
async function longPoll() {
  try {
    const response = await fetch('/api/notifications/subscribe');
    const data = await response.json();
    showNotifications(data);
  } catch (err) {
    // wait a bit before retrying on error
    await new Promise(r => setTimeout(r, 3000));
  }
  // immediately poll again
  longPoll();
}

longPoll();

Think of it like calling a restaurant and asking “is my table ready?” Instead of them saying “no” and hanging up (short polling), they say “hold on, I’ll let you know when it’s ready” and keep us on the line.

Pros:

  • Near real-time — data arrives as soon as the server has it
  • No wasted requests — each request results in data
  • Works through firewalls and proxies (it’s still HTTP)

Cons:

  • Server holds open connections — each waiting client ties up server resources
  • Not truly bidirectional — client can’t push data while waiting
  • Timeout handling — if the server holds too long, proxies/load balancers may kill the connection (typically 30-60 second timeout)

Server-Sent Events (SSE)

SSE is a standardized way for the server to push updates to the client over a single, long-lived HTTP connection. The client opens the connection once, and the server sends events whenever it wants.

// Client — SSE is built into browsers
const eventSource = new EventSource('/api/notifications/stream');

eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);
  showNotification(data);
};

eventSource.onerror = () => {
  console.log('Connection lost, browser will auto-reconnect');
};

Server side (Node.js):

app.get('/api/notifications/stream', (req, res) => {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
  });

  // Send a notification whenever we have one
  const sendEvent = (data) => {
    res.write(`data: ${JSON.stringify(data)}\n\n`);
  };

  // Subscribe to notification events
  notificationEmitter.on('notify', sendEvent);

  req.on('close', () => {
    notificationEmitter.off('notify', sendEvent);
  });
});

The server sends data in a specific format:

data: {"type": "notification", "message": "New follower!"}

data: {"type": "notification", "message": "Your order shipped!"}

event: heartbeat
data: ping

Pros:

  • Built into browsers — EventSource API with automatic reconnection
  • Simple — just HTTP with a special content type
  • Efficient — single connection, server pushes when ready
  • Auto-reconnect — the browser handles reconnection automatically

Cons:

  • One-directional — server to client only. Client can’t send data over the same connection.
  • Connection limits — browsers limit SSE connections per domain (6 in HTTP/1.1, more in HTTP/2)
  • Text only — no binary data (but we can base64 encode if needed)

Comparison: All Four Approaches

Timeline: How Each Approach Works
Short Polling
Client: req→ ←(empty)   req→ ←(empty)   req→ ←(DATA!)   req→ ←(empty)
Most requests return nothing. Simple but wasteful.
Long Polling
Client: req→ ........←(DATA!)   req→ ..←(DATA!)   req→ .......
Server holds until data arrives. Fewer wasted requests.
Server-Sent Events
Client: req→ ═══ ←(DATA!) ═══ ←(DATA!) ═══ ←(DATA!) ═══
One connection. Server pushes whenever. Client can't send.
WebSocket
Client: upgrade→ ═ →msg ←msg ←msg →msg ←msg →msg ═
One connection. Both sides send anytime. Full bidirectional.
FeatureShort PollingLong PollingSSEWebSocket
DirectionClient → ServerClient → ServerServer → ClientBoth ways
LatencyUp to poll intervalNear real-timeReal-timeReal-time
ComplexityVery lowLowLowMedium-High
Server resourcesLow per requestMedium (held connections)Medium (held connections)Medium (held connections)
Browser supportEverywhereEverywhereAll modern browsersAll modern browsers
Auto-reconnectN/AManualBuilt-inManual
Binary dataYesYesNo (text only)Yes
ScalabilityPoor (wasted requests)OKGoodGood
Firewall friendlyYesYesYesSometimes no

When to Use Each

Short Polling — When updates are infrequent and slight delay is fine. Weather data, email checking. Dead simple to implement.

Long Polling — When we need near real-time but can’t use WebSocket or SSE. Good fallback. Used by older chat systems.

SSE — When we only need server-to-client push. Live notifications, news feeds, stock tickers, progress bars, live logs. This is the underrated option — it’s simpler than WebSocket and handles 90% of “real-time” use cases.

WebSocket — When we need full bidirectional communication. Chat, multiplayer games, collaborative editing. Worth the extra complexity only when both sides need to send data frequently.

A Real-World Decision Tree

Do we need real-time updates?
  ├─ No → Regular HTTP (REST)
  └─ Yes → Does the client need to send data too?
      ├─ No → SSE (simplest real-time option)
      └─ Yes → How often does the client send?
          ├─ Rarely → SSE + regular HTTP POST for client messages
          └─ Frequently → WebSocket

The often-overlooked option is SSE + REST. We use SSE for the server pushing updates and regular POST requests when the client needs to send something. This gives us 80% of WebSocket functionality with way less complexity.

Key Takeaway

In simple language, polling is the client repeatedly asking “anything new?” — simple but wasteful. Long polling improves this by having the server wait until there IS something new. SSE opens a one-way channel where the server pushes updates as they happen. WebSocket opens a two-way channel for full real-time communication. Start with the simplest option that meets our needs — often that’s SSE, not WebSocket.


Advanced Patterns

33

Rate Limiting

intermediate 2-4 YOE rate-limiting API token-bucket throttling system-design

Rate limiting is controlling how many requests a client can make to our system in a given time period. Think of it like a bouncer at a club — only so many people can get in per hour, no matter how eager they are.

Every production system needs it. Without rate limiting, one misbehaving client (or an attacker) can overwhelm our servers and ruin the experience for everyone else.

Why We Need It

  • Prevent abuse — Stop DDoS attacks and brute-force login attempts
  • Protect resources — Our servers, databases, and third-party APIs have finite capacity
  • Fair usage — Make sure one noisy client doesn’t hog everything
  • Cost control — If we’re paying per API call to a downstream service, we don’t want runaway costs
  • Stability — Graceful degradation is better than a total crash

Rate Limiting Algorithms

There are five main algorithms. Each has tradeoffs.

1. Token Bucket

The most popular algorithm. Used by AWS, Stripe, and most API gateways.

Imagine a bucket that holds tokens. A refiller adds tokens at a fixed rate (say 10 per second). Each request takes one token. If the bucket is empty, the request is rejected. The bucket has a max capacity, so tokens don’t accumulate forever.

Token Bucket Algorithm
Refiller (10 tokens/sec)
Bucket (max: 20)
T T T T T T    
6 tokens available
Request arrives
Has token? → Allow
Empty? → Reject (429)

Why it’s popular: It allows bursts (up to the bucket size) while maintaining a steady average rate. Simple to implement and memory efficient.

2. Leaking Bucket

Similar to token bucket, but requests go INTO a bucket (a FIFO queue) and leak out at a fixed rate. If the bucket overflows, new requests are dropped.

The only difference from token bucket: leaking bucket enforces a perfectly smooth outflow rate. Token bucket allows bursts. Leaking bucket doesn’t.

Good for: When we need a perfectly steady processing rate (like sending API calls to a strict third-party rate limit).

3. Fixed Window Counter

Divide time into fixed windows (say 1-minute windows). Count requests in each window. If the count exceeds the limit, reject.

Window: 10:00-10:01 → 95 requests (limit: 100) ✓
Window: 10:01-10:02 → 45 requests (limit: 100) ✓

Problem: Burst at the boundary. If 90 requests come at 10:00:55 and 90 more at 10:01:05, that’s 180 requests in 10 seconds even though the limit is 100/minute. Both windows pass individually, but the actual rate is way over.

4. Sliding Window Log

Keep a timestamped log of every request. When a new request arrives, remove entries older than the window size, then count what’s left.

Fixes the boundary burst problem. But it’s memory-hungry — we’re storing a timestamp for every single request.

5. Sliding Window Counter

A hybrid of fixed window and sliding window log. We keep counts for the current and previous windows, then calculate a weighted count based on where we are in the current window.

Previous window: 84 requests
Current window (40% through): 36 requests
Weighted count: 84 × 0.6 + 36 = 86.4 → Under 100, allow!

Best of both worlds: Smooth like sliding window, memory-efficient like fixed window. This is what most real implementations use.

HTTP Headers

When we rate limit, we should tell clients what’s going on. Standard headers:

HeaderMeaning
X-RateLimit-LimitMax requests allowed in the window
X-RateLimit-RemainingHow many requests the client has left
X-RateLimit-ResetWhen the window resets (Unix timestamp)
Retry-AfterHow many seconds to wait before trying again (sent with 429)

The HTTP status code for “you’ve been rate limited” is 429 Too Many Requests.

Where to Implement

API Gateway level — The most common place. Rate limit before requests even hit our services. Tools like Kong, NGINX, and AWS API Gateway have this built in.

Middleware level — Inside the application. More flexible — we can rate limit based on user tier, endpoint, or any custom logic.

Per-service level — Each microservice protects itself. Useful when different services have different capacity limits.

In practice, most systems do it at the API gateway with additional per-service limits for extra protection.

Client-Side vs Server-Side

Server-side rate limiting is what we’ve been discussing — the server decides when to reject requests. This is the important one. We never trust the client.

Client-side rate limiting is when the client self-throttles (e.g., backing off after receiving a 429). It’s a nice optimization — reduces wasted requests — but it’s not a security mechanism. A malicious client can just ignore it.

Good API clients implement exponential backoff: wait 1s, then 2s, then 4s, then 8s between retries. This prevents a stampede when the server recovers.

Rate Limiting in Distributed Systems

When we have multiple servers, we need a shared counter. Each server can’t just keep its own count — a client could hit different servers and bypass the limit.

The solution: use a centralized store like Redis. Redis is fast enough for this (single-digit millisecond lookups), and commands like INCR and EXPIRE make implementing rate limiters straightforward.

INCR user:123:rate_limit    → 1
EXPIRE user:123:rate_limit 60  → expires in 60s

In simple language, rate limiting is putting a speed limit on our API. Without it, one bad actor can crash the party for everyone. Token bucket is the go-to algorithm — it’s simple, allows bursts, and is easy to implement with Redis. Always return proper HTTP headers so good clients know where they stand.


34

Advanced Caching Patterns

advanced 4-7 YOE caching Redis cache-aside write-through distributed-cache system-design

We covered the basics of caching earlier. Now let’s go deeper into the different caching patterns and when to use each one. The difference between a well-cached system and a poorly-cached system isn’t whether we use a cache — it’s which pattern we pick.

The Five Caching Patterns

Caching Patterns Overview
Cache-Aside
App → Cache?
Miss → App → DB
App → Cache (fill)
App manages everything
Read-Through
App → Cache?
Miss → Cache → DB
Cache fills itself
Cache manages loading
Write-Through
App → Cache → DB
Both written together
Synchronous
Always consistent
Write-Behind
App → Cache (done!)
Cache → DB (later)
Asynchronous
Fastest writes, risky
Refresh-Ahead
Cache auto-refreshes
Before TTL expires
Proactive
No cache misses for hot data

Cache-Aside (Lazy Loading)

The most common pattern. The application talks to the cache and the database directly. The cache doesn’t know the database exists.

Read path:

  1. App checks the cache
  2. Hit — return the cached value
  3. Miss — app reads from DB, writes the value into cache, then returns it

Write path:

  1. App writes to the database
  2. App deletes (invalidates) the cache entry
  3. Next read will re-populate the cache

Why it’s popular: Simple to implement. Cache only holds data that’s actually been requested. If the cache goes down, the app still works (just slower).

The catch: The first request for any key is always a cache miss. And there’s a small window between the DB write and cache invalidation where we serve stale data.

Read-Through

Looks similar to cache-aside from the app’s perspective, but the only difference is: the cache itself is responsible for loading data from the database on a miss. The app only ever talks to the cache.

Think of it like a library. We ask the librarian (cache) for a book. If they don’t have it, they go fetch it from the warehouse (DB) — we just wait.

Pros: Simpler application code. The data-loading logic lives in one place (the cache layer). Cons: Cache needs to know how to query the database. Cache failure means no data access at all.

Write-Through

Every write goes to the cache AND the database at the same time, synchronously. The cache acts as the primary interface for writes.

App writes "user:123" → Cache stores it → DB stores it → Done

Pros: Cache is always in sync with the DB. No stale data. Pairs perfectly with read-through — together they give us a fully consistent cache layer.

Cons: Higher write latency (two writes per operation). We end up caching data that might never be read, wasting memory.

Write-Behind (Write-Back)

Write to the cache immediately, then asynchronously flush to the database in the background. The app gets a fast response because it only waits for the cache write.

App writes "user:123" → Cache stores it → Returns immediately
                          ↓ (async, batched)
                          DB stores it later

Pros: Super fast writes. We can batch multiple writes into one DB operation, reducing database load.

Cons: If the cache crashes before flushing to DB, we lose data. This is a real risk. Use this only when speed matters more than durability.

Good for: Write-heavy workloads like analytics counters, view counts, or gaming leaderboards where losing a few seconds of data is acceptable.

Refresh-Ahead

The cache proactively refreshes entries that are about to expire. If an entry’s TTL is 60 seconds, the cache might refresh it at the 50-second mark — before anyone asks for it.

Pros: Hot data never experiences a cache miss. Users get consistently fast responses.

Cons: We’re refreshing data that might not be needed. If we predict wrong, we waste resources refreshing data nobody’s reading.

Good for: Data that’s read very frequently and is expensive to compute (dashboards, popular product pages, trending feeds).

The Cache Stampede Problem

Also called the thundering herd. Here’s the scenario:

  1. A popular cache key expires
  2. 1000 requests come in simultaneously for that key
  3. All 1000 requests see a cache miss
  4. All 1000 requests hit the database at the same time
  5. The database collapses under the load

This is a real problem for hot keys with millions of reads.

Solutions

Locking (Mutex): When a cache miss happens, the first request acquires a lock, fetches from DB, and populates the cache. All other requests wait for the lock to release, then read from cache.

Early expiration (Staggered TTL): Add a random jitter to TTL values so not all keys expire at the same time. Instead of all keys expiring at exactly 60s, they expire between 55-65s.

Refresh-ahead: Proactively refresh before expiry, so the key never actually expires for readers.

Never expire + background refresh: Set keys to never expire. A background job periodically refreshes them. Readers always get a cache hit (possibly slightly stale).

Distributed Caching

When our app runs on multiple servers, a local in-memory cache (like a HashMap) on each server has problems — different servers have different data, and when a server restarts, its cache is gone.

The solution: a shared distributed cache that all servers read from and write to.

Redis Cluster

Redis is the go-to for distributed caching. Redis Cluster splits data across multiple Redis nodes using hash slots (16,384 slots total). Each node owns a range of slots. If one node fails, a replica takes over.

Memcached

Simpler than Redis. Pure key-value store. Uses consistent hashing to distribute keys across multiple servers. Slightly faster than Redis for simple get/set operations because it does less.

When to Use Which

FeatureRedisMemcached
Data structuresLists, sets, sorted sets, hashesSimple key-value only
PersistenceOptional disk persistenceNone
ReplicationBuilt-inNone
Pub/SubYesNo
Best forFeature-rich caching, sessions, leaderboardsSimple, high-throughput caching

In simple language, caching patterns are about who loads the data, who writes the data, and when. Cache-aside is the safe default — we manage everything ourselves. Read-through and write-through let the cache do more work. Write-behind is fast but risky. And always plan for the stampede — because when a hot key expires, the thundering herd is coming.


35

Search and Indexing

advanced 4-7 YOE search Elasticsearch inverted-index full-text-search system-design

We’ve all used search. Type a few words, get results in milliseconds. But have we ever thought about how it works under the hood? Regular databases are terrible at this. Let’s see why, and what we use instead.

If we want to find all products containing the word “wireless” in their description, a SQL LIKE '%wireless%' query seems fine. But here’s the problem:

  • It does a full table scan — checks every single row
  • No index can help with a leading wildcard (%wireless%)
  • Case sensitivity, typos, synonyms, relevance ranking — none of that works
  • With millions of rows, this takes seconds. Users expect milliseconds.

Databases are built for structured queries (find user where id = 5). Search engines are built for unstructured text queries (find documents that best match “wireless bluetooth headphones”).

What Is an Inverted Index?

This is the core data structure behind every search engine. Think of it like the index at the back of a textbook — instead of reading every page to find “photosynthesis,” we look it up in the index and go straight to pages 47, 112, and 203.

An inverted index maps every word to the list of documents that contain it.

Inverted Index
Documents:
Doc 1: "fast red car"
Doc 2: "fast blue bike"
Doc 3: "red bike sale"
Inverted Index:
"fast" → [Doc 1, Doc 2]
"red" → [Doc 1, Doc 3]
"car" → [Doc 1]
"blue" → [Doc 2]
"bike" → [Doc 2, Doc 3]
"sale" → [Doc 3]
Search "fast red" → intersect [Doc 1, Doc 2] ∩ [Doc 1, Doc 3] = Doc 1

When someone searches “fast red,” we look up both words in the index, find the intersection of document lists, and return the results. No scanning required. This is why search engines are fast.

How Elasticsearch Works

Elasticsearch (ES) is the most popular search engine for backend systems. It’s built on Apache Lucene, which does the actual indexing and searching. ES wraps Lucene with a distributed system, REST API, and cluster management.

Key Concepts

Index — A collection of related documents. Think of it like a database table. An e-commerce site might have a products index and a reviews index.

Document — A single record in an index. It’s a JSON object. Each product is one document.

Shard — An index is split into shards for horizontal scaling. Each shard is a self-contained Lucene index. A products index with 5 shards distributes data across the cluster.

Replica — A copy of a shard on a different node. Replicas give us fault tolerance and allow read scaling (searches can hit replicas too).

How Data Gets Indexed

When we add a document to ES:

  1. The text is passed through an analyzer
  2. The analyzer tokenizes it — breaks text into individual terms (“Wireless Bluetooth Headphones” becomes [“wireless”, “bluetooth”, “headphones”])
  3. Tokens are normalized — lowercased, stemmed (“running” becomes “run”), stop words removed (“the”, “is”, “a”)
  4. The resulting terms go into the inverted index

This is why ES can find “running shoes” when we search for “run shoe” — the analyzer reduced both to the same root form.

Relevance Scoring

Not all results are equal. ES ranks them by relevance using a scoring algorithm (BM25 by default). The score depends on:

  • Term Frequency (TF) — How often does the term appear in this document? More = more relevant.
  • Inverse Document Frequency (IDF) — How rare is this term across all documents? Rare terms are more meaningful. “the” appears everywhere and tells us nothing. “Elasticsearch” is specific.
  • Field length — A match in a short title is more significant than a match in a long description.

When to Use a Search Engine

Use a search engine when:

  • We need full-text search across large amounts of text
  • Users expect typo tolerance, synonyms, and relevance ranking
  • We’re building product search, log analysis, or autocomplete
  • We need aggregations on text data (faceted search, like filtering by category)

Stick with the database when:

  • We’re doing exact lookups (get user by ID)
  • The dataset is small (a few thousand rows)
  • We only need simple prefix matching (PostgreSQL’s LIKE 'prefix%' uses indexes just fine)
  • We don’t want to maintain another system

Common Use Cases

Product search — The classic. Users search “blue running shoes size 10” and we need to match across multiple fields (name, description, category, size) with relevance ranking. Elasticsearch shines here.

Log analysis — Centralize logs from hundreds of servers. The ELK stack (Elasticsearch, Logstash, Kibana) is the standard for log search and visualization.

Autocomplete — As the user types “wire…”, we suggest “wireless headphones”, “wireless charger”, etc. ES has dedicated completion and search_as_you_type field types for this.

Geosearch — Find restaurants within 5km. ES supports geo queries natively with geo_point fields and distance filters.

Keeping Search in Sync with the Database

The database is still the source of truth. ES is a secondary index. We need to keep them in sync.

Dual write — Write to both DB and ES. Simple but risky — if one write fails, they’re out of sync.

Change Data Capture (CDC) — Listen to database changes (like Debezium reading the Postgres WAL) and stream them to ES. More reliable. The database doesn’t even know ES exists.

Periodic sync — A background job re-indexes data from DB to ES every few minutes. Simple but introduces lag.

CDC is the best approach for production systems. It’s reliable, near real-time, and decoupled.

In simple language, search engines use inverted indexes — maps from words to documents — to find things fast. Regular databases scan through rows; search engines look up words directly. Elasticsearch gives us full-text search, relevance ranking, and typo tolerance out of the box. We use it alongside our database, not instead of it.


36

Event Sourcing and CQRS

advanced 4-7 YOE event-sourcing CQRS event-store projections system-design

Most systems store the current state of things. A user’s balance is $500. An order status is “shipped.” But what if instead of storing the final answer, we stored every event that got us there? That’s event sourcing. And when we combine it with separate models for reading and writing, that’s CQRS.

Traditional CRUD vs Event Sourcing

In a traditional system, when a user deposits $100 into their account, we just update the balance:

UPDATE accounts SET balance = balance + 100 WHERE user_id = 123;

The old balance is gone. We overwrote it. If someone asks “what happened at 3pm yesterday?” — we can’t answer that from the database.

With event sourcing, we store the event instead:

{ event: "MoneyDeposited", userId: 123, amount: 100, timestamp: "..." }

The current balance isn’t stored. We calculate it by replaying all events. Deposit $500, withdraw $200, deposit $100 — replay them and we get $400.

Traditional CRUD vs Event Sourcing
CRUD (stores state)
accounts table:
user_id: 123
balance: $400
Only the final state. History is lost.
Event Sourcing (stores events)
event store:
+$500  AccountOpened
-$200  MoneyWithdrawn
+$100  MoneyDeposited
Replay → balance = $400

Why Event Sourcing Is Powerful

Complete audit trail — Every change is recorded. We can answer “what was the balance at 3pm on Tuesday?” by replaying events up to that timestamp. For banking, healthcare, and compliance, this is huge.

Replay and rebuild — Found a bug in our billing logic? Fix it, replay all events, and recalculate. We can rebuild the entire state from scratch.

Temporal queries — “Show me the state of this order 2 days ago.” With CRUD, that’s impossible unless we built time travel. With event sourcing, we just replay events up to that point.

Event-driven architecture — Events become the backbone. Other services can subscribe to events and react. “OrderPlaced” can trigger inventory update, email notification, and analytics — all independently.

Debugging — When something goes wrong, we have a complete history of exactly what happened and in what order.

The Event Store

Events are stored in an event store — an append-only log. Events are never modified or deleted. This is important. The immutability is what makes the audit trail trustworthy.

Each event typically has:

  • Event type — “OrderPlaced”, “ItemShipped”, “PaymentReceived”
  • Aggregate ID — Which entity this event belongs to (order_id: 456)
  • Payload — The event data (items, amounts, addresses)
  • Timestamp — When it happened
  • Version — Sequence number for ordering

Tools like EventStoreDB, Apache Kafka (as an event log), or even a regular database table with an append-only constraint can serve as an event store.

Projections (Materialised Views)

Replaying thousands of events for every read request would be way too slow. So we build projections — pre-computed read models that are kept up to date by processing events.

Think of it like this: the event store is the source of truth. Projections are pre-built answers to common questions.

Events:  OrderPlaced → ItemAdded → ItemAdded → PaymentReceived → OrderShipped

Projection 1 (Order Summary):  { orderId: 456, items: 2, status: "shipped", total: $89 }
Projection 2 (Revenue Report):  { date: "2025-03-15", revenue: $12,450 }

When a new event arrives, the relevant projections are updated. Reads are fast because they hit the projection directly — no replay needed.

What Is CQRS?

CQRS stands for Command Query Responsibility Segregation. In simple language, it means using a different model for reading data than we use for writing data.

In a typical app, the same database model handles both reads and writes. CQRS splits them apart:

  • Command side (writes) — Receives commands like “PlaceOrder” or “UpdateProfile.” Validates business rules and produces events or state changes.
  • Query side (reads) — Serves read requests using optimised read models (projections). Can use a completely different database or schema.

Why separate them? Because reads and writes have very different needs:

  • Writes need strong consistency, validation, and business rules
  • Reads need speed, and the shape of data we read is often different from how we store it

How Event Sourcing + CQRS Work Together

They’re a natural pair:

  1. A command comes in (“PlaceOrder”)
  2. The command handler validates it and produces events (“OrderPlaced”, “InventoryReserved”)
  3. Events are saved to the event store (write side)
  4. Event handlers update projections (read side)
  5. Read requests query the projections

The write side doesn’t care about read performance. The read side doesn’t care about business rules. Each is optimised for its job.

Real-World Examples

Banking — Every deposit, withdrawal, and transfer is an event. The balance is a projection. Regulators love the complete audit trail.

Shopping cart — ItemAdded, ItemRemoved, QuantityChanged, CouponApplied. We can replay to see exactly how the cart evolved. If a bug miscalculated a discount, we fix the logic and replay.

Version control (Git) — Git is basically event sourcing. Commits are events. The current state of the code is a projection. We can go back to any point in time.

When to Use It

Good fit:

  • Systems that need a complete audit trail (finance, healthcare, legal)
  • Complex business domains with many state transitions
  • Systems where replaying history has real value
  • High-read, low-write scenarios where CQRS shines

Bad fit:

  • Simple CRUD apps — event sourcing adds enormous complexity for little benefit
  • Small projects or MVPs — we’re adding infrastructure overhead we don’t need yet
  • Systems where eventual consistency between write and read models is unacceptable
  • Teams unfamiliar with the pattern — the learning curve is steep

The Tradeoffs

Event sourcing and CQRS are powerful but come with real costs:

  • Complexity — Significantly more moving parts than simple CRUD
  • Eventual consistency — Read models lag behind writes. We have to be OK with that
  • Event schema evolution — Events are immutable, but our schema will change over time. We need a versioning strategy
  • Storage growth — We never delete events, so storage grows forever. Snapshots help (save the current state periodically so we don’t replay from the beginning)

In simple language, event sourcing says “don’t store the answer, store the math.” Instead of saving “balance is $400,” we save every deposit and withdrawal. CQRS says “use different models for reading and writing because they have different needs.” Together, they give us an auditable, replayable, flexible system — but only if the complexity is worth it.


37

Distributed Consensus

advanced 7+ YOE distributed-systems consensus Raft leader-election ZooKeeper etcd

Here’s a problem that sounds simple but is incredibly hard: how do a bunch of servers agree on something? Which node is the leader? Is this transaction committed? What’s the current configuration? When networks are unreliable and servers can crash at any moment, getting everyone on the same page is one of the hardest problems in computer science.

That’s what distributed consensus solves.

The Problem

Imagine we have 5 database replicas. A write comes in. We need all (or most) of them to agree that the write happened. Sounds easy — just tell all 5, right?

But what if:

  • Node 3 is temporarily unreachable (network partition)
  • Node 5 crashed and restarted with stale data
  • Two nodes both think they’re the leader
  • Messages arrive out of order

Without a consensus protocol, we’d end up with nodes disagreeing about the state of the world. One node thinks the value is “A,” another thinks it’s “B.” That’s data corruption.

Leader Election

Most consensus systems work by electing a leader. One node is in charge. It receives all writes, decides the order of operations, and tells followers what to do. If the leader dies, a new one is elected.

Why not just let any node accept writes? Because coordinating writes across multiple nodes simultaneously is much harder than funneling everything through one node that makes decisions.

The tricky part: electing a leader that everyone agrees on, especially when the network is flaky.

Raft Consensus Algorithm

Raft was designed to be understandable (unlike Paxos, the OG consensus algorithm that’s famously confusing). Most modern systems use Raft or something based on it.

The Three States

Every node in a Raft cluster is in one of three states:

Raft: Leader Election Flow
Follower
Passive. Listens to leader.
Candidate
Requesting votes.
Leader
In charge. Sends heartbeats.
Follower —(timeout, no heartbeat)→ Candidate
Candidate —(gets majority votes)→ Leader
Candidate —(loses or times out)→ Follower
Leader —(discovers higher term)→ Follower

How Election Works (Simplified)

  1. All nodes start as followers
  2. Followers expect regular heartbeats from the leader
  3. If a follower doesn’t hear from the leader for a while (election timeout), it assumes the leader is dead
  4. It becomes a candidate and starts a new term (election round)
  5. It votes for itself and asks other nodes for their vote
  6. If it gets votes from a majority (quorum), it becomes the new leader
  7. The new leader starts sending heartbeats, and everyone falls in line

The election timeout is randomised (e.g., 150-300ms) so that not all followers become candidates at the same time. This avoids vote splitting.

How Writes Work

Once we have a leader:

  1. Client sends a write to the leader
  2. Leader appends it to its log
  3. Leader sends the entry to all followers
  4. Followers append it and acknowledge
  5. Once a majority acknowledges, the entry is committed
  6. Leader applies it and responds to the client

The key: writes are committed only when a majority of nodes have the data. Even if some nodes are down, as long as the majority is alive, the system works.

The Split-Brain Problem

This is the nightmare scenario. Two nodes both think they’re the leader. Both accept writes. Now we have conflicting data and no way to merge it.

How does this happen? Network partitions. If the network splits a 5-node cluster into a group of 3 and a group of 2, both groups might try to elect a leader.

Quorum Prevents Split-Brain

The solution is the quorum — a majority requirement. To do anything (elect a leader, commit a write), we need agreement from more than half the nodes.

In a 5-node cluster, the quorum is 3. If the network splits into groups of 3 and 2:

  • The group of 3 can elect a leader (3 >= quorum)
  • The group of 2 cannot elect a leader (2 < quorum)

There can only ever be one group with a majority. Math prevents split-brain.

This is why consensus clusters use odd numbers of nodes (3, 5, 7). With 4 nodes and a 2-2 split, neither side has a majority and the system is stuck.

Fault Tolerance

A Raft cluster of N nodes can tolerate (N-1)/2 failures:

Cluster SizeQuorumTolerated Failures
321
532
743

Going from 3 to 5 nodes gives us tolerance for one more failure. Going from 5 to 7 gives another. But more nodes means more coordination overhead. Most production systems use 3 or 5 nodes.

Where Consensus Is Used

Distributed databases — CockroachDB, TiDB, and YugabyteDB use Raft to replicate data across nodes. Every write is committed through consensus.

Configuration management — etcd (used by Kubernetes) and ZooKeeper store cluster configuration. When config changes, all nodes must agree on the new value.

Leader election for other services — Services that need exactly one active instance (like a scheduler) use consensus to pick the leader. If the leader dies, a new one is elected.

Distributed locks — Only one service can hold a lock at a time. Consensus ensures everyone agrees on who holds it.

Tools That Implement Consensus

etcd

A distributed key-value store that uses Raft. It’s the backbone of Kubernetes — all cluster state (pods, services, configs) lives in etcd. Written in Go. Simple API. Very reliable.

ZooKeeper

The older, battle-tested option. Used by Kafka (older versions), Hadoop, and HBase. Uses a protocol called ZAB (ZooKeeper Atomic Broadcast), which is similar to Raft. Written in Java. More complex to operate than etcd.

Consul

By HashiCorp. Uses Raft. Focuses on service discovery and health checking in addition to key-value storage. Often used in microservice architectures.

Consensus vs Eventual Consistency

Consensus gives us strong consistency — all nodes agree before we respond. The tradeoff is latency. Every write requires a network round trip to the majority.

Eventual consistency (like DynamoDB or Cassandra) is the opposite — writes are fast because we don’t wait for agreement. Nodes will eventually converge. Faster, but we might read stale data.

Most systems mix both. Use consensus for the things that absolutely must be correct (leader election, config, financial transactions). Use eventual consistency for things where speed matters more (feeds, analytics, caches).

In simple language, distributed consensus is how servers vote on the truth. Raft elects a leader, the leader coordinates writes, and everything needs a majority to proceed. The quorum rule guarantees we can never have two conflicting leaders. It’s slower than just writing to one server, but it’s the price we pay for systems that survive failures without losing data.


Real System Design Questions

38

Design a URL Shortener (TinyURL)

intermediate 2-4 YOE url-shortener system-design encoding caching

This is probably the most classic system design interview question. We’re designing a URL shortening service like TinyURL or bit.ly — take a long URL, give back a short one, and when someone visits the short URL, redirect them to the original. Simple concept, but the details get interesting fast.

Let’s walk through it step by step, the way we’d do it in an actual interview.

Step 1: Requirements

Functional Requirements

  • Given a long URL, generate a unique short URL
  • When a user visits the short URL, redirect them to the original long URL
  • Users can optionally set a custom short link
  • Links expire after a default period (configurable)
  • Analytics — track how many times a short URL was clicked

Non-Functional Requirements

  • High availability — the redirect service can’t go down, or millions of links break
  • Low latency — redirects should feel instant (< 50ms)
  • Read-heavy — way more people click short links than create them (100:1 read/write ratio)
  • Short URLs should be as short as possible
  • URLs should not be predictable (so people can’t guess other URLs)

Step 2: Estimation

Let’s put some numbers on this.

Assumptions:

  • 500M new URLs created per month
  • 100:1 read to write ratio → 50B redirects per month

QPS:

Write QPS = 500M / (30 × 86,400) ≈ ~200 URLs/sec
Read QPS  = 200 × 100 = ~20,000 redirects/sec
Peak QPS  = ~40,000 redirects/sec (2x average)

Storage (5 years):

Each URL record ≈ 500 bytes (short URL + long URL + metadata)
500M × 12 months × 5 years = 30 Billion URLs
30B × 500 bytes = 15 TB

Cache:

Following the 80/20 rule — 20% of URLs generate 80% of traffic.
Daily read requests = 50B / 30 ≈ 1.7B/day
Cache 20% of daily data = 1.7B × 0.2 × 500 bytes ≈ 170 GB

170 GB fits comfortably in a few Redis instances. Nice.

Step 3: High-Level Design

URL Shortener — High-Level Architecture
Clients (Browser / Mobile)
Load Balancer
App Server 1 App Server 2 App Server N
Cache (Redis)
hot URLs
Database
all URLs
Write flow: Client → LB → App Server → DB (generate short URL)
Read flow: Client → LB → App Server → Cache (hit?) → DB → 301 Redirect

How the redirect works:

  1. User visits short.url/abc123
  2. Load balancer routes to an app server
  3. App server checks Redis cache for abc123
  4. Cache hit → return the long URL. Cache miss → query DB, put in cache, return.
  5. Server responds with a 301 (permanent redirect) or 302 (temporary redirect)

301 vs 302 — which do we pick?

  • 301 (Moved Permanently) — The browser caches the redirect. Next time, it goes directly to the long URL without hitting our server. Better for the user, but we lose analytics visibility.
  • 302 (Found/Temporary) — The browser always comes back to our server first. We can track every click. More load on our server, but we keep full analytics.

If analytics matter (and for a URL shortener, they do), we go with 302.

Step 4: API Design

POST /api/v1/shorten
Body: { "long_url": "https://example.com/very/long/path", "custom_alias": "my-link", "expires_at": "2026-12-31" }
Response: { "short_url": "https://short.url/abc123", "expires_at": "2026-12-31" }

GET /{short_url_key}
Response: HTTP 302 Redirect → Location: https://example.com/very/long/path

GET /api/v1/stats/{short_url_key}
Response: { "total_clicks": 15420, "created_at": "2025-01-15", "long_url": "..." }

Step 5: Data Model

-- Main URL table
CREATE TABLE urls (
    id          BIGINT PRIMARY KEY AUTO_INCREMENT,
    short_key   VARCHAR(7) UNIQUE NOT NULL,   -- the "abc123" part
    long_url    TEXT NOT NULL,
    user_id     BIGINT,                        -- nullable for anonymous users
    created_at  TIMESTAMP DEFAULT NOW(),
    expires_at  TIMESTAMP,
    INDEX idx_short_key (short_key)            -- fast lookups by short key
);

-- Analytics table (append-only, write-heavy)
CREATE TABLE click_events (
    id          BIGINT PRIMARY KEY AUTO_INCREMENT,
    short_key   VARCHAR(7) NOT NULL,
    clicked_at  TIMESTAMP DEFAULT NOW(),
    ip_address  VARCHAR(45),
    user_agent  TEXT,
    referrer    TEXT,
    country     VARCHAR(2),
    INDEX idx_short_key_time (short_key, clicked_at)
);

We keep the URL table lean for fast reads. The analytics table is append-only — we’re only ever inserting into it, never updating. This is a perfect candidate for a time-series approach or a message queue that writes asynchronously.

Step 6: Deep Dives

Deep Dive 1: Short URL Generation Strategies

This is the heart of the problem. How do we turn a long URL into a short, unique key? We have three main approaches.

Approach A: Hash + Collision Check

Take the long URL, hash it (MD5 or SHA-256), and take the first 7 characters.

MD5("https://example.com/long/path") = "a1b2c3d4e5f6..."
Short key = "a1b2c3d" (first 7 chars)

Problem: collisions. Two different URLs could produce the same first 7 characters. So we check the DB — if the key exists, we append a counter and rehash. This works but the collision checks add latency and complexity.

Approach B: Auto-Increment ID + Base62 Encoding

Use a database auto-increment ID and convert it to base62 (a-z, A-Z, 0-9 = 62 characters).

ID = 123456789
Base62 = "8M0kX"   (123456789 in base 62)

With 7 characters, base62 gives us 62^7 = 3.5 trillion possible URLs. That’s plenty. The problem? URLs are predictable. If someone gets abc123, they know abc124 probably exists too. Also, the auto-increment ID becomes a single point of failure in a distributed system.

Approach C: Pre-Generated Key Service (KGS)

A separate service pre-generates millions of unique keys and stores them in a database. When an app server needs a key, it grabs one from the pool.

KGS Database:
┌─────────────┬──────────┐
│ key         │ used     │
├─────────────┼──────────┤
│ "a7Bx2q"   │ false    │
│ "k9Mw3r"   │ false    │
│ "p2Lz8n"   │ true     │  ← already assigned
└─────────────┴──────────┘

Each app server fetches a batch of keys (say 1000) and keeps them in memory. No collision checking needed. No coordination between servers. This is the most scalable approach and the one most interviewers love.

The winner: Pre-Generated Key Service. It’s clean, fast, and eliminates the collision problem entirely.

Deep Dive 2: Caching Hot URLs

Our system is extremely read-heavy (100:1). Most traffic goes to a small percentage of popular URLs. This screams “cache me.”

We put Redis between the app servers and the database. The strategy:

  1. On a redirect request, check Redis first
  2. Cache hit → return immediately (sub-millisecond)
  3. Cache miss → query DB → store in Redis with a TTL → return
  4. Use LRU eviction — when cache is full, kick out the least recently used URL

With our 170 GB estimate, we can use a Redis cluster of 3-4 nodes with replication. The cache hit rate should be 90%+ since URL access follows a power law — a few URLs get the vast majority of clicks.

Cache invalidation: When a URL expires or gets deleted, we remove it from the cache. Simple because URLs are immutable — we never update a short URL to point to a different long URL.

Deep Dive 3: Analytics and Click Tracking

Every redirect is a potential analytics event. But we can’t let analytics slow down the redirect. The redirect must be fast — analytics can be eventual.

The approach: async processing with a message queue.

User clicks → App Server sends 302 redirect immediately
           → App Server pushes click event to Kafka/SQS
           → Analytics workers consume from queue
           → Workers batch-insert into click_events table

This way, the user gets their redirect in milliseconds. The analytics data flows through a queue and gets processed in the background. If the analytics system falls behind, clicks queue up but the redirect service stays fast.

For the dashboard, we can pre-aggregate hourly/daily counts in a summary table instead of running expensive COUNT queries on billions of rows.

Step 7: Scaling

Database scaling:

  • The URL table is read-heavy → add read replicas
  • As data grows past a single DB → shard by short_key hash
  • The click_events table grows fast → partition by time (monthly partitions) and archive old data

App server scaling:

  • Stateless servers behind a load balancer → horizontally scale by adding more instances
  • Each server holds a batch of pre-generated keys in memory → no coordination needed between servers

Cache scaling:

  • Start with a single Redis instance, then move to Redis Cluster
  • Consistent hashing to distribute keys across cache nodes

Global distribution:

  • Deploy app servers in multiple regions
  • Use GeoDNS to route users to the nearest region
  • Replicate the database across regions (or use a globally distributed DB like CockroachDB)

Handling 40K QPS at peak:

  • Load balancer distributes across app servers
  • Redis handles ~90% of reads (36K QPS in cache = easy for Redis)
  • Only ~4K QPS actually hits the database
  • Each DB read is a simple key lookup by indexed short_key — blazing fast

In simple language, a URL shortener is a giant key-value store with a smart key generation strategy. The short key is the hard part — we use a pre-generated key service to avoid collisions. The redirect is the hot path — we cache it aggressively. And analytics go through a queue so they never slow down the user. That’s the whole system.


39

Design a Rate Limiter

intermediate 2-4 YOE rate-limiter system-design distributed-systems Redis API-gateway

A rate limiter controls how many requests a client can make in a given time window. Think of it like a bouncer at a club — only so many people get in per hour. If we exceed the limit, the request gets rejected with a 429 Too Many Requests.

Every major API has one. Without it, a single misbehaving client can take down our entire service.

Step 1: Requirements

Functional Requirements

  • Limit the number of requests a client can make within a time window
  • Support different rate limiting rules (e.g., 100 req/min for free tier, 1000 req/min for paid)
  • Return clear error responses when rate limited (429 status + retry-after header)
  • Support multiple limiting strategies (by user ID, IP address, API key)

Non-Functional Requirements

  • Low latency — the rate limiter sits in the request path, so it must be fast (< 1ms overhead)
  • Distributed — must work across multiple servers (not just in-memory on one box)
  • Highly available — if the rate limiter goes down, we’d rather let all requests through than block everything
  • Accurate — in a distributed setup, we shouldn’t let significantly more requests through than the limit

Step 2: Estimation

Assumptions:

  • We’re protecting an API that handles 10,000 QPS across all clients
  • 1 million unique clients (user IDs or API keys)
  • Each rate limit check should add < 1ms of latency

Storage:

Each client needs a counter + timestamp ≈ 20 bytes
1M clients × 20 bytes = 20 MB

That’s nothing. The entire state fits easily in Redis.

QPS on the rate limiter:

Every incoming request triggers a rate limit check.
10,000 QPS → 10,000 Redis operations/sec

A single Redis instance handles 100K+ operations/sec. We’re well within limits. For higher scale, we shard by client ID.

Step 3: High-Level Design

Rate Limiter — Request Flow
Client Request
API Gateway / Load Balancer
Rate Limiter Middleware
Allowed
App Servers
Rejected
429 Too Many Requests
Redis (Counters) Rules Config (DB/YAML)
Rate limiter checks Redis on every request. Rules define limits per client/endpoint.

Where does the rate limiter live?

We have a few options:

  • API Gateway — Cloud providers (AWS API Gateway, Kong) have built-in rate limiting. Easiest to set up.
  • Middleware — A thin layer in our application code, before the request hits the business logic.
  • Sidecar — A separate process alongside our app (like Envoy proxy).

For most systems, putting it at the API Gateway level is the right call. It’s centralized, handles it before the request even reaches our servers, and most gateways support it out of the box.

Step 4: API Design

The rate limiter doesn’t have its own API per se — it’s middleware. But the response headers tell clients about their rate limit status:

# On every response (allowed or rejected):
X-RateLimit-Limit: 100        # max requests allowed in the window
X-RateLimit-Remaining: 42     # requests left in current window
X-RateLimit-Reset: 1735689600 # Unix timestamp when the window resets

# On rejection:
HTTP/1.1 429 Too Many Requests
Retry-After: 30               # seconds until client should retry
Content-Type: application/json
{ "error": "Rate limit exceeded. Try again in 30 seconds." }

For configuring rate limit rules, we’d have an internal admin API:

POST /admin/rate-rules
Body: {
  "client_type": "free_tier",
  "endpoint": "/api/v1/search",
  "max_requests": 100,
  "window_seconds": 60
}

Step 5: Data Model

Rate limiters usually don’t use a traditional database. Redis is the go-to because it’s in-memory and supports atomic operations.

# Redis key pattern for counters
rate_limit:{client_id}:{endpoint}:{window}

# Example: user 42, search endpoint, per-minute window
rate_limit:user_42:/api/search:202603301430

# Value: current request count (integer)
# TTL: set to expire when the window ends

For the rules configuration:

# Rate limit rules (stored in DB or config file)
rules:
  - client_type: "free_tier"
    limits:
      - endpoint: "/api/v1/*"
        max_requests: 100
        window: 60        # seconds

  - client_type: "paid_tier"
    limits:
      - endpoint: "/api/v1/*"
        max_requests: 1000
        window: 60

  - client_type: "default"
    limits:
      - endpoint: "*"
        max_requests: 50
        window: 60

Step 6: Deep Dives

Deep Dive 1: Rate Limiting Algorithms

There are four main algorithms. Each has different tradeoffs.

1. Token Bucket

Think of a bucket that holds tokens. It starts full. Each request takes one token. The bucket refills at a steady rate. If the bucket is empty, the request is rejected.

Bucket: max 10 tokens, refill 1 token/second

Time 0:  [10 tokens] → Request → [9 tokens] ✓
Time 0:  [9 tokens]  → Request → [8 tokens] ✓
...
Time 0:  [1 token]   → Request → [0 tokens] ✓
Time 0:  [0 tokens]  → Request → REJECTED    ✗
Time 1:  [1 token]   → Request → [0 tokens] ✓  (refilled)

Pros: Allows short bursts (up to bucket size). Smooth over time. Amazon and Stripe use this. Cons: Two parameters to tune (bucket size + refill rate).

2. Sliding Window Log

We store the timestamp of every request. To check the limit, we count how many timestamps fall within the last N seconds.

Window: 3 requests per 60 seconds

Timestamps: [1:00:15, 1:00:30, 1:00:45]
New request at 1:00:50 → 3 requests in window → REJECTED
New request at 1:01:20 → remove 1:00:15 (expired) → 2 in window → ALLOWED

Pros: Very accurate — no boundary issues. Cons: Stores every timestamp, uses more memory. Not great for high limits.

3. Sliding Window Counter

A clever hybrid. We keep counters for the current and previous windows, then calculate a weighted count based on where we are in the current window.

Previous window (1:00-1:01): 84 requests
Current window  (1:01-1:02): 36 requests
We're 25% into the current window.

Weighted count = 84 × 0.75 + 36 = 99
Limit = 100 → ALLOWED (barely!)

Pros: Memory-efficient (just two counters). Smooth enough for most use cases. Cons: Not perfectly accurate — it’s an approximation. But it’s close enough.

The winner for most systems: Token Bucket or Sliding Window Counter. Token Bucket if we want to allow bursts. Sliding Window Counter if we want simplicity.

Deep Dive 2: Distributed Rate Limiting with Redis

Here’s the tricky part. Our API has multiple servers behind a load balancer. If we just count requests in memory on each server, a client can send 100 requests to Server A and 100 to Server B, effectively getting 200 when the limit is 100.

We need a centralized counter, and Redis is perfect for this.

The basic approach with the Fixed Window Counter algorithm:

# Pseudocode for each incoming request:

key = f"rate:{client_id}:{current_minute}"
count = INCR(key)              # atomic increment in Redis
if count == 1:
    EXPIRE(key, 60)            # set TTL on first request

if count > limit:
    return 429                 # rejected
else:
    forward request            # allowed

But there’s a subtle race condition. The INCR and EXPIRE are two separate commands. If our server crashes between them, the key never expires and the client is blocked forever.

The fix: use a Lua script to make it atomic.

-- Atomic rate limit check in Redis (Lua script)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('INCR', key)
if current == 1 then
    redis.call('EXPIRE', key, window)
end

if current > limit then
    return 0  -- rejected
else
    return 1  -- allowed
end

Lua scripts in Redis execute atomically — no other command can run in between. Problem solved.

For Token Bucket in Redis, we store the last refill timestamp and current token count. Each request runs a Lua script that calculates how many tokens to add since the last refill, subtracts one, and returns allow/deny.

Deep Dive 3: Client Identification and Rules

How do we identify who’s making the request? We have several options, and the right choice depends on our use case:

  • API Key — Most common for authenticated APIs. Each key maps to a tier with specific limits.
  • User ID — For logged-in users. Tied to their account regardless of IP changes.
  • IP Address — For unauthenticated requests. Problem: many users behind the same NAT/VPN share one IP.
  • Combination — Use API key when available, fall back to IP for unauthenticated requests.

Granularity matters too. We can rate limit at different levels:

Global:     1M requests/min across all clients
Per-client: 100 requests/min per API key
Per-endpoint: 10 requests/min on POST /api/v1/upload per client

Layering these gives us defense in depth. A client might be within their per-endpoint limit but hitting the global limit, and we’d still throttle them.

Rules engine: We load rules from a config file or database and cache them in memory on each server. A background job refreshes the cache every few seconds so we can update limits without redeploying.

Step 7: Scaling

Redis scaling:

  • A single Redis instance handles our 10K QPS easily
  • For higher scale, shard by client ID using Redis Cluster
  • Use Redis replicas for high availability — if the primary goes down, the replica takes over
  • If Redis is completely down, fail open (let all requests through). Briefly no rate limiting is better than blocking all requests.

Multi-region:

  • If our API is global, we need rate limiters in each region
  • Option 1: Each region has its own Redis → limits are per-region (simpler, but a client could get N times the limit across N regions)
  • Option 2: Sync counters across regions → accurate global limits but adds latency
  • Option 3: Use a global Redis with cross-region replication → eventual consistency means small over-counting

For most systems, per-region limiting is good enough. If we need strict global limits, we accept the slight inaccuracy of eventual consistency.

Handling bursty traffic:

  • Token Bucket naturally handles bursts (that’s its superpower)
  • Set the burst size to 2-3x the per-second rate for reasonable burst allowance
  • Add a request queue for soft rate limiting — instead of rejecting immediately, hold the request for a short time and retry

Monitoring and alerting:

  • Track how many requests get rate limited (by client, endpoint, rule)
  • Alert if a large percentage of requests are being rejected — might mean our limits are too low
  • Dashboard showing top rate-limited clients — helps identify abuse patterns

In simple language, a rate limiter is a counter that says “nope, too many” when a client goes too fast. The hard parts are making it distributed (Redis + Lua scripts for atomicity) and choosing the right algorithm (Token Bucket for burst tolerance, Sliding Window Counter for simplicity). Keep it fast, keep it centralized, and always fail open.


40

Design a Chat System (WhatsApp)

advanced 4-7 YOE chat messaging websocket system-design real-time

We’re designing a real-time chat system like WhatsApp or Messenger. This is an advanced question because it touches on real-time communication, message ordering, delivery guarantees, group chats, and online presence — all at massive scale.

The core challenge: how do we deliver a message from one person to another in real time, reliably, and at the scale of billions of messages per day?

Step 1: Requirements

Functional Requirements

  • One-on-one messaging between users
  • Group chats (up to 500 members)
  • Message delivery status: sent, delivered, read
  • Online/offline presence indicators
  • Push notifications for offline users
  • Media sharing (images, videos, documents)
  • Message history and persistence

Non-Functional Requirements

  • Real-time — messages should arrive in under 500ms
  • Reliable delivery — no messages should be lost, even if the recipient is offline
  • Message ordering — messages in a conversation should appear in the correct order
  • High availability — chat is a core feature, downtime is unacceptable
  • Low latency — the system should feel instant
  • Scale — support 2B users, 60B messages/day

Step 2: Estimation

Assumptions:

  • 2B total users, 500M daily active users
  • Each active user sends 40 messages/day on average
  • Average message size: 100 bytes (text), 500 KB (media — 10% of messages include media)

QPS:

Messages/day = 500M × 40 = 20B messages/day
Write QPS    = 20B / 86,400 ≈ ~230,000 messages/sec
Peak QPS     = ~500,000 messages/sec

Storage:

Text messages:  20B × 100 bytes = 2 TB/day
Media messages: 20B × 0.10 × 500 KB = 1 PB/day (media stored in object storage)
Text per year:  ~730 TB

Connections:

500M DAU = 500M concurrent WebSocket connections (at peak)
Each connection ≈ 10 KB memory on the server
500M × 10 KB = 5 TB of memory across all chat servers

That’s a lot of connections. We’ll need thousands of chat servers.

Step 3: High-Level Design

Chat System — High-Level Architecture
User A
User B
│ WebSocket
│ WebSocket
Chat Server 1 Chat Server 2 Chat Server N
Message Queue (Kafka)
Message DB
(Cassandra)
User Service
(profiles, auth)
Push Service
(APNs, FCM)
Media Service
(S3 + CDN)
1:1 flow: A → Chat Server 1 → Kafka → Chat Server 2 → B (if online) or Push Service (if offline)
Persistence: Every message gets written to the Message DB via Kafka consumers

Why WebSocket?

HTTP is request-response. The client has to ask the server “got any new messages?” over and over. That’s called polling, and it’s wasteful.

WebSocket gives us a persistent, two-way connection. The server can push a message to the client the instant it arrives. No polling, no wasted requests. This is how every modern chat app works.

The flow for a 1:1 message:

  1. User A sends a message through their WebSocket connection to Chat Server 1
  2. Chat Server 1 looks up which chat server User B is connected to (using a connection registry in Redis)
  3. If User B is online → route the message to that chat server → push to B’s WebSocket
  4. If User B is offline → send a push notification via APNs/FCM
  5. Either way, the message gets persisted to the database

Step 4: API Design

For WebSocket messages, we use a JSON-based protocol:

// Send a message
{
  "type": "send_message",
  "conversation_id": "conv_abc123",
  "content": "Hey, how are you?",
  "content_type": "text",
  "client_msg_id": "uuid-from-client"
}

// Server acknowledgment
{
  "type": "message_ack",
  "client_msg_id": "uuid-from-client",
  "server_msg_id": "msg_789",
  "status": "sent",
  "timestamp": 1735689600
}

// Receive a message (pushed by server)
{
  "type": "new_message",
  "conversation_id": "conv_abc123",
  "server_msg_id": "msg_789",
  "sender_id": "user_42",
  "content": "Hey, how are you?",
  "content_type": "text",
  "timestamp": 1735689600
}

// Delivery receipt
{
  "type": "message_status",
  "server_msg_id": "msg_789",
  "status": "delivered"   // or "read"
}

For REST endpoints (non-real-time operations):

GET  /api/v1/conversations                    — list user's conversations
GET  /api/v1/conversations/{id}/messages      — message history (paginated)
POST /api/v1/conversations                    — create a group chat
POST /api/v1/media/upload                     — get pre-signed URL for media upload

Step 5: Data Model

We need a database that handles massive write throughput and scales horizontally. Cassandra is a great fit for the messages table — it’s designed for write-heavy workloads and scales linearly.

-- Users table (PostgreSQL — relational, small)
CREATE TABLE users (
    user_id     BIGINT PRIMARY KEY,
    username    VARCHAR(50) UNIQUE,
    display_name VARCHAR(100),
    avatar_url  TEXT,
    last_seen   TIMESTAMP,
    created_at  TIMESTAMP
);

-- Conversations table (PostgreSQL)
CREATE TABLE conversations (
    conversation_id  BIGINT PRIMARY KEY,
    type            VARCHAR(10),          -- 'direct' or 'group'
    name            VARCHAR(100),         -- group name (null for direct)
    created_at      TIMESTAMP
);

-- Conversation participants (PostgreSQL)
CREATE TABLE participants (
    conversation_id  BIGINT,
    user_id         BIGINT,
    role            VARCHAR(10) DEFAULT 'member',  -- 'admin', 'member'
    joined_at       TIMESTAMP,
    PRIMARY KEY (conversation_id, user_id)
);

-- Messages table (Cassandra — write-heavy, time-series)
-- Partition key: conversation_id
-- Clustering key: message_id (TimeUUID for ordering)
CREATE TABLE messages (
    conversation_id  BIGINT,
    message_id      TIMEUUID,
    sender_id       BIGINT,
    content         TEXT,
    content_type    VARCHAR(10),    -- 'text', 'image', 'video'
    media_url       TEXT,
    created_at      TIMESTAMP,
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id ASC);

Why Cassandra for messages? Each conversation is a partition. Reading a conversation’s messages is a single partition read — super fast. Writes are append-only. And Cassandra scales horizontally by adding more nodes.

Connection registry (Redis):

# Which chat server is each user connected to?
user_connection:{user_id} → "chat-server-42:ws-connection-id"
TTL: 300 seconds (refreshed by heartbeat)

Step 6: Deep Dives

Deep Dive 1: Message Delivery and Ordering

Delivering messages reliably is harder than it sounds. What if User B’s phone is offline? What if they switch networks mid-conversation? What if two messages arrive out of order?

Message delivery flow:

Sender                    Server                    Receiver
  │                         │                          │
  │── Send message ────────>│                          │
  │<── ACK (sent) ──────────│                          │
  │                         │── Push message ─────────>│
  │                         │<── ACK (delivered) ──────│
  │<── Status: delivered ───│                          │
  │                         │       (user opens chat)  │
  │<── Status: read ────────│<── Read receipt ─────────│

Three statuses: sent (server received it), delivered (recipient’s device got it), read (recipient opened the conversation).

Ordering: Messages within a single conversation must appear in order. We use a per-conversation sequence number for this. Every message in a conversation gets an incrementing sequence number assigned by the server.

Conversation conv_abc123:
  msg_1: seq=1, "Hey"           (from A, 1:00:00)
  msg_2: seq=2, "What's up?"    (from A, 1:00:01)
  msg_3: seq=3, "Not much, you?" (from B, 1:00:05)

The client uses these sequence numbers to display messages in order. If a message arrives with seq=5 but the client only has up to seq=3, it knows seq=4 is missing and requests it from the server.

Handling offline users: When User B is offline, the server stores the message in the database and sends a push notification. When User B comes back online and reconnects via WebSocket, the server sends all undelivered messages (everything since B’s last received sequence number).

Deep Dive 2: Group Chat Fan-Out

In a 1:1 chat, we deliver the message to one person. In a group chat with 200 people, we have to deliver it to 199 people. How?

Option A: Fan-out on write (push model)

When someone sends a message to a group, the server immediately pushes it to every online member’s WebSocket connection and queues push notifications for offline members.

Alice sends "Hello everyone" to group (200 members)
→ Server looks up all 200 participants
→ For each online member: push via WebSocket
→ For each offline member: send push notification
→ Write message once to the messages table

Pros: Receivers get the message instantly. Reading is simple — just fetch from the connection. Cons: Sending a message is expensive for large groups. If 200 people are online, that’s 200 WebSocket pushes per message.

Option B: Fan-out on read (pull model)

The server just stores the message. When each member opens the group chat, they pull new messages.

Pros: Sending is fast — just one write. Cons: Opening a group chat requires a DB read. For active groups, this adds latency.

Our approach: Push for small/medium groups (up to ~500 members). WhatsApp caps groups at 1024 members for exactly this reason. For very large groups (like Slack channels with 10K+ people), we’d switch to a pull model with smart caching.

The message queue (Kafka) helps here. When a message is sent to a group, it’s published to a topic partitioned by conversation. Consumer workers read from the topic and handle the fan-out — looking up members, pushing to WebSockets, and sending push notifications.

Deep Dive 3: Online Presence and Typing Indicators

How does the app show that someone is “online” or “typing…”?

Online/offline status:

Every connected client sends a heartbeat to the server every 30 seconds. The server updates the user’s status in Redis:

presence:{user_id} → { status: "online", last_seen: 1735689600 }
TTL: 60 seconds (auto-expires if no heartbeat)

When the TTL expires without a heartbeat, the user is considered offline. We update last_seen and publish an “offline” event to their contacts.

But we don’t want to notify every contact every time someone goes online/offline. That’s too expensive at scale. Instead:

  • We only send presence updates to people who have the chat open (actively viewing the conversation list or the specific chat)
  • Presence is best-effort — it’s okay if it’s a few seconds stale

Typing indicators:

When a user starts typing, the client sends a typing_start event. The server forwards it to the other participants in the conversation. After 3 seconds of no keystrokes, the client sends typing_stop.

{ "type": "typing", "conversation_id": "conv_abc", "user_id": "user_42", "status": "typing" }

Typing indicators are fire-and-forget. If one gets lost, nothing bad happens — the “typing…” label just disappears sooner. No persistence needed.

Step 7: Scaling

Chat server scaling:

  • Each chat server handles ~100K concurrent WebSocket connections
  • 500M concurrent users → ~5,000 chat servers
  • Use consistent hashing to assign users to chat servers
  • The connection registry in Redis lets any chat server route a message to the right destination

Database scaling:

  • Messages in Cassandra — partition by conversation_id for locality
  • User data in PostgreSQL — read replicas for lookups
  • Hot conversations (active group chats) can be cached in Redis

Message queue (Kafka) scaling:

  • Partition by conversation_id — all messages for a conversation go to the same partition (preserves ordering)
  • Add more partitions and consumers as throughput grows

Media handling:

  • Media files go to object storage (S3) via pre-signed URLs
  • Thumbnails generated asynchronously by a media processing service
  • CDN in front of object storage for fast delivery

Multi-region deployment:

  • Chat servers in multiple regions for low latency
  • Messages replicated across regions asynchronously
  • User connects to the nearest region (GeoDNS)
  • Cross-region messages: Server in Region A publishes to Kafka → replicated to Region B → delivered locally

Push notifications at scale:

  • Batch push notifications to APNs/FCM (they support batching)
  • Separate push notification workers so they don’t block message delivery
  • Rate limit notifications per user (nobody wants 100 notifications from an active group)

In simple language, a chat system is a network of WebSocket connections with a message queue in the middle. The client connects to a chat server. The server knows where every other user is connected (Redis registry). Messages flow through Kafka for reliability and ordering. Offline messages get queued and delivered on reconnect. The hardest parts are keeping messages in order, handling group fan-out efficiently, and managing millions of persistent connections. But the core pattern is surprisingly straightforward: accept message, find recipient, deliver.


41

Design a Social Media Feed (Twitter/X)

advanced 4-7 YOE news-feed timeline fan-out system-design social-media

We’re designing the home timeline for a social media platform like Twitter/X. When we open the app, we see a feed of recent posts from people we follow, ranked in some order. Sounds simple, but this is one of the hardest problems in system design.

The core challenge: when we follow 500 people and each posts 5 times a day, how does the system assemble our personalized feed from those 2,500 posts in under 200ms? Multiply that by 200M daily active users opening the app dozens of times a day. That’s the problem.

Step 1: Requirements

Functional Requirements

  • Users can create posts (text, images, videos)
  • Users can follow/unfollow other users
  • Home feed shows posts from people we follow, in a ranked order
  • Feed supports pagination (infinite scroll)
  • Real-time updates — new posts appear without refreshing
  • Like, retweet, and reply on posts

Non-Functional Requirements

  • Low latency — feed should load in < 200ms
  • High availability — the feed is THE product, it can’t go down
  • Eventually consistent — it’s okay if a new post takes a few seconds to appear in everyone’s feed
  • Scale — 500M users, 200M DAU, 1B tweets generated per day

Step 2: Estimation

Assumptions:

  • 500M total users, 200M DAU
  • Average user follows 200 people
  • Average user posts 5 tweets/day
  • Average user reads their feed 10 times/day
  • Each tweet: ~300 bytes text + metadata

QPS:

Tweet creation:  200M × 5 / 86,400 ≈ ~12,000 writes/sec
Feed reads:      200M × 10 / 86,400 ≈ ~23,000 reads/sec
Peak:            ~50,000 reads/sec

Storage:

Tweets/day:  200M × 5 = 1B tweets/day
Text storage: 1B × 300 bytes = 300 GB/day
Media (20% of tweets, avg 500 KB): 1B × 0.2 × 500 KB = 100 TB/day

Feed cache:

Each user's feed = 200 tweet IDs × 8 bytes = 1.6 KB
200M DAU × 1.6 KB = 320 GB (fits in a Redis cluster)

Step 3: High-Level Design

News Feed — High-Level Architecture
User Posts a Tweet
User Opens Feed
Write Path (Tweet Creation)
1. Tweet → API Server
2. Store tweet in Tweets DB
3. Media → Object Storage
4. Fan-out Service reads follower list
5. Push tweet ID to each follower's feed cache
Read Path (Feed Generation)
1. User request → API Server
2. Fetch tweet IDs from feed cache
3. Hydrate: fetch full tweet data
4. Merge with pull-based tweets (celebrities)
5. Rank and return the feed
Fan-out Service Feed Cache (Redis) Tweets DB User/Graph DB Object Storage

The system has two distinct paths — the write path (someone posts a tweet) and the read path (someone opens their feed). The magic (and the hard part) is in how these two paths connect. That’s where fan-out comes in.

Step 4: API Design

POST /api/v1/tweets
Body: { "content": "Hello world!", "media_ids": ["media_123"] }
Response: { "tweet_id": "tw_789", "created_at": "2026-03-30T10:00:00Z" }

GET /api/v1/feed?cursor=<last_tweet_id>&limit=20
Response: {
  "tweets": [
    { "tweet_id": "tw_789", "user": {...}, "content": "...", "likes": 42, "created_at": "..." },
    ...
  ],
  "next_cursor": "tw_750"
}

POST /api/v1/users/{user_id}/follow
POST /api/v1/users/{user_id}/unfollow

POST /api/v1/tweets/{tweet_id}/like
POST /api/v1/tweets/{tweet_id}/retweet

POST /api/v1/media/upload     — returns media_id (pre-signed URL for upload)

Cursor-based pagination is important here. We can’t use offset-based pagination (page=1, page=2) because new tweets keep getting added. If a new tweet gets inserted while we’re scrolling, offset-based pagination would show us duplicates or skip tweets. With cursors, we say “give me the next 20 tweets after tweet_750” — that’s stable regardless of new inserts.

Step 5: Data Model

-- Users table
CREATE TABLE users (
    user_id     BIGINT PRIMARY KEY,
    username    VARCHAR(50) UNIQUE,
    display_name VARCHAR(100),
    avatar_url  TEXT,
    follower_count  INT DEFAULT 0,
    following_count INT DEFAULT 0,
    created_at  TIMESTAMP
);

-- Follow graph (who follows whom)
CREATE TABLE follows (
    follower_id  BIGINT,          -- the person doing the following
    followee_id  BIGINT,          -- the person being followed
    created_at   TIMESTAMP,
    PRIMARY KEY (follower_id, followee_id),
    INDEX idx_followee (followee_id)    -- find all followers of a user
);

-- Tweets table
CREATE TABLE tweets (
    tweet_id    BIGINT PRIMARY KEY,     -- Snowflake ID (ordered by time)
    user_id     BIGINT NOT NULL,
    content     TEXT,
    media_urls  JSON,                    -- array of media URLs
    reply_to    BIGINT,                  -- null if not a reply
    retweet_of  BIGINT,                  -- null if not a retweet
    like_count  INT DEFAULT 0,
    retweet_count INT DEFAULT 0,
    created_at  TIMESTAMP,
    INDEX idx_user_time (user_id, created_at DESC)
);

-- Feed cache (Redis sorted set per user)
-- Key: feed:{user_id}
-- Value: sorted set of tweet_ids, scored by timestamp
-- Each entry: (score=timestamp, member=tweet_id)

Why Snowflake IDs? We can’t use auto-increment IDs across distributed databases (they’d collide). Twitter invented Snowflake IDs — 64-bit IDs that encode the timestamp, machine ID, and a sequence number. They’re globally unique AND chronologically sorted. Perfect for cursoring through a feed.

Step 6: Deep Dives

Deep Dive 1: Fan-Out Strategies

This is THE question in a feed system design. When someone posts a tweet, how does it end up in their followers’ feeds?

Fan-Out on Write (Push Model)

When a user posts a tweet, we immediately push the tweet ID into every follower’s feed cache.

@Alice posts a tweet (tweet_id = 789)
Alice has 1,000 followers

Fan-out service:
  → Get Alice's follower list (1,000 user IDs)
  → For each follower: ZADD feed:{follower_id} timestamp 789
  → 1,000 Redis writes

Pros: Reading the feed is blazing fast — just ZRANGE feed:{user_id} 0 19. The feed is pre-computed. Cons: Posting a tweet is expensive. If we have 1,000 followers, that’s 1,000 cache writes. And if a celebrity with 50M followers posts? That’s 50M writes per tweet. Not gonna work.

Fan-Out on Read (Pull Model)

We don’t pre-compute anything. When a user opens their feed, we fetch the latest tweets from everyone they follow, merge them, and sort by time.

User opens feed. They follow 200 people.
  → Get the list of 200 people they follow
  → For each: fetch their latest N tweets
  → Merge all tweets, sort by rank/time
  → Return top 20

Pros: Posting a tweet is simple — just one write. No fan-out cost. Cons: Reading the feed is slow. We’re doing 200+ queries and merging results on every feed load. For 200M users loading feeds, that’s brutal.

The Hybrid Approach (What Twitter Actually Does)

The smart move: use push for normal users and pull for celebrities.

Is the poster a "celebrity" (> 10K followers)?
  → YES: Don't fan out. Their tweets get pulled at read time.
  → NO:  Fan out immediately to all followers' feed caches.

When a user reads their feed:
  1. Get pre-computed feed from Redis (pushed tweets)
  2. Fetch latest tweets from celebrities they follow (pulled tweets)
  3. Merge, rank, and return

This gives us the best of both worlds. The 99% of users with normal follower counts get fast fan-out on write. The 1% of celebrities don’t clog the fan-out pipeline. And at read time, we only need to pull from a handful of celebrity accounts (most people follow maybe 10-20 celebrities at most).

The threshold: Twitter uses somewhere around 10K-50K followers as the cutoff. Above that, the user is treated as a celebrity and excluded from fan-out on write.

Deep Dive 2: Feed Ranking

A purely chronological feed is simple but not engaging. Modern feeds use ranking algorithms to show the most relevant content first.

Ranking signals:

SignalWeightWhy
RecencyHighNewer posts are more relevant
EngagementHighPosts with many likes/retweets are probably good
User relationshipMediumPosts from people we interact with often
Content typeLowImages/videos might get boosted
Past interactionMediumDid we like similar posts before?

How ranking works in practice:

  1. Candidate generation — gather the ~2,000 most recent tweet IDs from the feed cache
  2. Feature extraction — for each tweet, compute ranking features (engagement, recency, etc.)
  3. Scoring — a ranking model (could be a simple weighted score or an ML model) assigns each tweet a score
  4. Sorting — return the top N by score
Score = w1 × recency_score
      + w2 × engagement_score
      + w3 × relationship_score
      + w4 × content_type_bonus

For an interview, we don’t need to design the ML model. We just need to show we understand that ranking happens after candidate generation, and that it runs on a relatively small set of candidates (not the entire tweet database).

The “For You” vs “Following” split: Many platforms now offer both a ranked feed and a chronological feed. The chronological one is just the raw feed sorted by time. The ranked one goes through the scoring pipeline.

Deep Dive 3: Real-Time Feed Updates

When someone we follow posts a new tweet, should it appear in our feed without refreshing?

Approach: WebSocket for active users

If the user has the app open and is on the feed screen, we maintain a WebSocket connection. When a new tweet gets fanned out to their feed cache, we also send a lightweight notification through the WebSocket:

{ "type": "new_tweet", "tweet_id": "tw_999", "preview": "Just posted a new..." }

The client can then:

  • Show a “New tweets” banner at the top (like Twitter does)
  • Or silently prepend the tweet to the feed

We don’t push the full tweet data through the WebSocket — just the tweet ID. The client fetches the full data when the user taps “Show new tweets.”

For inactive users: No real-time update needed. When they next open the app, a regular feed fetch gets the latest content.

Rate limiting updates: If someone follows 200 very active accounts, we don’t want to send 50 WebSocket notifications per minute. We batch them: “You have 12 new tweets” instead of notifying for each one.

Step 7: Scaling

Tweet storage:

  • Shard the tweets table by tweet_id (Snowflake IDs make this easy — consistent hashing on the ID)
  • Recent tweets (last 7 days) in hot storage (SSD). Older tweets in cold storage (HDD).
  • Media in object storage (S3) + CDN for delivery

Feed cache (Redis):

  • Shard by user_id across a Redis Cluster
  • Each user’s feed is a sorted set with ~800 tweet IDs (keep the last few days)
  • Total: 200M users × 1.6 KB = ~320 GB across the cluster
  • Set a max size per feed (e.g., 800 entries) — old entries get evicted automatically

Fan-out service:

  • This is the most CPU-intensive part. When a tweet is published, it goes to a message queue (Kafka)
  • Fan-out workers consume from the queue and push to Redis
  • Scale workers horizontally based on queue depth
  • Priority queue: tweets from popular accounts get processed first (more people waiting)

Follow graph:

  • The follows table gets hammered on both reads (who do I follow?) and writes (follow/unfollow)
  • Cache the follower lists in Redis for fast fan-out
  • For very large follower lists (celebrities), store them in a distributed cache with pagination

Database read replicas:

  • User profiles and tweet metadata get read way more than written
  • Add read replicas to handle the read load
  • Write to the primary, read from replicas (eventual consistency is fine for a feed)

Global deployment:

  • Deploy in multiple regions. Each region has its own feed cache and tweet replicas.
  • Fan-out happens locally in each region
  • Tweet replication across regions: publish to Kafka → cross-region replication → local consumers process the fan-out
  • User connects to the nearest region (GeoDNS)

Handling viral tweets:

  • A tweet going viral means millions of likes/retweets in minutes
  • Don’t update the like_count on every single like — that’d be millions of writes to one row
  • Instead, use a counter service: batch count updates in Redis, flush to DB every few seconds
  • The displayed count can be slightly stale (“1.2M likes” doesn’t need to be exact)

In simple language, a social media feed is an exercise in pre-computation. We pre-build each user’s feed when tweets are created (fan-out on write) so that reading the feed is just a cache lookup. The celebrity problem breaks this — we can’t fan out to 50M followers — so we pull their tweets at read time and merge. The ranking layer decides what goes on top. And the whole thing is held together by Redis (feed cache), Kafka (async fan-out), and a lot of horizontal scaling. The insight is that we trade write cost for read speed, because reads outnumber writes by a huge margin.


42

Design a Video Streaming Platform (YouTube)

advanced 4-7 YOE video-streaming system-design cdn transcoding hls

We’re designing a video streaming platform like YouTube or Netflix. This is a favorite in senior-level interviews because it touches on almost everything — massive storage, heavy compute (transcoding), CDNs, adaptive streaming, and recommendation systems.

The core challenge: users upload hundreds of hours of video every minute. We need to process each video into multiple formats and resolutions, store it all, and then deliver it to a billion viewers worldwide with zero buffering. Let’s break it down.

Step 1: Requirements

Functional Requirements

  • Users can upload videos (up to 1 hour, up to 10 GB)
  • Videos are transcoded into multiple resolutions (360p, 480p, 720p, 1080p, 4K)
  • Users can stream videos with adaptive bitrate (quality adjusts to network speed)
  • Video search by title, tags, and description
  • Like, comment, and subscribe
  • Personalized video recommendations on the home page

Non-Functional Requirements

  • High availability — the platform should be up 99.99% of the time
  • Low latency playback — video should start playing in < 2 seconds
  • Smooth streaming — no buffering on a decent connection
  • Durability — uploaded videos must never be lost
  • Global reach — fast video delivery worldwide via CDN
  • Scale — 500 hours of video uploaded per minute, 1B video views per day

Step 2: Estimation

Assumptions:

  • 2B total users, 800M daily active users
  • 500 hours of video uploaded per minute (YouTube’s real number)
  • 1B video views per day
  • Average video length: 5 minutes
  • Average video size after transcoding: 500 MB across all resolutions

QPS:

Upload rate:     500 hours/min = 30,000 hours/day = ~720,000 videos/day
Upload QPS:      720,000 / 86,400 ≈ ~8 uploads/sec

View QPS:        1B / 86,400 ≈ ~12,000 views/sec
Peak view QPS:   ~30,000 views/sec

Storage:

Raw upload/day:     720K videos × 1 GB avg raw = 720 TB/day
Transcoded/day:     720K videos × 500 MB (all resolutions) = 360 TB/day
Total storage/day:  ~1 PB/day (raw + transcoded)
Per year:           ~365 PB

Bandwidth:

Outgoing (streaming): 1B views × 5 min avg × 2.5 MB/min (720p avg) = ~12.5 PB/day
                      12.5 PB / 86,400 ≈ ~150 GB/sec outgoing

That outgoing bandwidth number is exactly why we need CDNs. No single data center can push 150 GB/sec.

Step 3: High-Level Design

Video Streaming — High-Level Architecture
Upload Path
1. Creator uploads video
2. Upload to Object Storage (S3)
3. Message Queue triggers transcoding
4. Transcode to 360p/480p/720p/1080p/4K
5. Store chunks in Object Storage
6. Generate thumbnails
7. Update Metadata DB (ready)
8. Push to CDN edge nodes
Streaming Path
1. Viewer requests video
2. API returns video metadata
3. Player fetches manifest file (HLS)
4. Manifest lists available qualities
5. Player picks quality based on bandwidth
6. Fetch video segments from CDN
7. Adaptive: switch quality mid-stream
Transcoding Workers CDN (Edge Nodes) Object Storage (S3) Metadata DB Message Queue
Upload: Creator → Object Storage → Queue → Transcoder → Object Storage → CDN
Stream: Viewer → API → CDN Edge Node → Video Segments (HLS/DASH)

The key insight is that upload and streaming are completely separate paths. Uploading is async and compute-heavy (transcoding). Streaming is read-heavy and latency-sensitive (served from CDN). These two paths scale independently.

Component breakdown:

  • API Servers — handle user requests (upload metadata, search, likes, comments, feed)
  • Object Storage (S3) — stores raw uploads and transcoded video chunks. Cheap, durable, infinitely scalable.
  • Transcoding Workers — CPU/GPU-intensive workers that convert raw video to multiple formats and resolutions
  • Message Queue (Kafka/SQS) — decouples upload from transcoding. The upload finishes fast, transcoding happens async.
  • Metadata DB — video titles, descriptions, view counts, user data. PostgreSQL or a similar relational DB.
  • CDN — the star of the show. Distributes video segments to edge servers worldwide. 90%+ of video traffic is served from CDN, not our origin servers.

Step 4: API Design

POST /api/v1/videos/upload
  → Returns a pre-signed URL for direct upload to object storage
  Body: { "title": "My Video", "description": "...", "tags": ["coding"] }
  Response: { "video_id": "vid_123", "upload_url": "https://s3.../presigned" }

GET /api/v1/videos/{video_id}
  → Returns video metadata + streaming URLs
  Response: {
    "video_id": "vid_123",
    "title": "My Video",
    "status": "ready",
    "manifest_url": "https://cdn.example.com/vid_123/master.m3u8",
    "thumbnail_url": "https://cdn.example.com/vid_123/thumb.jpg",
    "views": 1500000,
    "likes": 82000,
    "created_at": "2026-03-30T10:00:00Z"
  }

GET /api/v1/videos/search?q=system+design&limit=20&cursor=abc
GET /api/v1/feed?limit=20&cursor=abc                — personalized home feed

POST /api/v1/videos/{video_id}/like
POST /api/v1/videos/{video_id}/comments
  Body: { "text": "Great video!" }
POST /api/v1/users/{user_id}/subscribe

Why pre-signed URLs for upload? We don’t want the video file to flow through our API servers. That would bottleneck them. Instead, we give the client a pre-signed URL that lets them upload directly to object storage (S3). Our server never touches the raw video bytes — it just manages metadata.

Step 5: Data Model

-- Users table (PostgreSQL)
CREATE TABLE users (
    user_id         BIGINT PRIMARY KEY,
    username        VARCHAR(50) UNIQUE,
    display_name    VARCHAR(100),
    avatar_url      TEXT,
    subscriber_count INT DEFAULT 0,
    created_at      TIMESTAMP
);

-- Videos table (PostgreSQL)
CREATE TABLE videos (
    video_id        BIGINT PRIMARY KEY,      -- Snowflake ID
    user_id         BIGINT NOT NULL,
    title           VARCHAR(200),
    description     TEXT,
    status          VARCHAR(20),             -- 'uploading', 'transcoding', 'ready', 'failed'
    duration_sec    INT,
    manifest_url    TEXT,                    -- HLS master playlist URL
    thumbnail_url   TEXT,
    view_count      BIGINT DEFAULT 0,
    like_count      BIGINT DEFAULT 0,
    tags            TEXT[],
    created_at      TIMESTAMP,
    INDEX idx_user_time (user_id, created_at DESC)
);

-- Video chunks (stored in object storage, tracked in metadata)
-- Each resolution has its own set of chunks
-- Path pattern: s3://videos/{video_id}/{resolution}/{segment_number}.ts
-- Manifest:     s3://videos/{video_id}/master.m3u8

-- Comments table (PostgreSQL or Cassandra for heavy-write scenarios)
CREATE TABLE comments (
    comment_id      BIGINT PRIMARY KEY,
    video_id        BIGINT NOT NULL,
    user_id         BIGINT NOT NULL,
    text            TEXT,
    like_count      INT DEFAULT 0,
    created_at      TIMESTAMP,
    INDEX idx_video_time (video_id, created_at DESC)
);

-- Likes table
CREATE TABLE likes (
    user_id         BIGINT,
    video_id        BIGINT,
    created_at      TIMESTAMP,
    PRIMARY KEY (user_id, video_id)
);

-- Subscriptions table
CREATE TABLE subscriptions (
    subscriber_id   BIGINT,
    creator_id      BIGINT,
    created_at      TIMESTAMP,
    PRIMARY KEY (subscriber_id, creator_id),
    INDEX idx_creator (creator_id)
);

Step 6: Deep Dives

Deep Dive 1: Video Upload and Transcoding Pipeline

When a creator uploads a video, a LOT happens behind the scenes before viewers can watch it. Let’s trace the full pipeline.

Step 1: Upload

The creator’s app gets a pre-signed URL from our API. The raw video (could be 5 GB) uploads directly to object storage. This bypasses our servers entirely — S3 handles the heavy lifting.

For large files, we use multipart upload. The file gets split into chunks (say 10 MB each), each chunk uploads in parallel, and S3 reassembles them. If a chunk fails, we retry just that chunk — not the whole file.

Step 2: Trigger transcoding

Once the upload completes, S3 sends an event notification to our message queue (SQS/Kafka). A transcoding orchestrator picks up the event.

Step 3: Transcoding

This is the compute-heavy part. We need to produce multiple versions of the video:

Input:  raw_video.mp4 (1080p, 5 GB)

Output:
  → 360p  @ 400 kbps   (mobile on slow network)
  → 480p  @ 800 kbps   (mobile on decent network)
  → 720p  @ 2.5 Mbps   (laptop)
  → 1080p @ 5 Mbps     (desktop/TV)
  → 4K    @ 20 Mbps    (high-end TV)

Each resolution also gets chunked into small segments (typically 2-10 seconds each). These segments are the fundamental unit of streaming — we’ll see why in the next deep dive.

Transcoding is embarrassingly parallel. We can split the raw video into sections and transcode each section independently on different workers. A 30-minute video can be transcoded in minutes by throwing enough workers at it.

Step 4: Generate thumbnails and metadata

While transcoding, we also extract thumbnails (multiple frames for the preview scrubber), video duration, and other metadata.

Step 5: Upload transcoded chunks to object storage

All the segments and manifest files get stored in S3 with a predictable path structure:

s3://videos/vid_123/360p/segment_001.ts
s3://videos/vid_123/360p/segment_002.ts
...
s3://videos/vid_123/1080p/segment_001.ts
s3://videos/vid_123/master.m3u8

Step 6: Update metadata and notify CDN

The video status changes from “transcoding” to “ready” in the database. The CDN gets the new content pushed to its edge nodes (or more commonly, it pulls on first request).

The whole pipeline takes 5-30 minutes depending on video length and our transcoding capacity.

Deep Dive 2: Adaptive Bitrate Streaming (HLS)

This is probably the most interesting technical detail. When we watch a YouTube video, the quality adjusts automatically based on our network speed. That’s adaptive bitrate streaming, and it uses a protocol called HLS (HTTP Live Streaming).

Here’s how it works:

The manifest file (master.m3u8):

When the video player starts, it fetches a master manifest file. This file lists all available quality levels:

#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=640x360
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=854x480
480p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/playlist.m3u8

The quality-specific playlist:

Each quality level has its own playlist listing the individual segments:

#EXTM3U
#EXT-X-TARGETDURATION:6
#EXTINF:6.0,
segment_001.ts
#EXTINF:6.0,
segment_002.ts
#EXTINF:6.0,
segment_003.ts

The adaptive magic:

  1. The player starts by measuring available bandwidth
  2. It picks the highest quality that fits within the bandwidth
  3. It downloads segments one by one
  4. After each segment download, it measures the actual download speed
  5. If the network got slower, it switches to a lower quality for the next segment
  6. If the network got faster, it switches up

In simple language, the video is chopped into tiny pieces (6-second segments), and each piece exists in multiple quality levels. The player picks the best quality for each piece based on current network speed. That’s why we sometimes see quality shift mid-video — we literally switched to a different resolution’s segment.

Why segments over CDN? Each 6-second segment is a regular HTTP file. It gets cached by the CDN just like any other file. No special streaming protocol needed — just plain HTTP requests. That’s the genius of HLS. It piggybacks on all existing HTTP infrastructure.

Deep Dive 3: Video Recommendation Basics

When we open YouTube’s home page, we see a personalized feed of videos. How does the recommendation engine work at a high level?

Two main approaches:

Collaborative filtering — “Users who watched X also watched Y.” We don’t even need to understand the video content. We just look at patterns in watch behavior.

User A watched: [Video 1, Video 2, Video 3]
User B watched: [Video 1, Video 2, Video 4]

User A and B are similar (both watched 1 and 2).
Recommend Video 4 to User A, Video 3 to User B.

Content-based filtering — “This video has similar tags/categories to videos you liked.” We analyze the content attributes (title, tags, category, description) and find similar videos.

In practice, it’s a pipeline:

  1. Candidate generation — narrow down from billions of videos to a few thousand candidates using quick, rough signals (user’s watch history, subscriptions, trending in their region)
  2. Ranking — score each candidate using detailed features (watch time prediction, click-through rate, user engagement signals)
  3. Re-ranking — apply business rules (diversity, freshness, remove duplicates, filter out stuff they’ve already watched)

The ranking model is typically a deep neural network trained on features like:

  • User’s watch history and preferences
  • Video metadata (title, category, upload date)
  • Engagement metrics (average watch time, like ratio)
  • Context (time of day, device type)

For a system design interview, we don’t need to design the ML model. We just need to show we understand the pipeline: generate candidates → rank → re-rank → serve.

Step 7: Scaling

Object storage:

  • S3 handles the storage scaling for us — it’s practically infinite
  • Use S3 lifecycle policies: move videos with < 10 views after 90 days to cheaper storage tiers (S3 Glacier)
  • Hot/cold separation: popular videos on fast storage, old/unpopular on archive storage

CDN is everything:

  • 90%+ of video traffic should be served from CDN edge nodes
  • Use multiple CDN providers for redundancy (YouTube uses Google’s own CDN, Netflix uses Open Connect)
  • Popular videos get cached at every edge location. Long-tail videos get cached on demand.
  • Pre-warm popular content: push trending videos to edge nodes proactively

Transcoding at scale:

  • Use a managed service (AWS Elastic Transcoder, MediaConvert) or a fleet of GPU instances
  • Auto-scale based on queue depth — if there are 10,000 videos waiting, spin up more workers
  • Priority queue: premium creators or popular channels get transcoded first
  • Cost optimization: don’t transcode to 4K if the source is only 720p

Database scaling:

  • Videos metadata: shard by video_id across PostgreSQL instances
  • View counts: don’t update the DB on every view. Use Redis counters, flush to DB in batches
  • Comments: for viral videos with millions of comments, shard by video_id. Use Cassandra if write volume is extreme.

Search:

  • Elasticsearch cluster for video search (title, description, tags)
  • Index gets updated whenever a video status changes to “ready”
  • Autocomplete using prefix queries on Elasticsearch

Global deployment:

  • Upload to the nearest region’s object storage, then replicate
  • Transcoding can happen in any region — send the job to wherever we have spare capacity
  • Metadata DB: primary in one region, read replicas globally
  • The CDN handles global delivery — that’s its whole job

In simple language, a video streaming platform is really two separate systems glued together. The upload side is an async processing pipeline — take a raw video, transcode it into segments at multiple qualities, and push to storage. The streaming side is just serving static files (video segments) through a CDN using the HLS protocol. The player fetches a manifest file, picks a quality, and downloads segments one by one. The CDN does the heavy lifting of global delivery. Everything else — search, recommendations, comments — is supporting infrastructure around these two core paths.


43

Design a Ride-Sharing Service (Uber)

advanced 4-7 YOE ride-sharing system-design geospatial real-time matching

We’re designing a ride-sharing service like Uber or Lyft. This is a fantastic interview question because it hits real-time systems, geospatial queries, matching algorithms, and distributed state — all in one problem.

The core challenge: a rider requests a ride, and within seconds we need to find the nearest available driver, match them, track both in real time, calculate the fare, and process payment. All of this for 20 million rides happening per day across hundreds of cities.

Step 1: Requirements

Functional Requirements

  • Riders can request a ride by setting pickup and dropoff locations
  • The system matches riders with the nearest available driver
  • Real-time tracking of the driver’s location (for both rider and driver)
  • ETA calculation for driver arrival and trip duration
  • Fare estimation before the ride and final fare calculation after
  • Payment processing at the end of the ride
  • Rating system for riders and drivers
  • Ride history for both riders and drivers

Non-Functional Requirements

  • Low latency matching — find a driver in < 10 seconds
  • Real-time updates — location updates should reflect in < 1 second
  • High availability — riders stranded without a working app is a very bad look
  • Consistency for payments — we can never double-charge or lose a payment
  • Scale — 20M rides/day, 5M active drivers, location updates every 4 seconds

Step 2: Estimation

Assumptions:

  • 100M total riders, 5M active drivers
  • 20M rides per day
  • Each active driver sends location every 4 seconds
  • Average ride duration: 15 minutes
  • Peak hours: 3x average load

QPS:

Ride requests:      20M / 86,400 ≈ ~230 requests/sec
Peak ride requests: ~700 requests/sec

Location updates:   5M drivers × (1 update / 4 sec) = 1.25M updates/sec
Peak:               ~2M updates/sec

That location update number is wild. 1.25 million writes per second just for driver locations. This is the single hardest scaling challenge in this system.

Storage:

Location update size: ~100 bytes (driver_id, lat, lng, timestamp, heading)
Location writes/day:  1.25M/sec × 86,400 = ~108B writes/day
If we keep 30 days of history: 108B × 100 bytes × 30 = ~324 TB

Ride data: 20M rides/day × 1 KB per ride = 20 GB/day

We don’t need to store every single location update forever. Current driver locations live in Redis (real-time). Historical location traces can go to a time-series database for analytics and trip reconstruction.

Step 3: High-Level Design

Ride-Sharing — High-Level Architecture
Rider App
Driver App
│ REST + WebSocket
│ REST + WebSocket
API Gateway / Load Balancer
Matching Service
find nearby drivers
Location Service
track all drivers
Trip Service
manage ride lifecycle
Pricing Service
fare + surge
Payment Service
charge rider
Redis (driver locations)
PostgreSQL (rides, users)
Kafka (events)
Ride flow: Rider requests → Matching finds driver → Trip manages ride → Payment charges at end
Location flow: Driver app → Location Service → Redis (every 4 sec) → pushed to rider via WebSocket

The ride lifecycle:

  1. Rider requests a ride — sends pickup/dropoff location to the API
  2. Pricing Service — calculates the estimated fare (including surge pricing if applicable)
  3. Matching Service — queries the Location Service for nearby available drivers, picks the best one
  4. Driver notified — push notification + WebSocket update. Driver accepts or declines.
  5. Trip starts — driver heads to pickup. Rider sees driver’s real-time location.
  6. In transit — driver picks up rider, heads to destination. Location tracked throughout.
  7. Trip ends — driver marks ride complete. Final fare calculated based on actual distance/time.
  8. Payment — rider charged, driver paid (minus platform fee)

Step 4: API Design

POST /api/v1/rides/estimate
  Body: { "pickup": { "lat": 37.7749, "lng": -122.4194 },
          "dropoff": { "lat": 37.7849, "lng": -122.4094 } }
  Response: { "estimated_fare": "$12.50", "estimated_time": "15 min",
              "surge_multiplier": 1.2 }

POST /api/v1/rides/request
  Body: { "pickup": { "lat": 37.7749, "lng": -122.4194 },
          "dropoff": { "lat": 37.7849, "lng": -122.4094 },
          "ride_type": "standard" }
  Response: { "ride_id": "ride_456", "status": "matching",
              "estimated_pickup": "4 min" }

GET /api/v1/rides/{ride_id}
  → Current ride status, driver info, location, ETA

POST /api/v1/rides/{ride_id}/cancel
POST /api/v1/rides/{ride_id}/rate
  Body: { "rating": 5, "comment": "Great ride!" }

-- Driver endpoints:
PUT /api/v1/drivers/location
  Body: { "lat": 37.7750, "lng": -122.4195, "heading": 180, "speed": 30 }

POST /api/v1/rides/{ride_id}/accept
POST /api/v1/rides/{ride_id}/start      -- driver picked up the rider
POST /api/v1/rides/{ride_id}/complete    -- driver arrived at destination

Real-time communication happens over WebSocket. Both the rider and driver apps maintain a persistent WebSocket connection. Through this, we push:

  • Driver location updates to the rider
  • Ride status changes (driver assigned, arriving, trip started, etc.)
  • Navigation updates to the driver

Step 5: Data Model

-- Users table (PostgreSQL)
CREATE TABLE users (
    user_id         BIGINT PRIMARY KEY,
    type            VARCHAR(10),             -- 'rider' or 'driver'
    name            VARCHAR(100),
    email           VARCHAR(255) UNIQUE,
    phone           VARCHAR(20),
    rating          DECIMAL(3,2) DEFAULT 5.0,
    total_rides     INT DEFAULT 0,
    created_at      TIMESTAMP
);

-- Driver details (PostgreSQL)
CREATE TABLE drivers (
    driver_id       BIGINT PRIMARY KEY REFERENCES users(user_id),
    vehicle_make    VARCHAR(50),
    vehicle_model   VARCHAR(50),
    vehicle_plate   VARCHAR(20),
    vehicle_color   VARCHAR(30),
    license_number  VARCHAR(50),
    status          VARCHAR(20),             -- 'available', 'busy', 'offline'
    current_city    VARCHAR(50)
);

-- Rides table (PostgreSQL)
CREATE TABLE rides (
    ride_id         BIGINT PRIMARY KEY,
    rider_id        BIGINT NOT NULL,
    driver_id       BIGINT,
    status          VARCHAR(20),             -- 'matching', 'accepted', 'arriving',
                                             -- 'in_progress', 'completed', 'cancelled'
    pickup_lat      DECIMAL(10,7),
    pickup_lng      DECIMAL(10,7),
    dropoff_lat     DECIMAL(10,7),
    dropoff_lng     DECIMAL(10,7),
    estimated_fare  DECIMAL(10,2),
    actual_fare     DECIMAL(10,2),
    surge_multiplier DECIMAL(3,2) DEFAULT 1.0,
    distance_km     DECIMAL(10,2),
    duration_min    INT,
    requested_at    TIMESTAMP,
    started_at      TIMESTAMP,
    completed_at    TIMESTAMP,
    INDEX idx_rider (rider_id, requested_at DESC),
    INDEX idx_driver (driver_id, requested_at DESC)
);

-- Payments table (PostgreSQL)
CREATE TABLE payments (
    payment_id      BIGINT PRIMARY KEY,
    ride_id         BIGINT UNIQUE NOT NULL,
    rider_id        BIGINT NOT NULL,
    driver_id       BIGINT NOT NULL,
    amount          DECIMAL(10,2),
    platform_fee    DECIMAL(10,2),
    driver_payout   DECIMAL(10,2),
    status          VARCHAR(20),             -- 'pending', 'charged', 'paid_out', 'refunded'
    payment_method  VARCHAR(20),
    processed_at    TIMESTAMP
);

-- Driver locations (Redis — real-time, not persistent)
-- Using Redis GEO commands for geospatial queries
-- GEOADD drivers:available {lng} {lat} {driver_id}
-- GEORADIUS drivers:available {lng} {lat} 5 km COUNT 20 ASC

Step 6: Deep Dives

Deep Dive 1: Geospatial Indexing — Finding Nearby Drivers

When a rider requests a ride, we need to answer: “Which available drivers are within 5 km of this location?” And we need to answer it in milliseconds, across millions of drivers.

Option A: Brute force (don’t do this)

Scan all 5M drivers, calculate the distance to the rider for each one, filter by radius. That’s O(n) for every request. Way too slow.

Option B: Geohashing

Think of it like a zip code for GPS coordinates. We divide the entire earth into a grid, and each cell gets a hash string. The clever part: cells that are geographically close share a common prefix in their hash.

Geohash: 9q8yyk → a grid cell in San Francisco
Geohash: 9q8yym → the cell right next to it

They share prefix "9q8yy" → they're neighbors

How we use it:

  1. When a driver sends a location update, we compute their geohash and store it
  2. When a rider requests a ride, we compute the rider’s geohash
  3. We search the rider’s geohash cell AND all neighboring cells for available drivers
  4. Since geohash cells have a fixed size, this narrows our search from 5M drivers to maybe 50-100 in the area

Option C: Redis GEO (what we’d actually use)

Redis has built-in geospatial support using a sorted set with geohash encoding under the hood.

GEOADD drivers:available -122.4194 37.7749 driver_42
GEOADD drivers:available -122.4095 37.7850 driver_99

GEORADIUS drivers:available -122.4194 37.7749 5 km COUNT 20 ASC
→ Returns the 20 closest drivers within 5 km, sorted by distance

In simple language, Redis GEO does the geohashing for us. We just say “add this driver at this coordinate” and “find me drivers near this point.” It’s incredibly fast because it’s all in-memory and uses a sorted set internally.

Why not a quadtree? Quadtrees work great too — Uber actually used a custom quadtree for a while. But Redis GEO is simpler to operate and scales well for most ride-sharing scenarios. At Uber’s scale, they moved to a custom solution (H3 — a hexagonal grid system), but for an interview, Redis GEO or geohashing is the right answer.

Deep Dive 2: Driver-Rider Matching

Finding nearby drivers is step one. But which driver do we actually assign? The closest one isn’t always the best choice.

Simple approach: Closest driver

Find the nearest available driver, send them the request. If they decline, move to the next closest. Simple, but not optimal.

Better approach: ETA-based matching

The closest driver by straight-line distance might be on the other side of a highway. A driver slightly farther away might actually arrive sooner because of road layout.

Driver A: 1.2 km away (straight line), but ETA = 8 minutes (blocked by river)
Driver B: 1.8 km away (straight line), but ETA = 4 minutes (clear road)

→ We should pick Driver B

We compute the actual driving ETA (using a routing engine like OSRM or Google Maps Directions API) for the top 5-10 closest drivers, then pick the one with the shortest ETA.

Advanced approach: Scoring function

Uber uses a scoring function that considers multiple factors:

Score = w1 × (1 / ETA)                  -- shorter ETA is better
      + w2 × driver_rating              -- higher-rated drivers preferred
      + w3 × acceptance_rate            -- drivers who accept more rides
      + w4 × earnings_fairness          -- distribute rides fairly

The matching service computes this score for the top candidates and sends the request to the highest-scoring driver. If they don’t accept within 10 seconds, it moves to the next one.

Batch matching:

During peak hours, there might be many riders and drivers in the same area. Instead of matching one-by-one, we can batch — collect all ride requests and available drivers in a time window (say 2 seconds), and solve the optimal matching problem for the whole batch. This gives globally better matches but adds a small delay.

Deep Dive 3: Real-Time Location Tracking

Every active driver sends their GPS location to our system every 4 seconds. At 5M active drivers, that’s 1.25M location updates per second. How do we handle this firehose?

The write path:

  1. Driver app sends location to the API gateway
  2. API gateway routes to the Location Service
  3. Location Service updates Redis (current location) AND publishes to Kafka (event stream)
Driver → Location Service → Redis GEOADD (current position)
                          → Kafka topic: "driver-locations" (for history/analytics)

Why Redis for current locations?

We only care about the current location for matching. We don’t need a durable database for this. Redis is in-memory, so writes are microseconds fast. If Redis loses data, the next location update (4 seconds later) will repopulate it. No big deal.

Why Kafka for the stream?

We publish every location update to Kafka for multiple consumers:

  • Trip Service — to track the active ride and compute distance/fare
  • ETA Service — to update arrival estimates
  • Analytics — to build heatmaps, optimize driver positioning
  • Fraud detection — to verify the driver is actually driving the route

Pushing location to the rider:

When a rider is waiting for their driver, we need to push the driver’s location to the rider’s app in real time.

Driver sends location every 4 sec
→ Location Service updates Redis
→ Location Service publishes to Kafka
→ Trip consumer reads from Kafka
→ Trip consumer pushes to rider via WebSocket
→ Rider's app updates the map

The rider sees the little car moving on the map, updating every 4 seconds. Smooth enough to feel real-time.

Scaling location updates:

1.25M writes/sec is a lot. We can handle it by:

  • Sharding Redis by city — each city gets its own Redis cluster. Drivers in NYC only exist in the NYC shard. This also makes sense because we’d never match a driver in NYC with a rider in London.
  • Kafka partitioning by city — same idea. Each city is a partition (or set of partitions).
  • Batching on the client — instead of sending every single GPS point, the driver app can batch 2-3 points and send them together. Reduces QPS by 2-3x.

Step 7: Scaling

Location Service:

  • Shard by city/region — each region gets its own Redis cluster
  • 5M drivers across maybe 500 cities = ~10K drivers per city on average
  • A single Redis instance can handle 100K+ ops/sec. Even busy cities are fine.
  • For mega-cities (NYC, London, Mumbai), use Redis Cluster with multiple shards

Matching Service:

  • Stateless — can scale horizontally with more instances
  • The bottleneck is the geospatial query + ETA computation
  • Cache ETA results for common routes (e.g., airport to downtown)
  • During peak hours, spin up more matching workers

Trip Service:

  • Each active ride is a state machine: matching → accepted → arriving → in_progress → completed
  • Store active rides in Redis for fast status updates
  • Persist completed rides to PostgreSQL

Surge pricing:

  • Divide each city into hexagonal zones
  • Track supply (available drivers) and demand (ride requests) per zone in real time
  • When demand exceeds supply, apply a surge multiplier (1.2x, 1.5x, 2x)
  • Surge data lives in Redis — it changes every few minutes

Payment processing:

  • Process payments asynchronously after the ride ends
  • Use a payment queue to handle retries and failures
  • Double-charge prevention: use idempotency keys on every payment request
  • The ride can only be marked “completed” after payment succeeds (saga pattern)

Database scaling:

  • Rides table: partition by date range (current month in hot storage, older in archive)
  • Read replicas for analytics queries
  • The users table is relatively small — standard PostgreSQL with caching handles it

Multi-region deployment:

  • Each region operates independently (a ride in NYC doesn’t need to talk to London)
  • User accounts are global (replicated across regions)
  • When a user travels, their account data is fetched from the global store and cached locally

In simple language, a ride-sharing system is built around three core problems: knowing where all the drivers are (Location Service + Redis GEO), finding the best driver for a rider (Matching Service with geospatial queries + ETA), and managing the ride from start to finish (Trip Service as a state machine). The location firehose (1M+ updates/sec) is the biggest scaling challenge, and we solve it by sharding by city and using Redis for current positions. Everything else — payments, pricing, ratings — is standard microservice territory.


44

Design a File Storage Service (Dropbox)

advanced 4-7 YOE file-storage system-design sync deduplication chunking

We’re designing a cloud file storage service like Dropbox or Google Drive. Users store files in the cloud, and those files sync automatically across all their devices. Edit a file on your laptop, and it appears on your phone seconds later.

The core challenge: how do we sync files across devices efficiently, handle conflicts when two people edit the same file, and avoid storing duplicate data when millions of users upload the same file? Let’s figure it out.

Step 1: Requirements

Functional Requirements

  • Upload, download, and delete files
  • Automatic sync across all devices linked to an account
  • File versioning — ability to view and restore previous versions
  • Shared folders (multiple users can access the same files)
  • Offline editing — changes sync when the device comes back online
  • Notifications when a file is changed by someone else

Non-Functional Requirements

  • Reliability — files must never be lost. Zero data loss. Period.
  • Sync speed — changes should sync in < 5 seconds on a decent connection
  • Bandwidth efficiency — only transfer what changed, not the entire file
  • Consistency — all devices should eventually see the same file state
  • Scale — 500M users, 10M DAU, average 2 GB stored per user
  • Large file support — handle files up to 50 GB

Step 2: Estimation

Assumptions:

  • 500M total users, 10M daily active users
  • Average storage per user: 2 GB
  • Average file size: 500 KB
  • Each DAU syncs ~10 file changes per day
  • 20% of changes are to large files (> 10 MB)

QPS:

File operations:    10M DAU × 10 changes/day = 100M operations/day
Write QPS:          100M / 86,400 ≈ ~1,200 ops/sec
Peak QPS:           ~3,000 ops/sec

Storage:

Total storage:      500M users × 2 GB = 1 EB (1,000 PB)
Daily new uploads:  100M operations × 500 KB avg = 50 TB/day
With deduplication: ~15 TB/day (many users upload similar files)
Versioning overhead: ~20% extra storage for keeping old versions

Bandwidth:

Upload:   50 TB/day / 86,400 = ~600 MB/sec
Download: 3× upload (people download more than they upload) ≈ ~1.8 GB/sec

The big number here is total storage: 1 exabyte. That’s why deduplication is so critical — it can save us 50-70% of storage costs.

Step 3: High-Level Design

File Storage — High-Level Architecture
Desktop Client
Mobile App
Web Browser
│ HTTPS + WebSocket
API Gateway
Sync Service
track changes
Chunking Service
split + dedup
Notification Service
push changes
Sharing Service
permissions
Metadata DB
(PostgreSQL)
Block Storage
(S3 / custom)
Message Queue
(Kafka)
Upload: Client detects change → chunks file → uploads only new chunks → updates metadata
Sync: Notification pushed to other devices → client fetches changed chunks → reassembles file

How the pieces fit together:

  • Desktop Client — the local agent running on the user’s machine. It watches the sync folder for changes, splits files into chunks, computes hashes, and communicates with the server.
  • Sync Service — the brain. It knows the latest version of every file, tracks which chunks make up each file, and coordinates updates.
  • Chunking Service — splits files into fixed-size chunks and deduplicates them. More on this in the deep dives.
  • Block Storage — where the actual file chunks live. Could be S3 or a custom storage system (Dropbox built their own called Magic Pocket).
  • Metadata DB — stores everything about the files (names, paths, sizes, versions, who owns them) but NOT the file data itself.
  • Notification Service — tells other devices that something changed so they can sync.

Step 4: API Design

POST /api/v1/files/upload/init
  Body: { "file_name": "report.pdf", "file_size": 52428800,
          "parent_folder_id": "folder_42",
          "chunk_hashes": ["abc123", "def456", "ghi789"] }
  Response: { "file_id": "file_99", "upload_id": "up_555",
              "chunks_needed": ["def456", "ghi789"],   -- abc123 already exists!
              "upload_urls": { "def456": "https://s3.../presigned", ... } }

PUT /api/v1/files/upload/{upload_id}/chunk/{chunk_hash}
  Body: <binary chunk data>
  Response: { "status": "ok" }

POST /api/v1/files/upload/{upload_id}/complete
  Response: { "file_id": "file_99", "version": 3 }

GET /api/v1/files/{file_id}
  → Returns file metadata + download URLs for chunks

GET /api/v1/files/{file_id}/versions
  → Returns list of all versions

GET /api/v1/sync?cursor=<last_sync_token>
  → Returns all changes since the last sync point
  Response: {
    "changes": [
      { "type": "modified", "file_id": "file_99", "version": 3, ... },
      { "type": "deleted", "file_id": "file_50", ... }
    ],
    "cursor": "sync_token_abc",
    "has_more": false
  }

POST /api/v1/shares
  Body: { "file_id": "file_99", "user_email": "bob@example.com",
          "permission": "edit" }

The key insight in the upload API: Notice how the client sends chunk hashes before uploading any data. The server checks which chunks it already has (from other users or previous versions). If a chunk already exists, we skip the upload. This is deduplication in action — the client might only need to upload 1 out of 10 chunks for an edited file.

Step 5: Data Model

-- Users table (PostgreSQL)
CREATE TABLE users (
    user_id         BIGINT PRIMARY KEY,
    email           VARCHAR(255) UNIQUE,
    name            VARCHAR(100),
    storage_used    BIGINT DEFAULT 0,        -- bytes
    storage_limit   BIGINT DEFAULT 2147483648, -- 2 GB free tier
    created_at      TIMESTAMP
);

-- Files table (metadata only, PostgreSQL)
CREATE TABLE files (
    file_id         BIGINT PRIMARY KEY,
    owner_id        BIGINT NOT NULL,
    parent_folder_id BIGINT,                 -- null for root
    file_name       VARCHAR(255),
    is_folder       BOOLEAN DEFAULT FALSE,
    latest_version  INT DEFAULT 1,
    file_size       BIGINT,
    mime_type       VARCHAR(100),
    is_deleted      BOOLEAN DEFAULT FALSE,   -- soft delete
    created_at      TIMESTAMP,
    updated_at      TIMESTAMP,
    INDEX idx_parent (parent_folder_id, file_name),
    INDEX idx_owner (owner_id)
);

-- File versions (PostgreSQL)
CREATE TABLE file_versions (
    file_id         BIGINT,
    version         INT,
    file_size       BIGINT,
    chunk_hashes    TEXT[],                  -- ordered list of chunk hashes
    modified_by     BIGINT,
    created_at      TIMESTAMP,
    PRIMARY KEY (file_id, version)
);

-- Chunks (PostgreSQL — metadata. Actual data in block storage)
CREATE TABLE chunks (
    chunk_hash      VARCHAR(64) PRIMARY KEY, -- SHA-256 hash
    size            INT,
    reference_count INT DEFAULT 1,           -- for garbage collection
    storage_path    TEXT,                    -- path in block storage
    created_at      TIMESTAMP
);

-- Workspaces / shared folders (PostgreSQL)
CREATE TABLE workspace_members (
    workspace_id    BIGINT,
    user_id         BIGINT,
    permission      VARCHAR(10),             -- 'owner', 'edit', 'view'
    created_at      TIMESTAMP,
    PRIMARY KEY (workspace_id, user_id)
);

-- Sync history (for cursor-based sync)
CREATE TABLE sync_log (
    log_id          BIGINT PRIMARY KEY,      -- monotonically increasing
    user_id         BIGINT NOT NULL,
    file_id         BIGINT NOT NULL,
    action          VARCHAR(10),             -- 'create', 'modify', 'delete', 'move'
    version         INT,
    timestamp       TIMESTAMP,
    INDEX idx_user_sync (user_id, log_id)
);

Why track chunks separately? Because the same chunk can appear in millions of files across different users. We store the chunk data once and keep a reference count. When no file references a chunk anymore, we garbage-collect it.

Step 6: Deep Dives

Deep Dive 1: File Chunking and Deduplication

This is the most important optimization in a file storage system. Without it, storage costs would be astronomical.

How chunking works:

Instead of storing a file as one big blob, we split it into fixed-size chunks (typically 4 MB each). Each chunk gets a SHA-256 hash computed from its content.

report.pdf (50 MB)
  → Chunk 1: bytes[0..4MB]     → hash: "abc123"
  → Chunk 2: bytes[4MB..8MB]   → hash: "def456"
  → Chunk 3: bytes[8MB..12MB]  → hash: "ghi789"
  ... (13 chunks total)

The file is stored as: [abc123, def456, ghi789, ...]

Why chunking saves bandwidth:

Imagine we edit a 50 MB file and only change a few paragraphs in the middle. Without chunking, we’d have to re-upload the entire 50 MB. With chunking, only the chunks that changed need to be re-uploaded. If we changed text in chunk 5, we upload one 4 MB chunk instead of 50 MB.

Original file: [abc, def, ghi, jkl, mno, pqr]
Edited file:   [abc, def, ghi, jkl, XYZ, pqr]
                                     ^^^
                              Only this chunk changed!

Upload needed: just 4 MB (the new chunk) instead of 50 MB

How deduplication works:

Here’s where it gets really clever. Since chunks are identified by their content hash (SHA-256), two identical chunks produce the same hash — regardless of who uploaded them.

User A uploads vacation_photos.zip → chunks: [aaa, bbb, ccc]
User B uploads the SAME file       → chunks: [aaa, bbb, ccc]

When User B tries to upload, the server says:
"I already have chunks aaa, bbb, and ccc. Upload nothing."

We just saved 100% of the storage and bandwidth for User B's upload.

This is called content-addressable storage. The address (hash) is derived from the content. Same content = same address = stored only once.

Dedup in practice saves a ton:

  • Email attachments that millions of people receive (same PDF, same spreadsheet)
  • Common libraries and frameworks (how many copies of React exist in Dropbox?)
  • OS files and application binaries

Dropbox reported that deduplication saved them 75% of their storage.

One important consideration: We can also use variable-length chunking (rolling hash / Rabin fingerprint) instead of fixed 4 MB chunks. This handles file insertions better — if we insert text at the beginning of a file, fixed-size chunks would shift everything and all chunks would change. Variable-length chunking is smarter about finding natural boundaries. But fixed-size is simpler and works well enough for most cases.

Deep Dive 2: Sync Conflict Resolution

What happens when two people edit the same file at the same time? Or when we edit a file offline on two different devices?

The problem:

Device A (offline): edits report.pdf → creates version 3
Device B (offline): edits report.pdf → creates version 3

Both come online. Which version 3 wins?

Option A: Last Writer Wins (LWW)

The simplest approach. Whichever device syncs last overwrites the other. The “losing” version gets saved as a previous version so nothing is lost.

Device A syncs at 10:00:01 → version 3 saved
Device B syncs at 10:00:05 → version 4 (overwrites, but version 3 is kept in history)

This is what Dropbox does for regular files. If there’s a conflict, Dropbox saves the conflicting copy as “report (Manish’s conflicted copy).pdf” right next to the original. The user resolves it manually.

Option B: Operational Transform (OT) / CRDTs

For real-time collaborative editing (like Google Docs), we need something smarter. Instead of saving the whole file, we track individual operations (“insert ‘hello’ at position 42”, “delete characters 10-15”).

These operations can be transformed to work together even when they arrive out of order. This is how Google Docs lets 10 people type in the same document at the same time.

But OT is incredibly complex to implement. For a file storage system (not a real-time editor), LWW with conflict copies is the pragmatic choice.

Our approach for the interview:

  1. Every file has a version number in the metadata DB
  2. When a client uploads changes, it includes the version number it’s based on
  3. If the server’s current version matches, the update succeeds (optimistic concurrency)
  4. If someone else updated the file in between, we have a conflict
  5. We save the conflicting version as a separate file and notify the user
  6. The user resolves the conflict manually
Client A: "Update file_99, I'm based on version 2"
Server:   "Current version is 2. Accepted. Now version 3."

Client B: "Update file_99, I'm based on version 2"
Server:   "Current version is 3, not 2. Conflict!"
Server:   Save Client B's version as "report (conflict).pdf"
Server:   Notify Client B about the conflict

Deep Dive 3: Real-Time Sync Notification

When a file changes on one device, how do all other devices find out? We need a notification system that triggers sync.

Long polling approach (what Dropbox originally used):

Each device maintains a long-poll connection to the Notification Service. The connection stays open for up to 60 seconds. If a change happens during that time, the server responds immediately. If nothing changes, the connection times out and the client opens a new one.

Desktop Client → Notification Service: "Any changes since sync_token_42?"
                 (connection stays open)

... 20 seconds later, another device uploads a change ...

Notification Service → Desktop Client: "Yes! Changes detected."
Desktop Client → Sync Service: "GET /sync?cursor=sync_token_42"
                → Gets list of changed files
                → Downloads changed chunks
                → Updates local files

WebSocket approach (more modern):

A persistent WebSocket connection between each client and the Notification Service. When a change happens, the server pushes immediately. Lower latency than long polling.

The full sync flow:

  1. User A edits report.pdf on their laptop
  2. Desktop client detects the change (filesystem watcher)
  3. Client chunks the file, computes hashes, uploads new chunks
  4. Client calls /upload/complete → Sync Service updates metadata
  5. Sync Service writes to the sync log and publishes an event to the message queue
  6. Notification Service reads the event
  7. Notification Service pushes to User A’s other devices + any shared workspace members
  8. Each notified device calls /sync?cursor=... to get the change details
  9. Each device downloads the new chunks and reassembles the file locally

Scaling the notification service:

  • Each active device maintains one connection. 10M DAU with 2.5 devices average = 25M connections.
  • Shard by user_id — each notification server handles a subset of users
  • Use a connection registry (Redis) to map user_id to notification server
  • The message queue (Kafka) decouples the sync service from notifications — the sync service publishes the event, and however many notification servers consume it

Step 7: Scaling

Block storage:

  • Use S3 or a custom object store (Dropbox built Magic Pocket to move off S3 and save costs)
  • Replicate across 3+ data centers for durability
  • Use storage tiers: hot (frequently accessed), warm (less frequent), cold (old versions)
  • Content-addressable storage means dedup is automatic — same hash, same object

Metadata DB:

  • Shard by user_id — all of a user’s files live on the same shard
  • Read replicas for the sync endpoint (reads vastly outnumber writes)
  • The files table is the most critical table — it must be consistent and durable
  • Index on (parent_folder_id, file_name) for directory listings

Sync performance:

  • The sync endpoint uses cursor-based pagination on the sync_log table
  • Clients only request changes since their last sync point — not the full file tree
  • For the initial sync (new device), send the full file tree in batches

Chunking optimization:

  • 4 MB chunk size is a good balance: small enough to limit re-upload on edits, large enough to avoid too many chunks per file
  • Compute chunk hashes on the client side — this avoids uploading data the server already has
  • Use SHA-256 for chunk hashing — collision probability is effectively zero

CDN for downloads:

  • Popular shared files (company-wide documents) can be served from CDN
  • Most file access is personal (user’s own files), so CDN helps less than in video streaming
  • CDN is more useful for the web client’s static assets

Rate limiting and quotas:

  • Storage quotas per user (2 GB free, 2 TB paid)
  • Rate limit file operations per user (prevent abuse)
  • Throttle sync for clients that are doing massive initial uploads

Multi-region:

  • Store files in the region closest to the user
  • For shared workspaces with users in different regions, replicate data to both regions
  • Metadata must be globally consistent — use a global metadata store with cross-region replication

In simple language, a file storage service is built on three key ideas. First, chunking — split files into 4 MB pieces so we only transfer what changed. Second, deduplication — store each unique chunk exactly once by using content hashes as identifiers. Third, sync notifications — use WebSocket or long polling to tell devices when something changed, then let the device fetch the changes. The metadata DB (which chunks make up which files) is the source of truth. The block storage (the actual bytes) is the heavy storage layer. And the notification service is the glue that keeps everything in sync across devices.


45

Design an E-Commerce Platform (Amazon)

advanced 4-7 YOE e-commerce system-design inventory order-processing search

We’re designing an e-commerce platform like Amazon. This is arguably the most comprehensive system design question because it touches nearly every concept: product catalog, search, shopping cart, order processing, payment, inventory management, and recommendations. Interviewers love it because there are so many directions to go deep.

The core challenge: hundreds of millions of products, millions of concurrent users browsing and buying, and we can never sell something that’s out of stock or lose an order. Let’s build it.

Step 1: Requirements

Functional Requirements

  • Browse and search products (by keyword, category, filters)
  • Product detail pages (images, description, reviews, pricing)
  • Shopping cart (add, remove, update quantity)
  • Checkout and place orders
  • Payment processing (credit card, wallet)
  • Order tracking (order status, shipping updates)
  • Seller management (list products, manage inventory)
  • Product reviews and ratings

Non-Functional Requirements

  • High availability — downtime means lost revenue. Every minute of downtime costs millions.
  • Low latency — product pages should load in < 200ms, search in < 500ms
  • Strong consistency for inventory — we must never oversell a product
  • Eventual consistency for catalog — it’s okay if a new product takes a few seconds to appear in search
  • Scale — 500M products, 300M users, 100K orders/day, millions of concurrent browsers

Step 2: Estimation

Assumptions:

  • 300M total users, 50M daily active users
  • 500M products in the catalog
  • 100K orders per day (peak: 10x during sales events like Prime Day)
  • Average order: 3 items, $50 total
  • Each active user views ~20 product pages per session

QPS:

Product page views: 50M × 20 / 86,400 ≈ ~12,000 views/sec
Search queries:     50M × 5 searches / 86,400 ≈ ~3,000 searches/sec
Cart operations:    50M × 3 / 86,400 ≈ ~1,700 ops/sec
Orders:             100K / 86,400 ≈ ~1 order/sec (peak: ~12/sec on sale days)
Peak (sale events): all above × 10

Storage:

Product catalog:  500M products × 10 KB each = 5 TB
Product images:   500M × 5 images × 500 KB = 1.25 PB
Order data:       100K/day × 2 KB = 200 MB/day, 73 GB/year
User data:        300M × 5 KB = 1.5 TB

Product page views dominate. 12,000 reads/sec for product pages is the hottest path. This is a heavily read-heavy workload.

Step 3: High-Level Design

E-Commerce — High-Level Architecture
Users (Web / Mobile)
CDN (images) API Gateway
Product Service
catalog + details
Search Service
Elasticsearch
Cart Service
Redis + DB
Order Service
order lifecycle
Payment Service
Stripe / internal
Inventory Service
stock management
Product DB Order DB Elasticsearch Redis Cache Message Queue
Browse: User → CDN + API → Product Service → Cache/DB → Product Page
Order: Cart → Order Service → Inventory (reserve) → Payment → Fulfill

This is a microservices architecture. Each service owns its own data and logic:

  • Product Service — manages the product catalog. CRUD for products, categories, and pricing. The read path is heavily cached.
  • Search Service — powered by Elasticsearch. Handles keyword search, autocomplete, filters (price range, rating, category), and sorting.
  • Cart Service — manages shopping carts. Cart for logged-in users is persisted in the database. Cart for anonymous users lives in Redis with a session cookie.
  • Order Service — the orchestrator. Manages the order lifecycle from checkout to delivery.
  • Inventory Service — tracks stock levels. The most critical service for data consistency.
  • Payment Service — integrates with payment providers (Stripe, PayPal). Handles charges, refunds, and receipts.

Step 4: API Design

-- Product APIs
GET  /api/v1/products/{product_id}
GET  /api/v1/products?category=electronics&sort=price_asc&page=1
GET  /api/v1/search?q=wireless+headphones&min_price=50&max_price=200

-- Cart APIs
GET    /api/v1/cart
POST   /api/v1/cart/items
  Body: { "product_id": "prod_42", "quantity": 2 }
PUT    /api/v1/cart/items/{item_id}
  Body: { "quantity": 3 }
DELETE /api/v1/cart/items/{item_id}

-- Order APIs
POST /api/v1/orders/checkout
  Body: { "shipping_address_id": "addr_7",
          "payment_method_id": "pm_stripe_123" }
  Response: { "order_id": "ord_999", "status": "pending",
              "total": "$149.97", "estimated_delivery": "2026-04-02" }

GET  /api/v1/orders/{order_id}           -- order details + tracking
GET  /api/v1/orders                      -- order history

-- Seller APIs
POST /api/v1/seller/products
  Body: { "title": "Wireless Headphones", "price": 49.99,
          "category": "electronics", "stock": 500,
          "images": ["img_1", "img_2"] }
PUT  /api/v1/seller/products/{product_id}/inventory
  Body: { "stock_delta": 100 }           -- add 100 units

Step 5: Data Model

-- Users table (PostgreSQL)
CREATE TABLE users (
    user_id         BIGINT PRIMARY KEY,
    email           VARCHAR(255) UNIQUE,
    name            VARCHAR(100),
    password_hash   VARCHAR(255),
    default_address_id BIGINT,
    created_at      TIMESTAMP
);

-- Products table (PostgreSQL — heavy caching)
CREATE TABLE products (
    product_id      BIGINT PRIMARY KEY,
    seller_id       BIGINT NOT NULL,
    title           VARCHAR(300),
    description     TEXT,
    category_id     BIGINT,
    price           DECIMAL(10,2),
    compare_at_price DECIMAL(10,2),          -- original price for "on sale" display
    image_urls      JSON,
    avg_rating      DECIMAL(2,1) DEFAULT 0,
    review_count    INT DEFAULT 0,
    status          VARCHAR(20),             -- 'active', 'draft', 'out_of_stock'
    created_at      TIMESTAMP,
    updated_at      TIMESTAMP,
    INDEX idx_category (category_id),
    INDEX idx_seller (seller_id)
);

-- Inventory table (PostgreSQL — strict consistency)
CREATE TABLE inventory (
    product_id      BIGINT PRIMARY KEY,
    available_stock INT NOT NULL DEFAULT 0,  -- what's available to sell
    reserved_stock  INT NOT NULL DEFAULT 0,  -- held for pending orders
    version         INT NOT NULL DEFAULT 0,  -- for optimistic locking
    updated_at      TIMESTAMP
);

-- Cart table (PostgreSQL — for logged-in users)
CREATE TABLE cart_items (
    user_id         BIGINT,
    product_id      BIGINT,
    quantity        INT NOT NULL,
    added_at        TIMESTAMP,
    PRIMARY KEY (user_id, product_id)
);

-- Orders table (PostgreSQL)
CREATE TABLE orders (
    order_id        BIGINT PRIMARY KEY,
    user_id         BIGINT NOT NULL,
    status          VARCHAR(20),             -- 'pending', 'paid', 'shipped',
                                             -- 'delivered', 'cancelled', 'refunded'
    total_amount    DECIMAL(10,2),
    shipping_address JSON,
    payment_id      BIGINT,
    ordered_at      TIMESTAMP,
    shipped_at      TIMESTAMP,
    delivered_at    TIMESTAMP,
    INDEX idx_user_orders (user_id, ordered_at DESC)
);

-- Order items table
CREATE TABLE order_items (
    order_id        BIGINT,
    product_id      BIGINT,
    quantity        INT,
    price_at_order  DECIMAL(10,2),           -- snapshot of price at time of order
    PRIMARY KEY (order_id, product_id)
);

-- Payments table (PostgreSQL)
CREATE TABLE payments (
    payment_id      BIGINT PRIMARY KEY,
    order_id        BIGINT UNIQUE NOT NULL,
    amount          DECIMAL(10,2),
    method          VARCHAR(20),             -- 'card', 'wallet', 'bank'
    provider_ref    VARCHAR(100),            -- Stripe charge ID
    status          VARCHAR(20),             -- 'pending', 'charged', 'refunded', 'failed'
    idempotency_key VARCHAR(100) UNIQUE,     -- prevent double charges
    processed_at    TIMESTAMP
);

Important: Notice the price_at_order field in order_items. We snapshot the price when the order is placed. If the seller changes the price tomorrow, it shouldn’t affect existing orders. This is a common mistake in interviews — always capture the price at order time.

Step 6: Deep Dives

Deep Dive 1: Inventory Management (The Hardest Part)

Inventory is where things get tricky. Imagine 100 people trying to buy the last unit of a popular product at the same time. We must ensure exactly one person gets it, not two, not zero.

The race condition problem:

Stock: 1 unit left

Thread A: reads stock = 1 → "okay, can sell!"
Thread B: reads stock = 1 → "okay, can sell!"
Thread A: stock = stock - 1 → sets stock to 0
Thread B: stock = stock - 1 → sets stock to 0 (but we already sold it!)

Result: We sold 2 units but only had 1. Oversold!

Solution: Optimistic locking with version numbers

Every update includes the version number. If someone else updated the row between our read and write, our update fails and we retry.

-- Read the current stock and version
SELECT available_stock, version FROM inventory WHERE product_id = 42;
-- Returns: available_stock = 1, version = 7

-- Try to decrement stock (only succeeds if version hasn't changed)
UPDATE inventory
SET available_stock = available_stock - 1,
    version = version + 1
WHERE product_id = 42
  AND version = 7
  AND available_stock >= 1;

-- If rows_affected = 0, someone else got there first. Retry or show "sold out"

This is called optimistic locking because we’re optimistic that no one else is modifying the row. If they did, we detect it and retry. No heavy database locks needed.

The reserved stock pattern:

We don’t actually decrement stock when the user places an order. We reserve it first.

Available: 10, Reserved: 0

User places order:
  Available: 9, Reserved: 1    (reserved for this order)

Payment succeeds:
  Available: 9, Reserved: 0    (reserved → sold, stock stays at 9)

Payment fails:
  Available: 10, Reserved: 0   (release the reservation)

Why? Because payment might take seconds to process. If we decrement immediately and the payment fails, we’d have to add it back. With reservation, we hold the item during payment processing, and only truly sell it when payment succeeds.

Reservation timeout:

What if the user goes through checkout but never completes payment? We set a timeout (e.g., 10 minutes). If the payment isn’t completed within that window, the reservation expires and the stock becomes available again. A background job cleans up expired reservations.

Deep Dive 2: Order Processing Pipeline

An order goes through multiple steps across multiple services. If any step fails, we need to handle it gracefully. This is where the saga pattern comes in.

The happy path:

1. User clicks "Place Order"
2. Order Service creates order (status: pending)
3. Inventory Service reserves stock
4. Payment Service charges the user
5. Inventory Service confirms (reserved → sold)
6. Order Service updates status to "paid"
7. Notification: "Your order is confirmed!"
8. Fulfillment Service picks + packs + ships
9. Order Service updates status to "shipped"
10. Delivery confirmed → status: "delivered"

What if payment fails at step 4?

We need to compensate — undo everything we did:

  • Release the inventory reservation (step 3 rollback)
  • Mark the order as “failed”
  • Notify the user: “Payment failed, please try again”

The saga pattern:

In simple language, a saga is a sequence of steps where each step has a rollback action. If step N fails, we run the rollback for steps N-1, N-2, … down to step 1.

Step 1: Create order           → Rollback: cancel order
Step 2: Reserve inventory      → Rollback: release reservation
Step 3: Charge payment         → Rollback: refund payment
Step 4: Confirm inventory      → (no rollback needed — order is final)

We implement this using a message queue. Each service publishes events, and the next service in the chain listens for them.

Order Service → publishes "order_created"
  → Inventory Service hears it → reserves stock → publishes "stock_reserved"
    → Payment Service hears it → charges card → publishes "payment_charged"
      → Inventory Service hears it → confirms reservation
      → Order Service hears it → marks order as "paid"

If Payment fails → publishes "payment_failed"
  → Inventory Service hears it → releases reservation
  → Order Service hears it → marks order as "failed"

Idempotency is critical:

What if the payment message gets delivered twice? We’d charge the user twice. To prevent this, every payment request includes an idempotency key (usually the order_id). The payment provider checks: “Have I seen this key before? If yes, return the previous result instead of charging again.”

POST /charge
  { "amount": 149.97, "idempotency_key": "ord_999" }

First call:  charges $149.97, returns success
Second call: returns the SAME success response without charging again

This is a non-negotiable requirement for any payment system.

Deep Dive 3: Product Search and Recommendations

Search with Elasticsearch:

The product catalog lives in PostgreSQL (source of truth), but search queries go to Elasticsearch. Why? Because Elasticsearch is built for full-text search with relevance ranking, fuzzy matching, autocomplete, and faceted filtering. PostgreSQL’s LIKE '%headphones%' would be painfully slow on 500M products.

How we keep Elasticsearch in sync:

Seller updates product in PostgreSQL
  → Change Data Capture (CDC) detects the change
  → Publishes event to Kafka
  → Elasticsearch consumer reads the event
  → Updates the search index

Eventual consistency: there's a 1-5 second delay. That's fine for search.

Search features:

GET /search?q=wireless+headphones&category=electronics&min_price=50&sort=relevance

Elasticsearch query:
{
  "query": {
    "bool": {
      "must": { "multi_match": { "query": "wireless headphones", "fields": ["title^3", "description"] }},
      "filter": [
        { "term": { "category": "electronics" }},
        { "range": { "price": { "gte": 50 }}}
      ]
    }
  },
  "sort": ["_score"]
}

The title^3 means title matches are weighted 3x more than description matches. This is how we control relevance.

Autocomplete: Elasticsearch’s completion suggester gives us type-ahead suggestions as the user types. “wire…” → [“wireless headphones”, “wireless mouse”, “wireless charger”].

Product recommendations:

Two common approaches for an e-commerce platform:

  1. “Customers who bought X also bought Y” — collaborative filtering. We look at purchase patterns across all users. If 70% of people who bought a phone case also bought a screen protector, we recommend screen protectors on the phone case page. Simple, effective, and doesn’t need ML.

  2. “Based on your browsing history” — personalized recommendations. We track what the user browsed, what they added to cart, and what they bought. Then we find similar products using content-based similarity or a trained model.

For the interview, mentioning collaborative filtering and explaining how the “frequently bought together” feature works is usually enough.

Step 7: Scaling

Product Service (heaviest read load):

  • Multi-layer caching: CDN (for product images) → Redis (for product data) → Read replicas (for DB)
  • Cache product pages aggressively. Product data rarely changes.
  • Cache invalidation: when a seller updates a product, invalidate the Redis cache and purge the CDN
  • Shard the products table by product_id

Search Service:

  • Elasticsearch cluster with multiple shards per index
  • Separate read and write nodes — heavy search load shouldn’t impact indexing
  • Warm standby cluster for failover

Order Service:

  • Shard orders by user_id — a user’s orders live on the same shard
  • Event-driven architecture: use Kafka for inter-service communication
  • The order table grows forever, but we only query recent orders. Partition by date.

Inventory Service (most critical for consistency):

  • Single source of truth for stock levels — no caching layer for writes
  • Read replicas for display purposes (“In Stock” badge on product pages)
  • The actual stock check at checkout time must hit the primary database
  • For flash sales: pre-warm inventory data in Redis, use Redis atomic operations (DECR) for ultra-fast stock reservation, then sync to DB

Handling sale events (Black Friday / Prime Day):

  • Auto-scale all services 3-5 days before the event
  • Pre-generate and cache popular product pages
  • Use a queue-based checkout: if the system is overwhelmed, put users in a virtual queue instead of crashing
  • Rate limit cart operations per user (prevent bots from hoarding inventory)
  • Feature flags: disable non-essential features (reviews, recommendations) to save capacity for core shopping flow

Payment processing:

  • Never process payments synchronously in the order flow — use async processing via message queue
  • Retry with exponential backoff for payment failures
  • Idempotency keys on every single payment call
  • PCI compliance: never store raw credit card numbers. Use tokenized payment methods (Stripe tokens)

Database strategy:

  • PostgreSQL for transactional data (orders, inventory, payments) — we need ACID guarantees
  • Elasticsearch for search — eventually consistent with the source of truth
  • Redis for caching (product data, sessions, carts) and real-time counters
  • S3 for product images and static assets, fronted by a CDN

In simple language, an e-commerce platform is a collection of independent services — products, search, cart, orders, inventory, payments — each owning its own data. The hardest problems are inventory management (preventing overselling with optimistic locking and reserved stock), order processing (coordinating multiple services with the saga pattern), and search (keeping Elasticsearch in sync with the catalog). Everything is heavily cached because reads dominate writes by 100:1. And the payment system needs bulletproof idempotency because charging someone twice is the quickest way to lose trust. The beauty of this architecture is that each service scales independently — we can throw 10x more resources at the product service during a sale without touching the payment service.