High-Level Design — Quick Summary

Quick revision: every topic, key terms, and mnemonics for High-Level Design.

This is a quick revision doc covering all 45 topics in HLD. Open the linked notes if you want depth.

Foundations & Approach

What Is System Design?

What it is. Defining the architecture, components, and data flow of a system to meet requirements. In an interview it is a conversation, not a test.

Key terms.

Trade-offs — every choice has costs.
Building blocks — load balancers, caches, queues, databases.
Stages of growth — single server → separate DB → LB+multiple servers → distributed system.

Remember. Interviewers evaluate communication, problem breakdown, trade-off awareness, breadth, and depth — not perfect architecture.

How to Approach System Design

What it is. Repeatable framework for any 45-min interview.

Key terms (5-step framework).

Step 1 (5 min) — Requirements & Scope.
Step 2 (5 min) — Back-of-the-envelope estimation.
Step 3 (15 min) — High-level design (boxes + APIs + data flow).
Step 4 (15 min) — Deep dive (2-3 areas).
Step 5 (5 min) — Wrap up + improvements.

Remember. Common mistakes: jumping to solution, over-engineering for Google-scale, silence, no trade-offs, ignoring interviewer hints.

Requirements Gathering

What it is. Turning vague prompts into concrete scope.

Key terms.

Functional requirements — what the system does (features).
Non-functional requirements — how well (scalability, availability, consistency, latency, durability).
Availability nines — allowed downtime per year: 99% (3.65 days), 99.9% (8.76 hrs), 99.99% (52.6 min), 99.999% (5.26 min).

Remember. Always ask: How many users? Read/write ratio? Real-time vs batch? Consistency vs availability? When in conflict, state which one wins.
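
The availability nines above follow from simple arithmetic. A minimal sketch, assuming a 365-day year; the function name is illustrative only:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (leap years ignored)


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:,.1f} min/yr ({minutes / 60:,.2f} h)")
```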

Back-of-the-Envelope Estimation

What it is. Quick math to feel the scale. Order-of-magnitude only.

Key terms.

86,400 seconds/day — round to 100K (10^5).
Peak QPS ≈ 2-3x avg.
Numbers to know — RAM 100ns, SSD 150μs, datacenter RTT 0.5ms, cross-continent ~150ms.
Storage units — 2^10 ≈ 1K, 2^20 ≈ 1M, 2^30 ≈ 1B, 2^40 ≈ 1T.
80/20 rule — roughly 20% of the data serves 80% of the traffic, so budget the cache for that hot 20%.

Remember. Round aggressively. State assumptions. Don’t spend more than 5 min.
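
One way to turn the numbers above into an estimate. This is a rough sketch: the workload figures (10M daily users, 10 requests each, 10% writes, 1 KB per write) and the 2.5x peak factor are made-up assumptions, not values from these notes.

```python
SECONDS_PER_DAY = 86_400


def estimate(daily_users, requests_per_user, write_fraction,
             bytes_per_write, peak_factor=2.5):
    """Order-of-magnitude QPS and storage estimate from stated assumptions."""
    daily_requests = daily_users * requests_per_user
    avg_qps = daily_requests / SECONDS_PER_DAY
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        "storage_gb_per_day": daily_requests * write_fraction * bytes_per_write / 1e9,
    }


result = estimate(daily_users=10_000_000, requests_per_user=10,
                  write_fraction=0.1, bytes_per_write=1_000)
for name, value in result.items():
    print(f"{name}: {value:,.1f}")
```

In an interview the same math is done by rounding 86,400 to 10^5, which gives about 1,000 QPS on average for this workload.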

Design Principles and Trade-offs

What it is. Every design choice is a trade-off.

Key terms.

Scalability — vertical (bigger machine) vs horizontal (more machines).
Latency vs Throughput — fast individual reqs vs many req/sec.
SPOF — Single Point of Failure. Eliminate at every layer.
Stateless vs Stateful — push state to external stores so app servers can scale.
Consistency vs Availability — banking picks C, social media picks A.
p50 / p95 / p99 — latency percentiles. p99 catches the slow tail.

Remember. No “best” architecture, only the right one for the requirements.
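
Why p99 catches the slow tail can be seen with a small sketch. It uses the nearest-rank definition of a percentile (other definitions interpolate), and the latency samples are invented for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 95 fast requests, 4 slow ones, 1 very slow one (milliseconds)
latencies_ms = [10] * 95 + [200] * 4 + [2000]

print("mean:", sum(latencies_ms) / len(latencies_ms))
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))
```

The median and p95 look healthy here while p99 exposes the slow requests.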

Core Building Blocks

DNS and How the Internet Works

What it is. Translates names → IPs. The first step of every web request.

Key terms.

Hierarchy — Stub resolver → Recursive resolver → Root → TLD → Authoritative.
Records — A (IPv4), AAAA (IPv6), CNAME (alias), NS, MX (mail).
TTL — caching duration.

Remember. DNS is not just resolution — it does load distribution (round-robin), geo-routing, and failover.

Load Balancers

What it is. Distributes traffic across servers. Removes SPOF.

Key terms.

L4 (transport) — IP+port. Fast, dumb. NLB.
L7 (application) — reads HTTP. Path/host routing, sticky sessions, TLS termination. ALB, NGINX.
Algorithms — Round Robin, Weighted Round Robin, Least Connections, IP Hash, Random.
Health checks — periodic ping; remove dead servers.
Active-passive vs Active-active redundancy.

Remember. Tools: NGINX, HAProxy, AWS ALB/NLB, Caddy. Always have redundant LBs (LB itself is a SPOF).
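
Two of the algorithms above, sketched in-process. Class and server names are illustrative; a real load balancer also handles health checks, weights, and concurrency.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the request finishes so the count stays accurate.
        self.active[server] -= 1


rr = RoundRobin(["app-1", "app-2", "app-3"])
print([rr.pick() for _ in range(4)])
```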

Caching

What it is. Storing data in faster location to avoid the slow source.

Key terms.

Layers — Browser → CDN → App cache (Redis/Memcached) → DB cache.
Hit / miss / hit ratio — aim for 95%+ on static, 80%+ on dynamic.
Eviction — LRU (default), LFU, FIFO, TTL.
Patterns — Cache-Aside (most common), Write-Through, Write-Behind.

Remember. Don’t cache: frequently-changing data, low-traffic data, write-heavy workloads, must-be-perfectly-consistent data.
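
LRU eviction, the default policy listed above, can be sketched with an ordered dictionary. This is a single-process illustration; the class name and capacity are assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the least recently used key once capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None  # cache miss
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now the most recently used key
cache.put("c", 3)   # evicts "b"
print(cache.get("b"), cache.get("a"), cache.get("c"))
```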

Content Delivery Networks (CDNs)

What it is. Geographically distributed cache servers near users.

Key terms.

Edge / PoP — point of presence.
Pull CDN (default) — fetches from origin on miss.
Push CDN — we upload ahead.
Cache invalidation — TTL, explicit purge, versioned URLs (best).

Remember. Use CDN for static content (images, JS, CSS, video). Don’t use for dynamic personalised content. Pair with object storage (S3+CloudFront).

Message Queues

What it is. Async buffer between producer and consumer. Decouples services.

Key terms.

Point-to-Point — one message → one consumer.
Pub/Sub — one message → all subscribers.
DLQ — Dead Letter Queue for poison messages.
Tools — Kafka (high throughput, retains messages, replay), RabbitMQ (complex routing), SQS (managed).

Remember. Use queues for: decoupling services, traffic spikes, retries, async heavy work (email, report, video encoding).

Proxies and Reverse Proxies

What it is. Intermediary servers. Forward = hides client. Reverse = hides server.

Key terms.

Reverse proxy duties — TLS termination, load balancing, compression, caching, request routing, rate limiting, WAF.
Tools — Nginx, Caddy (auto HTTPS), HAProxy, Traefik (k8s native).
Reverse proxy vs LB vs API Gateway — overlapping; many tools do all three.

Remember. “Proxy to access blocked sites” = forward (VPN). “Nginx in front of my app” = reverse. CDN = distributed reverse proxy.

Database Deep Dive

SQL vs NoSQL

What it is. Relational vs non-relational. Pick based on data shape + scale + consistency.

Key terms.

SQL — fixed schema, ACID, vertical scaling, JOINs. Postgres, MySQL.
NoSQL types — Key-Value (Redis, DynamoDB), Document (MongoDB), Wide-column (Cassandra), Graph (Neo4j).
ACID vs BASE — strict guarantees vs eventual consistency.

Remember. Real systems use both. PostgreSQL primary + Redis cache + Elasticsearch search is the classic combo.

Database Indexing

What it is. Separate sorted structure for fast lookups. Like book index.

Key terms.

B-Tree — default, O(log n) lookups + range queries.
Hash index — O(1) but no ranges.
Composite index — leftmost prefix rule on (a, b, c).
Covering index — INCLUDE all needed columns; index-only scan.
Unique index — enforces uniqueness.

Remember. Indexes speed reads, slow writes, take disk space. Don’t index small tables, low-cardinality columns, write-heavy tables.

Database Replication

What it is. Copies of data on multiple servers. HA + read scaling + DR.

Key terms.

Single-leader (master-slave) — one writer, many readers. Most common.
Multi-leader — for multi-region writes; conflict resolution hard (LWW, CRDTs).
Leaderless (Dynamo) — quorum-based: w + r > n.
Sync vs Async — sync = safe + slow; async = fast + risky.
Replication lag — read-after-write inconsistency, monotonic reads, causality.

Remember. Replication scales reads, not writes. Always plan for replication lag.
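
The quorum rule w + r > n can be checked by brute force: it holds exactly when every possible write set and read set share at least one replica. A small sketch with an illustrative function name:

```python
from itertools import combinations


def always_overlap(n, w, r):
    """True if every set of w written replicas intersects every set of r read replicas."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 2, 3)]:
    print(f"n={n} w={w} r={r}: w+r>n is {w + r > n}, overlap guaranteed: {always_overlap(n, w, r)}")
```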

Database Sharding

What it is. Splitting data across multiple servers. Each shard = subset.

Key terms.

Hash-based — even distribution, no range queries, painful resharding.
Range-based — easy ranges, hot spots possible.
Directory-based — flexible, lookup service is SPOF.
Cross-shard JOINs — expensive; denormalize to avoid.
Hot spots — celebrity user problem; mitigate with random suffix.

Remember. Sharding is a last resort. Try vertical scaling, replicas, caching, query optimization first.

Consistent Hashing

What it is. Hash ring that minimizes data movement when nodes change.

Key terms.

The ring — positions 0 to 2^32 − 1, wrapping around. Servers + keys placed via hash.
Walk clockwise — first server hit owns the key.
Adding a node — only nearby keys move (~1/N).
Virtual nodes (vnodes) — 100-200 placements per server for even distribution.

Remember. Used in Memcached clients, Cassandra, DynamoDB, Akamai CDN. Without it, hash % N reshuffles most keys when a node is added: about N/(N+1) of them, e.g. ~75% going from 3 to 4 nodes.
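
A minimal ring with virtual nodes, compared against hash % N. This is a sketch under assumptions: MD5 truncated to 32 bits as the hash, 100 vnodes per server, and made-up node and key names.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on a 32-bit ring."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each server is placed at many positions to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def owner(self, key):
        # Walk clockwise: first placement at or after the key, wrapping at the end.
        idx = bisect.bisect_right(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {key: ring.owner(key) for key in keys}
ring.add("node-d")
after = {key: ring.owner(key) for key in keys}

moved = [key for key in keys if before[key] != after[key]]
ring_moved = len(moved) / len(keys)
modulo_moved = sum(_hash(k) % 3 != _hash(k) % 4 for k in keys) / len(keys)
print(f"ring moved {ring_moved:.0%}, modulo moved {modulo_moved:.0%}")
```

Every key that moves on the ring lands on the new node; no key is shuffled between the existing nodes.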

ACID and Transactions

What it is. Four guarantees for reliable transactions: atomicity, consistency, isolation, durability. (Covered in depth in the DBMS notes.)

Isolation Levels Cheatsheet.

Level	Dirty	Non-Repeatable	Phantom
Read Uncommitted	possible	possible	possible
Read Committed (Postgres default)	prevented	possible	possible
Repeatable Read (MySQL default)	prevented	prevented	possible
Serializable	prevented	prevented	prevented

Remember. ACID for money/inventory/auth. BASE for feeds/likes/views.

Scalability Patterns

Horizontal vs Vertical Scaling

What it is. Bigger machine vs more machines.

Key terms.

Vertical (scale up) — simple, low complexity, hard ceiling, SPOF.
Horizontal (scale out) — distributed complexity, no theoretical limit, fault tolerant.

Remember. Start vertical for simplicity. Design stateless so you can go horizontal later. Most large systems use a mix.

Microservices vs Monolith

What it is. One app vs many small apps.

Key terms.

Monolith — simple, fast iteration, scales together. Good for small teams.
Microservices — independent deploy/scale, fault isolation, polyglot. High operational overhead.
Communication — Sync (HTTP/gRPC) vs Async (message queues). Most use both.
Service discovery — Consul, Eureka, k8s DNS.

Remember. “Monolith first” (Fowler). Amazon, Netflix, Uber all started monolith. Extract services when pain points appear.

API Gateway

What it is. Single entry point for microservices.

Key terms.

Duties — request routing, auth, rate limit, load balance, transformation, response aggregation, caching, logging.
Response aggregation — gateway fans out to N services and merges. Big win for mobile.
Tools — Kong, AWS API Gateway, Nginx, Traefik, Express Gateway.
Difference from LB — LB distributes across copies of one service; gateway routes between services.

Remember. Without it, every client knows every service URL. With it, one door. Make it HA — it’s a SPOF if not.

Denormalization and Read-Write Separation

What it is. Trade duplication for fast reads. Split reads from writes.

Key terms.

Denormalization — copy fields, cached counters, summary tables.
Read replicas — primary handles writes; replicas handle reads.
Replication lag — read-your-own-writes problem.
CQRS Lite — separate read/write DB.

Remember. “Update name” requires updating all denormalized copies — that’s the price.

Blob Storage and Object Storage

What it is. Storage for files. Object = flat key-value blobs over HTTP.

Key terms.

Block — raw disk attached to one machine (EBS).
File — shared filesystem (NFS, EFS).
Object — S3, GCS, Azure Blob. Virtually unlimited.
Pre-signed URLs — let client upload/download directly to/from S3 without going through our server.
Storage classes — Standard / IA / Archive (Glacier). Use lifecycle policies.

Remember. Object storage + CDN is the gold standard. S3 alone has 11 nines durability.

Reliability & Consistency

CAP Theorem

What it is. During a network partition, a distributed system must choose between consistency (C) and availability (A). Also covered in the DBMS notes.

Key terms.

C — Every read sees latest write.
A — Every request gets a response.
P — System works through partitions (not optional).
CP — MongoDB, HBase, etcd, Zookeeper.
AP — Cassandra, DynamoDB, CouchDB, Riak.

Remember. Trade-off only kicks in during partition. PACELC extends with Else (no partition) → Latency or Consistency.

Cheatsheet — CAP Triad

Pick	Sacrifice	Examples
CP	Availability during partition	MongoDB, etcd
AP	Consistency during partition	Cassandra, DynamoDB
CA	Partition tolerance (single-node only)	Single Postgres

Consistency Models

What it is. Spectrum from strong to eventual.

Key terms.

Strong / Linearizable — every read = latest write. Spanner, CockroachDB.
Sequential — global total order, slightly weaker than linearizable.
Causal — preserves cause-and-effect order. MongoDB causal sessions.
Read-Your-Writes — you see your own writes. Common UX fix.
Monotonic Reads — never see a value older than what we already saw.
Eventual — converges given time. DynamoDB, Cassandra default.

Remember. Mix and match per use case. Inventory → strong. Likes → eventual. Profile updates → read-your-writes.

Cheatsheet — Consistency Models

Model	Strength	Latency	Example
Linearizable	Strongest	Highest	Spanner
Sequential	Strong	High	Some DBs
Causal	Medium	Medium	MongoDB causal
Read-your-writes	Weak+	Low+	Sticky sessions
Eventual	Weakest	Lowest	DynamoDB default

Failover and Redundancy

What it is. Automatic switch from failed component to backup.

Key terms.

Active-Passive — one works, one waits. Standby uses warm/hot replication.
Active-Active — all serve traffic; survivors absorb load.
Health checks (pull) vs Heartbeats (push).
SLI / SLO / SLA — Indicator (metric) / Objective (internal target) / Agreement (customer promise).
Standby types — cold (off, restore from backup), warm (replicating, idle), hot (synced, ready).

Remember. Always have redundant LBs. Going from 99.9% to 99.99% cuts the yearly downtime budget from about 8.8 hours to about 53 minutes, so each extra nine is far harder to achieve. Aim for 3 nines on web apps, 4-5 for critical infra.

Circuit Breaker and Bulkhead Patterns

What it is. Prevent cascading failures.

Key terms.

Circuit Breaker states — Closed (normal) → Open (fail fast) → Half-Open (test one request).
Fallback — cached data, default value, queued, degraded.
Bulkhead — separate thread/connection pools per service.
Retry with exponential backoff + jitter.
Don’t retry — most 4xx errors (429 and 408 are the usual exceptions), non-idempotent operations.
Libraries — Resilience4j (Java), Polly (.NET), Hystrix (deprecated).

Remember. Three patterns combine: Bulkhead isolates → Circuit breaker fails fast → Retry handles transients. Together they prevent cascading collapse.
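
The three circuit breaker states can be sketched as a small state machine. The thresholds, the exception class, and the injectable clock are illustrative assumptions; libraries such as Resilience4j and Polly offer far more options.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0
        self.state = "closed"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "half_open"  # timeout elapsed: let one trial request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial, or too many failures in a row, opens the circuit.
            if self.state == "half_open" or self._failures >= self.failure_threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self.state = "closed"
        return result
```

A caller would wrap each downstream request in breaker.call(...) and serve a fallback when CircuitOpenError is raised.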

Monitoring, Logging, and Alerting

What it is. Three pillars of observability.

Key terms.

Metrics — numeric, aggregated. “What’s happening?” Prometheus, Datadog.
Logs — discrete events, detailed. “What happened?” ELK, Loki.
Traces — request journey across services. “Where is it slow?” Jaeger, Zipkin.
Four Golden Signals (Google SRE) — Latency, Traffic, Errors, Saturation.
Latency percentiles — p50, p95, p99; p99 matters most.
Structured logs — JSON for searchability.
Alert on symptoms, not causes. SLO breaches.
Avoid alert fatigue — every alert must be actionable + have runbook.

Remember. Mention monitoring at end of system design — shows production maturity.

Communication Protocols

REST API Design

What it is. Resource-based architecture over HTTP. Stateless.

Key terms.

Methods — GET (read), POST (create), PUT (replace), PATCH (partial), DELETE.
Idempotency — GET/PUT/DELETE yes; POST no; PATCH not guaranteed (depends on the patch).
Status codes — 2xx success, 3xx redirect, 4xx client error, 5xx server error.
URL design — nouns plural, nested resources for relationships, query params for filters.
Pagination — cursor-based > offset-based.
Versioning — /v1/, header, or query param. URL is most explicit.

Remember. “4xx = client’s fault. 5xx = our fault.” Always use ISO 8601 dates. HTTPS everywhere.

GraphQL

What it is. Client-specified queries to one endpoint. Solves over/under-fetching.

Key terms.

Schema + Types — strongly typed contract.
Queries — read.
Mutations — write.
Subscriptions — real-time over WebSocket.
N+1 in resolvers — fix with DataLoader for batching.
Caching is harder — single endpoint with POST. Use Apollo/urql client cache.
Security — query depth limits + complexity analysis.

Remember. GraphQL shines when frontend needs lots of related data. REST simpler for CRUD, public APIs, file uploads.

Cheatsheet — REST vs GraphQL vs gRPC vs WebSocket

Use case	Pick
Public CRUD API, browser-friendly	REST
Frontend with many related entities	GraphQL
Backend-to-backend, perf critical	gRPC
Bidirectional real-time	WebSocket
Server-only push	SSE

WebSockets

What it is. Persistent bidirectional connection. Both sides send anytime.

Key terms.

HTTP Upgrade handshake → 101 Switching Protocols.
ws:// / wss:// (TLS).
Use cases — chat, live notifications, dashboards, collab editing, gaming.
Scaling — sticky sessions OR Redis Pub/Sub backbone.
Heartbeats / ping-pong — detect dead connections.

Remember. Don’t use for occasional updates (use SSE) or simple CRUD. Each connection consumes server memory — plan for ~100K connections per server.

gRPC and Protocol Buffers

What it is. RPC framework using HTTP/2 + Protobuf binary serialization.

Key terms.

Protobuf — schema-first, binary, 3-10x smaller than JSON.
Field tags — int32 id = 1 — never reuse.
Four call types — Unary, Server streaming, Client streaming, Bidirectional streaming.
HTTP/2 multiplexing — many calls per connection.
gRPC-Web — needed for browsers (proxy).

Remember. gRPC inside services, REST outside for public APIs. Trade-off: faster + typed but harder to debug.

Polling, Long Polling, and Server-Sent Events

What it is. Server push without full WebSocket complexity.

Key terms.

Short polling — fixed interval. Wasteful.
Long polling — server holds until data ready. Server resource heavy.
SSE — text/event-stream, one-way server→client, EventSource API auto-reconnects.
Last-Event-ID — resume after disconnect.

Remember. Decision tree: real-time + bidirectional → WebSocket. Server-only push → SSE (underrated). Occasional updates → polling.

Advanced Patterns

Rate Limiting

What it is. Cap on requests per client per time window.

Key terms.

Token Bucket — refills steadily, allows bursts. Most popular.
Leaky Bucket — strict outflow rate, no bursts.
Fixed Window Counter — boundary burst issue.
Sliding Window Log — accurate but memory-heavy.
Sliding Window Counter — weighted hybrid, memory-efficient.
Headers — X-RateLimit-Limit / Remaining / Reset, 429 Too Many Requests, Retry-After.
Distributed — Redis + Lua script for atomicity.

Remember. Fail open if rate limiter is down. Layer global + per-client + per-endpoint. Token Bucket is the safe interview answer.
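
A single-process token bucket, as a sketch. The capacity, refill rate, and injectable clock are assumptions for illustration; the distributed variant mentioned above would keep this state in Redis.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `refill_rate` tokens per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        # Refill lazily based on elapsed time, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.refill_rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller would respond with 429 Too Many Requests


bucket = TokenBucket(capacity=5, refill_rate=1.0)
print([bucket.allow() for _ in range(6)])
```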

Advanced Caching Patterns

What it is. Five caching strategies + handling stampedes.

Key terms.

Cache-Aside — most common. App manages cache.
Read-Through — cache loads from DB on miss.
Write-Through — write to both, synchronously.
Write-Behind — fast cache write, async DB. Risky.
Refresh-Ahead — proactively refresh hot keys before TTL.
Cache stampede / thundering herd — fix with: locking, staggered TTL (jitter), refresh-ahead, never-expire-with-bg-refresh.

Remember. Always add jitter to TTL. Pick cache-aside for safe default. Write-behind only if data loss is acceptable.
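
Cache-aside with TTL jitter, sketched in-process. The loader callback, the 300 s base TTL, and the 10% jitter are illustrative assumptions; in practice the store would be Redis or Memcached.

```python
import random
import time


class CacheAside:
    """The app checks the cache first and loads from the source on a miss."""

    def __init__(self, load_from_db, base_ttl=300.0, jitter=0.1, clock=time.monotonic):
        self._load = load_from_db
        self.base_ttl = base_ttl
        self.jitter = jitter
        self._clock = clock
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]  # hit
        value = self._load(key)  # miss or expired: go to the slow source
        # Jitter spreads expiry times so hot keys do not all expire together.
        ttl = self.base_ttl * (1 + random.uniform(-self.jitter, self.jitter))
        self._store[key] = (value, self._clock() + ttl)
        return value

    def invalidate(self, key):
        # Call after a write so the next read reloads fresh data.
        self._store.pop(key, None)
```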

Search and Indexing

What it is. Inverted indexes for fast text search. Elasticsearch.

Key terms.

Inverted index — word → list of documents.
Elasticsearch concepts — Index, Document, Shard, Replica.
Analyzer — tokenize → normalize (lowercase, stem, remove stop words).
BM25 — relevance scoring (TF + IDF + field length).
Sync from DB — best via CDC (Debezium).

Remember. Don’t use SQL LIKE %x% for search. ES sits beside DB, not instead. Use it for full-text, autocomplete, faceted filtering, geosearch.
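
The inverted index idea in miniature. This sketch lowercases and drops a tiny, made-up stop-word list; it does no stemming or relevance scoring, both of which Elasticsearch provides.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in"}  # illustrative subset


def analyze(text):
    """Tokenize and normalize: lowercase, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


class InvertedIndex:
    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, text):
        for term in analyze(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term (AND semantics)."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


index = InvertedIndex()
index.add(1, "The quick brown fox")
index.add(2, "Quick thinking in system design")
index.add(3, "Brown bears")
print(index.search("quick"), index.search("quick brown"))
```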

Event Sourcing and CQRS

What it is. Store events not state. Separate read/write models.

Key terms.

Event store — append-only log of immutable events.
Projections — materialized read views built from events.
CQRS — Command (write, validated, normalized) vs Query (read, denormalized, fast).
Snapshots — periodic state dumps to avoid replaying everything.

Remember. Adds significant complexity. Use for audit trails (finance/healthcare/legal), complex domains, replay needs. Most CRUD apps don’t need it.

Distributed Consensus

What it is. Many nodes agreeing. Leader election. Raft.

Key terms.

Raft states — Follower → Candidate → Leader.
Term — election round.
Heartbeats — leader signal to followers.
Quorum — majority required (prevents split-brain).
Odd cluster sizes — 3, 5, 7. Tolerate (N-1)/2 failures.
Tools — etcd (k8s), ZooKeeper (older, ZAB protocol), Consul (HashiCorp).

Remember. Quorum prevents split-brain mathematically. Raft is the modern, understandable algorithm — Paxos is the OG.
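
Why odd cluster sizes are preferred follows from the majority rule. A short sketch with illustrative function names:

```python
def quorum_size(n):
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1


def tolerated_failures(n):
    """Nodes that can fail while a majority is still reachable."""
    return n - quorum_size(n)


for n in (3, 4, 5, 6, 7):
    print(f"{n} nodes: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failures")
```

A 4-node cluster tolerates the same single failure as a 3-node one, so the extra node adds cost without adding fault tolerance.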

Real System Design Questions

Design a URL Shortener (TinyURL)

Core flow. POST long URL → generate short key → return. GET short key → 302 redirect to long URL.

Key terms.

Short key generation — Hash + collision check / Auto-increment + Base62 / Pre-Generated Key Service (KGS).
KGS — wins because no collisions, no coordination, scalable.
301 vs 302 — 302 keeps analytics visibility (each click hits server).
Base62 (a-z, A-Z, 0-9) — 62^7 = 3.5 trillion URLs.

Architecture. Client → LB → App Server → Redis (hot keys) → DB. Click events → Kafka → analytics workers.

Remember. Read-heavy (100:1) → cache aggressively. Async analytics through Kafka so redirects stay fast.
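
The auto-increment + Base62 option can be sketched as follows. The alphabet order (digits, lowercase, uppercase) is an assumption; any fixed order of the 62 characters works as long as encode and decode agree.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 characters


def encode(number):
    """Convert a non-negative integer ID into a Base62 short key."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode(key):
    """Convert a Base62 short key back into the integer ID."""
    number = 0
    for char in key:
        number = number * 62 + ALPHABET.index(char)
    return number


print(encode(123456789), decode(encode(123456789)))
print(f"7-character keyspace: {62 ** 7:,}")
```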

Design a Rate Limiter

Core flow. Every request → check counter → allow or 429.

Key terms.

Algorithms — Token Bucket, Sliding Window Counter (best practical picks).
Redis + Lua script for atomic INCR + EXPIRE.
Identification — API key, user_id, IP, or combination.
Layering — global + per-client + per-endpoint.

Remember. Fail open. Use Token Bucket for bursts. Multi-region: per-region limits are usually good enough.
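
The Sliding Window Counter can be sketched as a weighted blend of the previous and current fixed windows. This is a single-process illustration with assumed limits and an injectable clock; a production version would keep the two counters in Redis and update them atomically.

```python
import time


class SlidingWindowCounter:
    """Estimates the request count over a sliding window from two fixed-window counters."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self._clock = clock
        self._current_window = int(clock() // window_seconds)
        self._current_count = 0
        self._previous_count = 0

    def allow(self):
        now = self._clock()
        window = int(now // self.window)
        if window != self._current_window:
            # Roll over; the old count only matters if it is the directly preceding window.
            is_adjacent = window == self._current_window + 1
            self._previous_count = self._current_count if is_adjacent else 0
            self._current_count = 0
            self._current_window = window
        elapsed_fraction = (now % self.window) / self.window
        # Weight the previous window by how much of it still overlaps the sliding window.
        estimated = self._previous_count * (1 - elapsed_fraction) + self._current_count
        if estimated < self.limit:
            self._current_count += 1
            return True
        return False  # caller would respond with 429
```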

Design a Chat System (WhatsApp)

Core flow. Persistent WebSocket → message → server routes to recipient (online) or push notification (offline) → persist.

Key terms.

WebSocket for real-time delivery.
Connection registry (Redis) — user_connection:{user_id} → chat-server-X.
Kafka as backbone for routing across chat servers.
Cassandra for messages — partition by conversation_id, cluster by message_id (TimeUUID).
Statuses — sent, delivered, read.
Per-conversation sequence numbers for ordering.
Group fan-out — push for small groups (≤500), pull for huge channels.
Presence — heartbeat in Redis with 60s TTL.

Remember. 5,000+ chat servers for 500M concurrent connections (at ~100K connections per server). Pub/sub bridges users on different servers. Push notifications via APNs/FCM for offline.

Design a Social Media Feed (Twitter/X)

Core flow. Tweet posted → fan out to followers’ feed caches (write-side) → user opens feed → fetch from cache + pull from celebrities → rank → return.

Key terms.

Snowflake IDs — globally unique, time-ordered.
Fan-out on write (push) — fast reads, expensive writes for celebs.
Fan-out on read (pull) — fast writes, slow reads.
Hybrid (Twitter approach) — push for normal users, pull for celebrities (>10K-50K followers).
Feed cache — Redis sorted set per user.
Ranking — recency + engagement + relationship signals.
Real-time updates — WebSocket push for active users; new-tweet banner.

Remember. Pre-compute feeds when tweets created (write expense) so reads are cache lookups (read speed). Celebrity problem requires hybrid.

Design a Video Streaming Platform (YouTube)

Core flow. Upload → S3 → Kafka → transcode workers → multiple resolutions + chunks → S3 → CDN. Watch → manifest from CDN → fetch chunks adaptively.

Key terms.

Pre-signed URL for direct upload to S3.
Transcoding — generate 360p/480p/720p/1080p/4K, chunked into 2-10s segments.
HLS (HTTP Live Streaming) — master.m3u8 lists qualities; player picks based on bandwidth.
Adaptive bitrate — quality switches mid-stream segment-by-segment.
CDN — 90%+ of traffic served from edge, not origin.
Recommendations — candidate generation → ranking → re-ranking.

Remember. Upload and stream are completely separate paths. CDN does the heavy lifting. HLS works because each segment is just a regular HTTP file (cacheable).

Design a Ride-Sharing Service (Uber)

Core flow. Driver location every 4s → Redis GEO + Kafka. Rider request → Matching Service queries nearby drivers → ETA-based pick → notify driver → ride lifecycle.

Key terms.

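Redis GEO — GEOADD to store positions, GEOSEARCH for nearby driver search (GEORADIUS on older Redis; deprecated since 6.2).
Geohashing — neighboring cells usually share a prefix; cells on either side of a boundary may not, so also check adjacent cells.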
Quadtree / H3 (Uber) — alternatives.
Matching score = ETA + driver rating + acceptance rate + fairness.
Trip state machine — matching → accepted → arriving → in_progress → completed.
Surge pricing — supply/demand per hex zone.
Sharding by city — handles 1.25M location updates/sec.
Saga + idempotency keys for payment.

Remember. Location firehose (1M+ writes/sec) is the hardest part. Shard by city. Redis for current locations, Kafka for stream, time-series DB for history.
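
Geohash encoding, sketched from the standard algorithm: bits alternate between longitude and latitude, each bit halves the remaining range, and every 5 bits become one base-32 character. The coordinates in the usage lines are arbitrary sample points.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # the first bit encodes longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash(57.64911, 10.40744, 5))
print(geohash(57.64912, 10.40745, 5))  # a point a few metres away
```

A longer hash names a smaller cell inside the shorter one, which is why a prefix match works as a cheap "same area" test.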

Design a File Storage Service (Dropbox)

Core flow. Client chunks file → hashes each chunk → uploads only new chunks → server tracks chunks per file. Sync notifications via WebSocket → other devices fetch new chunks.

Key terms.

Chunking — split into 4 MB pieces, SHA-256 hash each.
Content-addressable storage — same content → same hash → stored once.
Deduplication — Dropbox saves ~75% on storage.
Variable-length chunking (Rabin) — handles inserts better than fixed.
Versioning — keep old versions; restore via chunk hash list.
Conflict resolution — Last Writer Wins + save conflicting copy.
Magic Pocket — Dropbox’s custom block storage (left S3 for cost).

Remember. Three pillars: chunking (transfer only changes), dedup (store once), sync notifications (push changes to devices). Metadata in PostgreSQL, blocks in S3.
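
Chunking plus content-addressable storage, sketched in memory. The dict stands in for block storage, and the tiny chunk size in the usage lines is only there to keep the demo readable; the notes above use 4 MB.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, as in the notes above


class ChunkStore:
    """Stores each unique chunk once, keyed by its SHA-256 hash."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.blocks = {}  # hash -> chunk bytes

    def upload(self, data):
        """Return (manifest of chunk hashes, number of chunks actually uploaded)."""
        manifest = []
        uploaded = 0
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.blocks:  # only new content is transferred and stored
                self.blocks[digest] = chunk
                uploaded += 1
            manifest.append(digest)
        return manifest, uploaded

    def download(self, manifest):
        """Rebuild a file version from its list of chunk hashes."""
        return b"".join(self.blocks[digest] for digest in manifest)


store = ChunkStore(chunk_size=4)
v1_manifest, v1_uploaded = store.upload(b"aaaabbbbcccc")
v2_manifest, v2_uploaded = store.upload(b"aaaaXXXXcccc")  # only the middle chunk changed
print(v1_uploaded, v2_uploaded, len(store.blocks))
```

Keeping both manifests is also what makes versioning cheap: an old version is just an older list of hashes.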

Design an E-Commerce Platform (Amazon)

Core flow. Product catalog (cached heavily) → cart (Redis + DB) → checkout → reserve inventory → charge payment → confirm → fulfill.

Key terms.

Inventory — optimistic locking with version column. Reserved stock pattern (reserve at checkout, confirm at payment success, release on timeout).
Saga pattern — order → reserve → pay → confirm. Rollbacks (release reservation, refund) on failure.
Idempotency keys — non-negotiable for payments.
Search — Elasticsearch synced via CDC.
Recommendations — collaborative filtering (“bought together”), content-based, personalized.
Price snapshot — store price_at_order in order_items.

Remember. Never oversell. Optimistic locking + reserved stock + version numbers. Sale events: queue-based checkout, pre-warm caches, feature flags to disable non-essentials. Read-heavy (100:1) so cache everything, shard by user_id.
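
Optimistic locking with a version column, sketched with SQLite from the Python standard library. The table layout and function names are assumptions for illustration; the key point is the `AND version = ?` guard on the UPDATE.

```python
import sqlite3


def make_db():
    """In-memory inventory table with one product that has a single unit in stock."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory ("
        "product_id INTEGER PRIMARY KEY, stock INTEGER NOT NULL, version INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO inventory VALUES (1, 1, 0)")
    conn.commit()
    return conn


def read_snapshot(conn, product_id):
    """Return (stock, version) as seen by a buyer at read time."""
    return conn.execute(
        "SELECT stock, version FROM inventory WHERE product_id = ?", (product_id,)
    ).fetchone()


def try_reserve(conn, product_id, quantity, snapshot):
    """Reserve stock only if nobody changed the row since the snapshot was read."""
    stock, version = snapshot
    if stock < quantity:
        return False
    cursor = conn.execute(
        "UPDATE inventory SET stock = stock - ?, version = version + 1 "
        "WHERE product_id = ? AND version = ?",
        (quantity, product_id, version),
    )
    conn.commit()
    return cursor.rowcount == 1  # 0 rows updated means another buyer won the race


conn = make_db()
buyer_a = read_snapshot(conn, 1)
buyer_b = read_snapshot(conn, 1)  # both buyers see stock = 1
print(try_reserve(conn, 1, 1, buyer_a), try_reserve(conn, 1, 1, buyer_b))
```

The losing buyer would re-read the row and either retry or be told the item is sold out.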
