Design Principles and Trade-offs


System design is all about trade-offs. There’s no perfect system — every choice we make comes with a cost. The key is understanding what we’re gaining and what we’re giving up with each decision.

Key Design Principles

Scalability

The ability to handle more load by adding resources. Two flavors:

  • Vertical scaling (scale up): Bigger machine. More RAM, faster CPU. Simple but has a ceiling.
  • Horizontal scaling (scale out): More machines. Harder to implement, but capacity keeps growing as we add nodes instead of hitting one machine’s ceiling.

Think of it like a restaurant. Vertical scaling = get a bigger kitchen. Horizontal scaling = open more locations.

Availability

The system is up and working when users need it. Measured in “nines”:

Availability         Downtime/Year    Downtime/Month
99% (two 9s)         3.65 days        7.3 hours
99.9% (three 9s)     8.76 hours       43.8 min
99.99% (four 9s)     52.6 min         4.38 min
99.999% (five 9s)    5.26 min         26.3 sec
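
The downtime figures follow directly from the availability percentage. Here is a minimal sketch of the arithmetic, assuming a 365-day year and a month of one twelfth of a year; the function name is ours, chosen for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600        # assumes a 365-day year
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12  # assumes an "average" month


def downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Maximum downtime allowed in a period at the given availability."""
    return (1 - availability_percent / 100) * period_seconds


for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_seconds(pct, SECONDS_PER_YEAR)
    per_month = downtime_seconds(pct, SECONDS_PER_MONTH)
    print(f"{pct}%: {per_year / 3600:.2f} h/year, {per_month / 60:.2f} min/month")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why each one tends to cost far more than the last.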

Reliability

The system does what it’s supposed to do correctly. A system can be available (it’s responding) but unreliable (it’s giving wrong answers). We need both.

Performance

How fast the system responds. Two key metrics:

  • Latency: Time to handle a single request, usually reported as percentiles (p50, p95, p99) rather than an average
  • Throughput: How many requests we handle per second
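
To make the two metrics concrete, here is a small sketch that computes latency percentiles and throughput from a list of measurements. The sample latencies and the 2-second window are made up for illustration, and the nearest-rank method is just one common way to define a percentile.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical measurements: most requests are fast, two are very slow.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 900]
window_seconds = 2.0  # assumed: all ten requests completed within this window

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
throughput_rps = len(latencies_ms) / window_seconds

print(f"p50={p50} ms, p95={p95} ms, p99={p99} ms, throughput={throughput_rps} req/s")
```

The median here is 14 ms while p95 and p99 are 900 ms, which is why tail percentiles matter: the typical request looks fine even when some users wait almost a second.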

Maintainability

Can other engineers (or future us) understand and modify this system? Simple designs beat clever ones.

The Big Trade-offs

Common Trade-off Spectrums
Consistency ◄━━━━━━━━━━━► Availability
  Banking, inventory               Social media, DNS
Low Latency ◄━━━━━━━━━━━► High Throughput
  Gaming, trading                Batch processing, analytics
Simplicity ◄━━━━━━━━━━━► Performance
  Monolith, single DB            Microservices, sharding
Cost ◄━━━━━━━━━━━━━━━► Performance
  Single region                  Multi-region, redundancy

Consistency vs Availability

This is the famous CAP theorem in disguise. In simple language: when a network partition cuts some nodes off from the others, we have to choose. Do we answer with potentially stale data (availability), or do we refuse and tell users “try again later” (consistency)?

  • Bank account: Must be consistent. We can’t show the wrong balance.
  • Social media feed: Availability wins. If a like takes 2 seconds to show up globally, nobody cares.
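
The choice above can be sketched as a toy replica that behaves differently during a partition. This is only an illustration under simplifying assumptions (one replica, one value, a boolean partition flag); the class and the "CP"/"AP" mode names are ours, not a real database API.

```python
class Unavailable(Exception):
    """Raised when a consistency-first replica refuses to answer."""


class Replica:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True while cut off from the primary

    def replicate(self, value):
        """Apply an update from the primary; updates are lost while partitioned."""
        if not self.partitioned:
            self.value = value

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than risk a stale answer.
            raise Unavailable("cannot confirm latest value, try again later")
        return self.value  # availability first: may be stale


bank = Replica("CP")
feed = Replica("AP")
for replica in (bank, feed):
    replica.replicate(100)
    replica.partitioned = True
    replica.replicate(80)  # never arrives because of the partition

print("feed reads:", feed.read())  # still answers, with the old value
try:
    bank.read()
except Unavailable as err:
    print("bank refuses:", err)
```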

Latency vs Throughput

We can make one request super fast (low latency) or handle a massive number of requests (high throughput), but optimizing for one often hurts the other. Batching requests improves throughput but increases latency for individual requests.
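
A back-of-the-envelope model shows why batching pulls the two metrics apart. The sketch below assumes each batch pays a fixed overhead (say, one network round trip or one disk flush) plus a per-item cost; both numbers are made up for illustration.

```python
def batch_stats(batch_size, overhead_ms=10.0, per_item_ms=1.0):
    """Return (time to finish one batch in ms, throughput in requests/second)."""
    batch_time_ms = overhead_ms + batch_size * per_item_ms
    throughput_rps = batch_size / batch_time_ms * 1000
    return batch_time_ms, throughput_rps


for size in (1, 10, 100):
    batch_time_ms, throughput_rps = batch_stats(size)
    print(f"batch={size:>3}: {batch_time_ms:.0f} ms per batch, {throughput_rps:.0f} req/s")
```

With these assumed costs, going from a batch size of 1 to 100 raises throughput from about 91 to about 909 requests per second, while the time to finish a batch grows from 11 ms to 110 ms. The model ignores the time a request spends waiting for its batch to fill, which would push latency even higher.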

Single Point of Failure (SPOF)

A SPOF is any component whose failure takes down the entire system. Every part of our design should have a backup plan:

  • One server? Add another behind a load balancer.
  • One database? Add a replica.
  • One data center? Deploy across multiple regions.
  • One load balancer? Use active-passive failover.

The rule is simple: if it can fail, assume it will. Then plan for it.
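
We can put rough numbers on what redundancy buys us. The sketch below assumes components fail independently, which is optimistic in practice (a bad deploy or a shared power supply can take out several copies at once), so treat the results as an upper bound. The function names are ours.

```python
import math


def parallel_availability(single, copies):
    """Availability of redundant copies where any one copy is enough."""
    return 1 - (1 - single) ** copies


def serial_availability(components):
    """Availability of a chain where every component must be up."""
    return math.prod(components)


# One 99% server vs. two or three behind a load balancer.
for copies in (1, 2, 3):
    print(copies, "copies:", round(parallel_availability(0.99, copies), 6))

# A request that must pass through two 99% components in a row.
print("chain of two:", round(serial_availability([0.99, 0.99]), 4))
```

Two 99% servers in parallel reach roughly 99.99% under this model, while two 99% components in series drop to about 98%. Redundancy helps, and every extra required hop hurts.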

Stateless vs Stateful Services

Stateful services keep information about the current session (like “this user is logged in”) in the server’s own memory. If that server goes down, the state is lost, and a request routed to a different server won’t find it.

Stateless services don’t remember anything between requests. Every request carries all the information needed (like a JWT). Any server can handle any request.

Stateless services are much easier to scale: we add more servers behind a load balancer and any of them can take the next request. That’s why we push state to external stores (databases, Redis, session stores) and keep our application servers stateless.
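
Here is a minimal sketch of the "request carries everything" idea, using an HMAC-signed token built from the Python standard library. It is a simplified stand-in for a JWT, not a real JWT implementation, and the secret and function names are hypothetical. The point is that any server holding the shared key can verify the token without looking up session state.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # hypothetical shared key; load from configuration in practice


def issue_token(claims: dict) -> str:
    """Encode the claims and append a signature over the encoded payload."""
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def verify_token(token: str):
    """Return the claims if the signature checks out, otherwise None."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


# "Server A" issues the token at login; "server B" verifies it later
# using only the shared key, with no session lookup.
token = issue_token({"user": "alice"})
print(verify_token(token))
print(verify_token(token + "0"))  # tampered signature is rejected
```

A production token would also carry an expiry time and use a vetted library; this sketch only shows why no server needs to remember the session.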

Key Takeaway

In simple language, there’s no “best” architecture. There’s only the right architecture for our specific requirements. A system design interview is our chance to show we understand these trade-offs and can make informed decisions — not just draw boxes on a whiteboard.