Monitoring, Logging, and Alerting - High-Level Design

We can design the most elegant system in the world, but if we can’t see what’s happening inside it, we’re flying blind. When something breaks at 3 AM, the question is: do we have enough information to figure out what went wrong, or are we guessing?

Monitoring, logging, and alerting are how we answer that question. In system design interviews, they’re often the final piece that shows we think about real production systems, not just architecture diagrams.

The Three Pillars of Observability

Pillar	Characteristics	Question it answers	Example tools
Metrics	Numeric measurements, aggregated over time, cheap to store	"What is happening?"	Prometheus, Datadog, CloudWatch
Logs	Discrete events, detailed context, expensive at scale	"What happened?"	ELK Stack, Loki, Splunk
Traces	Request journey across services, shows bottlenecks	"Where is it slow?"	Jaeger, Zipkin, Datadog APM

Each pillar answers a different question. We need all three for full observability.

Metrics — The Numbers

Metrics are numeric values measured over time. They tell us the health of our system at a glance.

The Four Golden Signals (from Google SRE)

Latency — How long requests take. Not just the average — we care about percentiles.
- p50 (median): Half the requests are faster than this
- p95: 95% of requests are faster — this catches the slow tail
- p99: 99% are faster — the really unlucky users
- A p50 of 50ms with a p99 of 5 seconds means most users are happy but 1% are miserable
Traffic — How many requests per second we’re handling. Helps us know when to scale.
Error rate — Percentage of requests that fail (5xx errors, timeouts). A sudden spike means something broke.
Saturation — How full our resources are. CPU at 90%? Memory almost maxed? Disk filling up? This tells us how close we are to falling over.
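
To make the percentile point concrete, here is a minimal sketch that computes p50, p95, and p99 from a list of latencies. It assumes the nearest-rank definition of a percentile (real metrics systems often estimate percentiles from histograms instead), and the sample data is made up to mirror the "p50 of 50ms, p99 of 5 seconds" case above.

```python
def percentile(samples, p):
    """Return the p-th percentile (integer p, 1..100) by the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest rank = ceil(p * n / 100), done in integer arithmetic to avoid
    # floating-point rounding surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [50] * 98 + [5000] * 2

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)

print(f"mean={mean}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

With this data the mean is 149ms, which describes nobody's experience: most users see 50ms and the unlucky ones see 5 seconds. Only the p99 exposes that tail.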

How Metrics Work

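Metrics systems use either a pull or a push model. Prometheus pulls: our application exposes metrics at an HTTP endpoint (like /metrics), and Prometheus scrapes it periodically. Push-based systems such as StatsD or CloudWatch instead have the application or an agent send metrics to the collector.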

App Server → exposes /metrics
Prometheus → scrapes every 15 seconds
Grafana   → queries Prometheus, shows dashboards
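
As a rough illustration of the application side of the pull model, the sketch below serves a /metrics endpoint in the Prometheus text exposition format using only the Python standard library. The metric name `http_requests_total`, the counter values, and the port are assumptions chosen for the example; a real service would more likely use an official Prometheus client library than hand-render the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real service would update these per request.
REQUEST_COUNTS = {("GET", "200"): 1027, ("POST", "500"): 3}

def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    """Block forever, serving /metrics for a scraper to pull from."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scraper at port 8000 would complete the loop. The application never needs to know where the collector lives, which is the main appeal of pulling.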

Logs — The Details

Logs are records of discrete events. When something breaks and metrics show a spike, logs tell us why.

Structured Logging

Plain text logs are hard to search:

[2024-03-15 14:23:01] ERROR: Payment failed for user 12345

Structured logs (JSON) are much better:

{
  "timestamp": "2024-03-15T14:23:01Z",
  "level": "error",
  "service": "payment",
  "user_id": 12345,
  "error": "card_declined",
  "amount": 49.99,
  "request_id": "abc-123"
}

Structured logs let us filter and search: “show me all errors from the payment service where amount > $100 in the last hour.”
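
One possible way to produce logs like this is a custom formatter on top of Python's standard logging module, sketched below. The `JsonFormatter` class, the `fields` convention for extra data, and the second log event (request `def-456`) are assumptions for illustration, not part of any particular logging stack. The `matches` function stands in for the query a log platform would run.

```python
import io
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": record.name,
            "message": record.getMessage(),
        }
        # Data passed as extra={"fields": {...}} becomes top-level JSON keys.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

# Write to an in-memory stream here; production would ship to an aggregator.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.error("Payment failed", extra={"fields": {
    "user_id": 12345, "error": "card_declined",
    "amount": 49.99, "request_id": "abc-123"}})
logger.error("Payment failed", extra={"fields": {
    "user_id": 777, "error": "card_declined",
    "amount": 250.00, "request_id": "def-456"}})  # hypothetical second event

def matches(line):
    """Payment-service errors where amount > 100."""
    event = json.loads(line)
    return (event.get("service") == "payment"
            and event.get("level") == "error"
            and event.get("amount", 0) > 100)

lines = stream.getvalue().splitlines()
large_failures = [line for line in lines if matches(line)]
```

Because every field is a key rather than a substring, the filter is a plain comparison. No regular expressions over free text are needed.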

Log Aggregation

With 50 servers, we can’t SSH into each one to read logs. We ship all logs to a central system:

ELK Stack — Elasticsearch (store/search) + Logstash (process) + Kibana (visualize). Self-hosted, powerful, complex.
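Loki + Grafana — Lightweight log aggregation. Loki indexes only a small set of labels rather than the full log text, and its label model mirrors Prometheus, so the two work well side by side in Grafana.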
Splunk — Enterprise log management. Expensive but very powerful.
Datadog Logs — SaaS. Easy setup, good integration with metrics and traces.

Log Levels

Use them correctly:

DEBUG — Verbose detail for development. Usually disabled in production because of volume and cost; enable it temporarily when investigating a specific issue.
INFO — Normal operations. “User signed up,” “Order placed.”
WARN — Something unexpected but not broken. “Cache miss rate high.”
ERROR — Something failed. “Payment processing failed.”
FATAL — The system is going down. “Database connection lost.”
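
A small sketch of how levels act as a filter, using Python's standard logging module. The logger name `orders` and the in-memory stream are assumptions for the example. Note that Python names its top level CRITICAL (`logging.FATAL` is an alias for it).

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.propagate = False

# A production-style threshold: INFO and above are kept, DEBUG is dropped.
logger.setLevel(logging.INFO)

logger.debug("cart contents: %s", ["sku-1"])   # suppressed at INFO level
logger.info("Order placed")
logger.warning("Cache miss rate high")
logger.error("Payment processing failed")
logger.critical("Database connection lost")

emitted = stream.getvalue().splitlines()
print(emitted)
```

Switching the threshold to `logging.DEBUG` for a short investigation would bring the suppressed line back without any code change at the call sites.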

Distributed Tracing — The Journey

In a microservices architecture, a single user request might touch 10 different services. When that request is slow, which service is the bottleneck?

Distributed tracing follows a request across all services by propagating a trace ID:

User Request (trace-id: abc-123)
  → API Gateway     [12ms]
    → Auth Service   [5ms]
    → Order Service  [350ms]  ← bottleneck!
      → Inventory DB [300ms]  ← root cause!
      → Payment Svc  [45ms]
    → Notification   [8ms]

Each step is a span. All spans with the same trace ID form a complete picture of the request’s journey. We can see that the order service is slow because the inventory database query took 300ms.
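
The span model can be sketched in a few lines. This is a simplified illustration, not how Jaeger or Zipkin store data: the `Span` class and `start_child` helper are hypothetical, calls are assumed to run sequentially, and the 12ms on the gateway is read as the gateway's own processing time. The order service is given 5ms of its own work so that its total matches the 350ms in the trace above.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_span_ids = count(1)

@dataclass
class Span:
    trace_id: str
    name: str
    self_ms: int                       # time spent in this service itself
    parent_id: Optional[int] = None
    span_id: int = field(default_factory=lambda: next(_span_ids))

def start_child(parent, name, self_ms):
    """A downstream call copies the caller's trace ID and records its parent."""
    return Span(parent.trace_id, name, self_ms, parent_id=parent.span_id)

gateway = Span("abc-123", "API Gateway", 12)
auth = start_child(gateway, "Auth Service", 5)
order = start_child(gateway, "Order Service", 5)   # assumed own work: 5ms
inventory = start_child(order, "Inventory DB", 300)
payment = start_child(order, "Payment Svc", 45)
notification = start_child(gateway, "Notification", 8)
spans = [gateway, auth, order, inventory, payment, notification]

def total_ms(span, all_spans):
    """Duration of a span: its own work plus all of its (sequential) children."""
    children = [s for s in all_spans if s.parent_id == span.span_id]
    return span.self_ms + sum(total_ms(c, all_spans) for c in children)

# The span with the most time of its own is the likely root cause.
root_cause = max(spans, key=lambda s: s.self_ms)
print(order.name, total_ms(order, spans), "ms; root cause:", root_cause.name)
```

Sorting by total duration alone would point at the gateway or the order service. Subtracting the children's time is what separates the service that looks slow from the one that is slow.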

Alerting — The Wake-Up Call

Monitoring is useless if nobody looks at the dashboards. Alerting bridges that gap — it tells us when something needs attention.

What to Alert On

Good alerts are actionable. Every alert should require a human to do something. If we get an alert and the response is “meh, it’ll fix itself,” that’s a bad alert.

Alert on symptoms, not causes:

Good: “Error rate exceeded 5% for 5 minutes” (symptom)
Bad: “CPU usage above 80%” (cause — might be normal during a deploy)

Alert on SLO breaches:

“p99 latency exceeded 500ms for the last 10 minutes”
“Availability dropped below 99.9% for this billing period”
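
The "for 5 minutes" part of a rule like the error-rate example matters as much as the threshold. Below is a minimal sketch of that evaluation, assuming one (requests, errors) sample per minute. The function name, the sample shape, and the traffic numbers are hypothetical; alerting systems implement this with their own rule languages.

```python
def should_fire(samples, threshold=0.05, for_minutes=5):
    """Fire only if the error rate exceeded the threshold in every one of the
    last `for_minutes` samples. samples: (requests, errors) per minute, oldest first."""
    if len(samples) < for_minutes:
        return False  # not enough history to say the condition was sustained
    window = samples[-for_minutes:]
    return all(
        requests > 0 and errors / requests > threshold
        for requests, errors in window
    )

# Hypothetical traffic: one bad minute versus five bad minutes in a row.
brief_spike = [(1000, 10)] * 4 + [(1000, 200)]
sustained = [(1000, 10)] + [(1000, 80)] * 5

print(should_fire(brief_spike))  # False: a single spike does not page anyone
print(should_fire(sustained))    # True: 8% errors for five straight minutes
```

Requiring the condition to hold across the whole window trades a few minutes of detection delay for far fewer pages on transient blips.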

Alert Fatigue

The worst thing we can do is alert on everything. When teams get 200 alerts a day, they start ignoring all of them. And then they miss the real one.

Rules for good alerting:

Every alert should have a clear owner
Every alert should have a runbook (what to do when it fires)
Review alerts monthly — if we keep dismissing one, delete it or fix the root cause
Use severity levels: critical (page someone at 3 AM) vs warning (check it tomorrow)

The Monitoring Stack

A typical production setup:

Tool	Purpose
Prometheus	Metrics collection and storage
Grafana	Dashboards and visualization
ELK / Loki	Log aggregation and search
Jaeger / Zipkin	Distributed tracing
PagerDuty / OpsGenie	Alert routing and on-call management
Datadog	All-in-one SaaS (metrics + logs + traces + alerts)

In System Design Interviews

When wrapping up a system design answer, mentioning monitoring shows maturity:

“We’d track p99 latency and error rate per service with Prometheus and Grafana”
“Structured JSON logs shipped to ELK for debugging”
“Distributed tracing with Jaeger so we can find bottlenecks across services”
“Alerting on SLO breaches — if p99 exceeds 500ms for 5 minutes, page the on-call”

We don’t need to go deep. Just showing we think about operational concerns sets us apart.

Key Takeaway

Metrics tell us what is happening (numbers over time), logs tell us why it happened (detailed events), and traces show us where it happened (a request’s journey across services). Together, they give us observability: the ability to understand a system’s internal state from the signals it emits. Without them, debugging production issues is guesswork.