Monitoring and Alerting - DevOps

Our app is deployed. Users are hitting it. How do we know if it’s healthy? If response times are climbing? If the disk is filling up? We can’t just SSH in and check every 5 minutes. We need monitoring — systems that watch our infrastructure and applications, collect data, and alert us when something goes wrong.

The goal is simple: know about problems before our users do.

Metric Types

Before diving into tools, let’s understand the four types of metrics we’ll work with:

Counter: a value that only goes up (it resets to zero when the process restarts). Total requests served, total errors, total bytes sent. We usually look at the rate (requests per second).
Gauge: a value that goes up and down. Current CPU usage, memory in use, active connections.
Histogram: counts observations in configurable buckets, so we can track the distribution of values. Great for response times, since we can estimate the median, 95th percentile, and 99th percentile at query time.
Summary: similar to a histogram, but the quantiles are calculated on the client side. Cheaper to query, but less flexible: precomputed quantiles cannot be meaningfully aggregated across instances.
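
To make the differences concrete, here is a minimal pure-Python sketch of the first three types. The class and method names are our own illustration and not the API of any Prometheus client library; a real client also handles labels, thread safety, and exposition.

```python
import bisect
import math


class Counter:
    """Monotonic value: it can only be incremented."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount


class Gauge:
    """Point-in-time value that can rise and fall."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Counts observations in cumulative buckets, keyed by upper bound."""

    def __init__(self, bounds):
        # The final +Inf bucket catches every observation.
        self.bounds = sorted(bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # Every bucket whose upper bound is >= value includes this observation.
        first = bisect.bisect_left(self.bounds, value)
        for i in range(first, len(self.bounds)):
            self.bucket_counts[i] += 1
        self.total += value
        self.count += 1


latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)
print(list(zip(latency.bounds, latency.bucket_counts)))
```

The buckets are cumulative, so each count answers "how many observations were at or below this bound", which is what makes percentile estimation possible later.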

Prometheus

Prometheus is one of the most widely used open-source monitoring systems. It uses a pull model: Prometheus scrapes metrics from our applications at regular intervals, rather than our apps pushing metrics to it.

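Prometheus Architecture

[Diagram: App A and App B each expose /metrics. Prometheus scrapes both and stores the samples in its TSDB, which is queried with PromQL. Grafana queries Prometheus to draw dashboards. Alertmanager sends alerts to Slack, PagerDuty, or email.]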

Our apps expose a /metrics endpoint. Prometheus scrapes it every 15-30 seconds and stores the data in its time-series database. We then query it with PromQL.
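
As a rough illustration of what a scrape returns, the sketch below renders one metric family in the Prometheus text exposition format. The helper `render_metric` and the sample values are hypothetical; in practice a client library generates this output for us, and this sketch does not escape special characters in label values.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to a numeric sample value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples.items():
        if labels:
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Made-up sample data for illustration.
body = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests.",
    {
        (("method", "GET"), ("status", "200")): 1027,
        (("method", "POST"), ("status", "500")): 3,
    },
)
print(body)
```

Each line is one time series: the metric name, its labels, and the current value. Prometheus attaches the timestamp when it scrapes.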

# Some basic PromQL examples

# Per-second HTTP request rate, averaged over the last 5 minutes
rate(http_requests_total[5m])

# 95th percentile response time, aggregated across all series
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Current resident memory of the process, in bytes
process_resident_memory_bytes
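
To build intuition for what these queries compute, here is a simplified Python sketch of the two ideas behind them: a per-second rate over counter samples, and a quantile estimated from cumulative histogram buckets by linear interpolation. The function names are our own, and this is not Prometheus's implementation: the real rate() also extrapolates to the edges of the window, for example.

```python
import math


def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples is a list of (timestamp_seconds, value) pairs, oldest first.
    A drop in value is treated as a counter reset (process restart).
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def bucket_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


# Made-up data: a counter sampled every 30 s, and latency buckets in seconds.
print(simple_rate([(0, 100), (30, 130), (60, 160)]))
print(bucket_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]))
```

Because the estimate interpolates inside a bucket, its accuracy depends on how well the bucket bounds match the latencies we care about.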

Grafana

Grafana is a visualization tool that connects to Prometheus (and many other data sources) and lets us build beautiful dashboards. We write PromQL queries, and Grafana draws the graphs.

Most teams have dashboards for: system resources (CPU, memory, disk), application metrics (request rate, error rate, latency), and business metrics (signups, orders).

The USE Method (for Infrastructure)

When debugging infrastructure problems, check these three things for every resource (CPU, memory, disk, network):

Utilization — how busy is it? (CPU at 90%)
Saturation — is work queuing up? (disk I/O queue length)
Errors — are there failures? (disk read errors)
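
The checklist above can be sketched as a small function that takes one reading per resource and reports what looks wrong. The `ResourceReading` type and the thresholds are assumptions chosen for illustration; sensible limits depend on the resource and the workload.

```python
from dataclasses import dataclass


@dataclass
class ResourceReading:
    name: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    queue_length: float  # work waiting for the resource (saturation)
    error_count: int  # failures seen in the sampling window


def use_findings(reading, max_utilization=0.9, max_queue=1.0):
    """Return a list of human-readable USE findings for one resource."""
    findings = []
    if reading.utilization > max_utilization:
        findings.append(f"{reading.name}: utilization {reading.utilization:.0%}")
    if reading.queue_length > max_queue:
        findings.append(f"{reading.name}: saturated, queue {reading.queue_length:g}")
    if reading.error_count > 0:
        findings.append(f"{reading.name}: {reading.error_count} errors")
    return findings


# Hypothetical readings for two resources.
for reading in (
    ResourceReading("disk", utilization=0.95, queue_length=4, error_count=0),
    ResourceReading("cpu", utilization=0.40, queue_length=0, error_count=0),
):
    print(reading.name, use_findings(reading))
```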

The RED Method (for Services)

For application-level monitoring, track:

Rate — requests per second
Errors — failed requests per second
Duration — how long requests take (latency)
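
As a minimal sketch, the three RED numbers can be computed from the requests seen in one time window. The `Request` type is hypothetical, and we assume here that a 5xx status counts as a failure; in a real setup these values would come from PromQL queries over counters and histograms rather than raw request records.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    duration_seconds: float


def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration for requests seen in one window."""
    if not requests:
        return {"rate": 0.0, "error_rate": 0.0, "p95_seconds": None}
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_seconds for r in requests)
    # Nearest-rank 95th percentile: the value at position ceil(0.95 * n).
    position = math.ceil(95 * len(durations) / 100)
    return {
        "rate": len(requests) / window_seconds,
        "error_rate": errors / window_seconds,
        "p95_seconds": durations[position - 1],
    }


# Made-up window: nine fast successes and one slow failure in 10 seconds.
window = [Request(200, 0.1)] * 9 + [Request(500, 2.0)]
print(red_summary(window, window_seconds=10))
```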

If we nail USE for infra and RED for services, we’ll catch most problems.

Alerting Best Practices

Setting up alerts is easy. Setting up good alerts is hard. The biggest mistake is alert fatigue — too many alerts that don’t need human action, so people start ignoring them.

Alert on symptoms, not causes. Alert on “API error rate > 5%” not “CPU > 80%”. High CPU might be fine during a batch job.
Every alert should be actionable. If there’s nothing someone can do about it, it’s not an alert — it’s a dashboard metric.
Set severity levels. Not everything needs to page someone at 3 AM. Use critical (page), warning (Slack), and info (dashboard only).
Include runbook links. Every alert should link to a doc explaining what to check and how to fix it.
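
These practices can be sketched in a few lines: a symptom-based rule on error rate, a severity that decides who gets interrupted, and a runbook link attached to the alert. Everything here is illustrative: the `Alert` type, the routing table, the alert name, and the runbook URL are assumptions, and in a Prometheus setup this logic would live in alerting rules and Alertmanager routes rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    severity: str  # "critical", "warning", or "info"
    summary: str
    runbook_url: str


# Hypothetical routing table: severity decides who gets interrupted.
ROUTES = {"critical": "pager", "warning": "slack", "info": "dashboard"}


def check_error_rate(errors, total, threshold=0.05):
    """Symptom-based rule: fire when the share of failed requests is too high."""
    if total == 0 or errors / total <= threshold:
        return None
    return Alert(
        name="HighApiErrorRate",
        severity="critical",
        summary=f"API error rate is {errors / total:.1%} (threshold {threshold:.0%})",
        # Placeholder URL; point this at the team's real runbook.
        runbook_url="https://runbooks.example.com/high-api-error-rate",
    )


def route(alert):
    """Pick a destination from the alert's severity."""
    return ROUTES[alert.severity]


alert = check_error_rate(errors=80, total=1000)
if alert is not None:
    print(route(alert), alert.summary, alert.runbook_url)
```

Note that nothing here looks at CPU: the rule fires only when users are actually seeing failures.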

In simple language, monitoring is our early warning system. Prometheus collects the data, Grafana shows us the pretty graphs, and Alertmanager wakes us up when something’s actually broken. USE for infra, RED for services: together they cover most of what we need.
