Logging & Monitoring Basics - DevOps Basics

It’s 2 AM. Our app is down. Users are tweeting about it. We have no idea what happened. No logs, no metrics, no alerts. We’re flying blind.

This is what happens when we skip logging and monitoring. They’re not glamorous, but they’re the difference between “we detected the issue and fixed it in 5 minutes” and “we found out from a user’s angry email 3 hours later.”

Logging — What Happened?

Logs are the record of what our application is doing. Every request, every error, every decision the app makes can be captured in a log. When something goes wrong, logs are the evidence we work from.

Log Levels

Not all logs are equal. We use levels to categorize severity:

debug — Verbose details for development. “User object loaded with 15 fields.” Turn off in production.
info — Normal operations. “Server started on port 3000.” “User logged in.”
warn — Something unexpected but not broken. “API response took 5s.” “Disk usage at 80%.”
error — Something failed. “Database connection refused.” “Payment API returned 500.”
fatal — The app is going down. “Out of memory.” “Unhandled exception — shutting down.”

// Node.js with Pino (fast, structured logger)
const pino = require("pino");
const logger = pino({ level: "info" }); // only info and above in prod

logger.debug("Fetching user data");        // won't show in production
logger.info("Server started on port 3000");
logger.warn({ responseTime: 5200 }, "Slow API response");
const err = new Error("Card declined");    // sample error so this snippet runs on its own
logger.error({ err, userId: 123 }, "Failed to process payment");
logger.fatal("Unhandled exception — shutting down");

Rule of thumb: set the level to info in production and to debug in development. Avoid running production at debug level by default, because it generates too much noise and eats disk space.
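To make the level threshold concrete, here is a minimal sketch of how a logger might decide what to emit. The numeric values mirror Pino's default levels, but createLogger and its sink parameter are hypothetical names used only for illustration, not Pino's API.

```javascript
// Numeric severities, matching Pino's default level values.
const LEVELS = { debug: 20, info: 30, warn: 40, error: 50, fatal: 60 };

// createLogger is a hypothetical helper, not part of any library.
// `sink` receives each emitted JSON line; it defaults to stdout.
function createLogger(minLevel, sink = (line) => console.log(line)) {
  const threshold = LEVELS[minLevel];
  if (threshold === undefined) throw new Error(`Unknown level: ${minLevel}`);
  const logger = {};
  for (const [name, value] of Object.entries(LEVELS)) {
    logger[name] = (msg, fields = {}) => {
      if (value < threshold) return; // below the threshold: drop silently
      sink(JSON.stringify({ level: name, time: Date.now(), ...fields, msg }));
    };
  }
  return logger;
}

// Production-style logger: debug is dropped, info and above are emitted.
const emitted = [];
const prodLogger = createLogger("info", (line) => emitted.push(JSON.parse(line)));
prodLogger.debug("Fetching user data");
prodLogger.info("Server started on port 3000");
prodLogger.error("Failed to process payment", { userId: 123 });
console.log(emitted.map((entry) => entry.level)); // [ 'info', 'error' ]
```

Switching the first argument to "debug" would let all three calls through, which is the behavior we want in development.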

Structured Logging (JSON Logs)

Plain text logs like "User 123 logged in at 2024-03-15" are easy to read but terrible to search. When we have millions of logs, we need structure.

// Bad — plain text (hard to parse, hard to search)
console.log("User 123 logged in from 192.168.1.1");

// Good — structured JSON (easy to filter, search, aggregate)
logger.info({
  event: "user_login",
  userId: 123,
  ip: "192.168.1.1",
  method: "oauth"
});
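// Output with Pino's defaults (pid and hostname omitted here); level 30 means info, time is epoch milliseconds:
// {"level":30,"time":1710489600000,"event":"user_login","userId":123,"ip":"192.168.1.1","method":"oauth"}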

With JSON logs, we can query things like “show me all errors where userId is 123” or “find all logins from this IP in the last hour.” Tools like Elasticsearch or CloudWatch index these fields, so such queries stay fast even across millions of entries.
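As a rough illustration of why structure helps, the sketch below filters newline-delimited JSON logs by field values in plain Node.js. The sample log lines and the helper names parseLogs and queryLogs are made up for this example; a real log store would do the same kind of matching against an index.

```javascript
// Newline-delimited JSON, as a structured logger would write it.
// These sample lines are invented for illustration.
const rawLogs = [
  '{"level":"info","event":"user_login","userId":123,"ip":"192.168.1.1"}',
  '{"level":"error","event":"payment_failed","userId":123,"status":500}',
  '{"level":"error","event":"payment_failed","userId":456,"status":500}',
  '{"level":"info","event":"user_login","userId":456,"ip":"192.168.1.1"}',
].join("\n");

// Parse each line into an object; skip blank or malformed lines.
function parseLogs(text) {
  const entries = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      entries.push(JSON.parse(line));
    } catch {
      // not valid JSON: ignore this line
    }
  }
  return entries;
}

// Return entries whose fields equal every key/value pair in `criteria`.
function queryLogs(entries, criteria) {
  return entries.filter((entry) =>
    Object.entries(criteria).every(([key, value]) => entry[key] === value)
  );
}

const entries = parseLogs(rawLogs);
const errorsForUser = queryLogs(entries, { level: "error", userId: 123 });
const loginsFromIp = queryLogs(entries, { event: "user_login", ip: "192.168.1.1" });
console.log(errorsForUser.length, loginsFromIp.length); // 1 2
```

Doing the same with the plain-text version would need a regular expression per message format, which is why structured fields pay off once log volume grows.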

Centralized Logging

When our app runs on multiple servers or containers, logs are scattered everywhere. Centralized logging collects them in one place.

ELK Stack — Elasticsearch (stores/indexes logs) + Logstash (processes logs) + Kibana (visual dashboard). Self-hosted, powerful, but heavy to run.
CloudWatch — AWS’s built-in logging. Little or no setup for many AWS services.
Datadog — SaaS logging + monitoring + alerting. Easy to set up, expensive at scale.
Grafana + Loki — Lightweight alternative to ELK. Loki stores logs, Grafana visualizes.

The pattern is always the same: app writes logs → a collector ships them → a store indexes them → a dashboard lets us search and visualize.
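The pattern above can be sketched in memory to show where each responsibility sits. This is a toy model under stated assumptions, not how any of the listed tools are implemented: LogStore, collect, and the server names web-1 and web-2 are all hypothetical.

```javascript
// Store: keeps every entry plus an index from "field=value" to entries.
class LogStore {
  constructor(indexedFields) {
    this.indexedFields = indexedFields;
    this.index = new Map();
    this.all = [];
  }

  add(entry) {
    this.all.push(entry);
    for (const field of this.indexedFields) {
      if (!(field in entry)) continue;
      const key = `${field}=${entry[field]}`;
      if (!this.index.has(key)) this.index.set(key, []);
      this.index.get(key).push(entry);
    }
  }

  // Lookup by an indexed field; returns an empty array when nothing matches.
  search(field, value) {
    return this.index.get(`${field}=${value}`) ?? [];
  }
}

// Collector: tags each line with the server it came from and ships it on.
function collect(store, source, lines) {
  for (const line of lines) {
    store.add({ ...JSON.parse(line), source });
  }
}

const store = new LogStore(["level", "source"]);
collect(store, "web-1", ['{"level":"info","msg":"Server started"}']);
collect(store, "web-2", [
  '{"level":"error","msg":"Database connection refused"}',
  '{"level":"info","msg":"User logged in"}',
]);

// "Dashboard" query: all errors, regardless of which server wrote them.
const errors = store.search("level", "error");
console.log(errors.map((e) => `${e.source}: ${e.msg}`));
```

The point of the sketch is the separation: the app only writes lines, the collector adds context such as the source, and the store decides which fields are cheap to search.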

Monitoring — Is It Healthy?

Logging tells us what happened. Monitoring tells us what’s happening right now. It continuously tracks the health and performance of our system.

Key Metrics to Watch

Response time — How fast our API responds. Track p50 (median), p95 (95th percentile), and p99 (99th percentile). If p99 is 10 seconds, 1% of requests take 10 seconds or longer, so some users are having a terrible experience.
Error rate — What percentage of requests are failing (5xx responses). Normal is near 0%. Above 1% is a red flag.
Uptime — Is the service reachable? 99.9% uptime still allows about 8.76 hours of downtime per year.
CPU / Memory — Resource utilization. Spiking CPU might mean a runaway process or traffic surge.
Request rate — Requests per second (RPS). A sudden spike could mean a DDoS attack or a viral feature.
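To show how a few of these numbers are derived, here is a small sketch that computes percentiles, error rate, and the yearly downtime an uptime target allows. It uses the nearest-rank percentile definition, which is one common choice among several; the latency samples and function names are invented for the example.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("No samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Share of requests that ended in a 5xx status, as a percentage.
function errorRate(statusCodes) {
  const failures = statusCodes.filter((code) => code >= 500).length;
  return (failures / statusCodes.length) * 100;
}

// Hours of downtime per 365-day year allowed by an uptime percentage.
function downtimeHoursPerYear(uptimePercent) {
  return (1 - uptimePercent / 100) * 365 * 24;
}

// Made-up latencies in milliseconds: mostly fast, with one slow outlier.
const latencies = [120, 95, 110, 130, 105, 90, 115, 100, 125, 8000];
console.log(percentile(latencies, 50)); // 110
console.log(percentile(latencies, 99)); // 8000
console.log(errorRate([200, 200, 500, 200])); // 25
console.log(downtimeHoursPerYear(99.9).toFixed(2)); // 8.76
```

Notice that the median looks healthy while p99 exposes the outlier, which is why averages and medians alone can hide a bad experience for a minority of requests.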

Health Check Endpoints

Every production app should expose a health check endpoint. Load balancers and monitoring tools hit this endpoint to know if the app is alive.

// Basic health check — is the server running?
app.get("/health", (req, res) => {
  res.status(200).json({ status: "ok" });
});

// Deep health check — are dependencies healthy too?
app.get("/ready", async (req, res) => {
  try {
    await db.query("SELECT 1");               // database alive?
    await redis.ping();                        // cache alive?
    res.status(200).json({
      status: "ok",
      db: "connected",
      redis: "connected",
      uptime: process.uptime()               // seconds since start
    });
  } catch (err) {
    res.status(503).json({                    // 503 = Service Unavailable
      status: "degraded",
      error: err.message
    });
  }
});

The /health endpoint is simple — is the process running? The /ready endpoint is deeper — can the app actually serve requests? Load balancers use /ready to decide whether to route traffic to this instance.

Monitoring vs Logging vs Alerting

Logging: What happened? Example: "User 123 got a 500 error at 2:03 AM"
Monitoring: Is it healthy right now? Example: "Error rate is at 5%, p99 latency is 8s"
Alerting: Tell me when it breaks! Example: "Slack: error rate > 2% for 5 minutes"

They work together. Monitoring detects the problem (“error rate spiked”). Alerting notifies us (“Slack message at 2 AM”). Logging helps us diagnose it (“here’s the stack trace from 2:03 AM”).
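An alert rule such as "error rate > 2% for 5 minutes" can be sketched as a check over recent per-minute samples. This is only an illustration of the idea; shouldAlert and the sample data are hypothetical, and real alerting tools express the same rule in their own configuration.

```javascript
// Fire only when the error rate has stayed above the threshold for the
// whole window, so a single noisy minute does not page anyone.
// `errorRatesPerMinute` is ordered oldest to newest, one value per minute.
function shouldAlert(errorRatesPerMinute, thresholdPercent, windowMinutes) {
  if (errorRatesPerMinute.length < windowMinutes) return false; // not enough data yet
  const recent = errorRatesPerMinute.slice(-windowMinutes);
  return recent.every((rate) => rate > thresholdPercent);
}

const briefSpike = [0.1, 0.2, 6.0, 0.3, 0.2, 0.1];
const sustained = [0.1, 2.5, 3.1, 4.0, 5.2, 5.0];
console.log(shouldAlert(briefSpike, 2, 5)); // false
console.log(shouldAlert(sustained, 2, 5)); // true
```

Requiring the condition to hold for the full window trades a few minutes of detection delay for far fewer false alarms.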

Rate Limiting

Rate limiting protects our API from abuse — whether it’s a misbehaving client, a bot, or a DDoS attack. It restricts how many requests a client can make in a given time window.

How It Works

Two common algorithms:

Token Bucket — Imagine a bucket that holds 100 tokens. Each request takes one token. The bucket refills at a steady rate (say, 10 tokens per second). If the bucket is empty, the request is rejected. This allows short bursts while enforcing an average rate.

Sliding Window — Count requests in a moving time window. “Max 100 requests per minute.” If a client has made 100 requests in the last 60 seconds, the next one is rejected.
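Both algorithms fit in a few lines. The sketch below is a minimal single-process version, assuming one limiter object per client; the class and method names are made up for illustration, and timestamps are passed in explicitly so the demo is deterministic.

```javascript
// Token bucket: holds up to `capacity` tokens and refills continuously.
class TokenBucket {
  constructor(capacity, refillPerSecond, now = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full, so an initial burst is allowed
    this.lastRefill = now; // timestamp in milliseconds
  }

  tryRemove(now = Date.now()) {
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = now;
    if (this.tokens < 1) return false; // bucket empty: reject
    this.tokens -= 1;
    return true;
  }
}

// Sliding window: remembers the timestamps of recently accepted requests.
class SlidingWindow {
  constructor(maxRequests, windowMs) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.timestamps = [];
  }

  tryRequest(now = Date.now()) {
    // Drop timestamps that have moved out of the window.
    while (this.timestamps.length > 0 && this.timestamps[0] <= now - this.windowMs) {
      this.timestamps.shift();
    }
    if (this.timestamps.length >= this.maxRequests) return false;
    this.timestamps.push(now);
    return true;
  }
}

const bucket = new TokenBucket(3, 1, 0); // 3 tokens, refills 1 per second
console.log([0, 0, 0, 0].map((t) => bucket.tryRemove(t))); // [ true, true, true, false ]
console.log(bucket.tryRemove(1000)); // true (one token refilled after 1s)

const windowLimiter = new SlidingWindow(2, 60000); // max 2 requests per minute
console.log(windowLimiter.tryRequest(0)); // true
console.log(windowLimiter.tryRequest(1000)); // true
console.log(windowLimiter.tryRequest(2000)); // false
console.log(windowLimiter.tryRequest(60000)); // true (the request at t=0 has expired)
```

The token bucket stores only two numbers per client, while this sliding window stores one timestamp per accepted request, which is worth considering when limits are high.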

When rate limited, the server responds with:

HTTP/1.1 429 Too Many Requests
Retry-After: 30          # try again in 30 seconds
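On the client side, a well-behaved caller reads Retry-After before trying again. The HTTP specification allows the header to carry either a number of seconds or an HTTP date, so this sketch handles both; retryAfterToMs is a hypothetical helper name, and returning null for unparseable values is an assumption that leaves the fallback delay to the caller.

```javascript
// Convert a Retry-After header value into a wait time in milliseconds.
// Returns null when the header is missing or cannot be parsed.
function retryAfterToMs(headerValue, now = Date.now()) {
  if (headerValue == null) return null;
  const trimmed = String(headerValue).trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000; // delay in seconds
  const date = Date.parse(trimmed); // otherwise try an HTTP date
  if (Number.isNaN(date)) return null;
  return Math.max(0, date - now); // never wait a negative amount
}

console.log(retryAfterToMs("30")); // 30000

const now = Date.parse("Fri, 15 Mar 2024 08:00:00 GMT");
console.log(retryAfterToMs("Fri, 15 Mar 2024 08:00:30 GMT", now)); // 30000
console.log(retryAfterToMs("Fri, 15 Mar 2024 07:59:00 GMT", now)); // 0
```

A client would typically pass the result to setTimeout before retrying, and fall back to its own backoff when the function returns null.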

// Simple rate limiting with express-rate-limit
const rateLimit = require("express-rate-limit");

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,     // 15-minute window
  max: 100,                      // max 100 requests per window per IP
  message: {
    error: "Too many requests, please try again later",
    retryAfter: 900              // window length in seconds (the longest a client may have to wait)
  }
});

app.use("/api/", limiter);       // apply to all /api/ routes

Rate limiting is usually applied per IP address or per API key. Public APIs almost always have rate limits — GitHub’s API allows 5000 requests per hour with auth, 60 without.

Putting It All Together

A production-ready app needs all three: logging to record events, monitoring to track health, and alerting to wake us up when something breaks. Start simple — structured JSON logs, a /health endpoint, and basic rate limiting. We can add more sophisticated tools as the app grows.

In simple language, logging records what happened, monitoring watches what’s happening right now, and alerting tells us when something goes wrong — together they make sure we find problems before our users do.