When our app was a single server, we’d SSH in and tail -f /var/log/app.log. Easy. But now we’ve got 20 containers across 5 machines, some scaling up and down automatically. Where do the logs go? How do we find the one error message that explains why a user’s request failed?
We need centralized logging — all logs from all services flowing into one searchable place.
Structured Logging
The first step is making our logs machine-readable. Unstructured logs look like this:
[2024-03-15 14:23:01] ERROR - Failed to process order #4521 for user john@example.com
Good for humans, terrible for searching and filtering. Structured logs use JSON with consistent fields:
{
"timestamp": "2024-03-15T14:23:01Z",
"level": "error",
"message": "Failed to process order",
"service": "order-service",
"order_id": 4521,
"user_email": "john@example.com",
"error": "payment_declined",
"correlation_id": "abc-123-def"
}
Now we can filter by level: error, search by order_id, or group by service. Every log entry follows the same shape, which makes aggregation tools much more useful.
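To make this concrete, here is a minimal sketch of emitting logs in that shape with Python's standard logging module. The `JsonFormatter` class, the `build_logger` helper, and the convention of passing extra fields under a `fields` key are assumptions of this example, not part of any particular library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Extra fields arrive via extra={"fields": {...}} (a convention of this sketch).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr if None)."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger
```

A call such as `build_logger("order-service").error("Failed to process order", extra={"fields": {"order_id": 4521, "error": "payment_declined"}})` would then produce one JSON line carrying the same fields as the example above.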
Log Levels
Use them consistently across all services:
- DEBUG — verbose detail for development. Never in production unless we’re actively debugging.
- INFO — normal operations. “Server started”, “Order created”, “User logged in.”
- WARN — something unexpected but the system handled it. “Retry succeeded after 2 attempts.”
- ERROR — something failed and needs attention. “Database connection lost”, “Payment API returned 500.”
- FATAL — the process is crashing. “Out of memory”, “Cannot bind to port.”
A good rule: production should run at INFO level, and we should be able to flip to DEBUG without restarting (via config or env var).
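One way to wire this up in Python is sketched below. The `LOG_LEVEL` variable name and the `resolve_level` / `apply_level` helpers are assumptions for illustration. The idea is to call `apply_level` at startup and again whenever configuration is reloaded.

```python
import logging
import os

VALID_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}
# Map the level names used in this article onto Python's built-in names.
ALIASES = {"WARN": "WARNING", "FATAL": "CRITICAL"}


def resolve_level(name, default="INFO"):
    """Turn a level name into a logging constant, falling back to `default`."""
    name = (name or default).strip().upper()
    name = ALIASES.get(name, name)
    if name not in VALID_LEVELS:
        name = default
    return getattr(logging, name)


def apply_level(logger, environ=os.environ):
    """Read LOG_LEVEL (hypothetical variable name) and apply it to `logger`."""
    level = resolve_level(environ.get("LOG_LEVEL"))
    logger.setLevel(level)
    return level
```

Since the environment of a running process cannot usually be changed from outside, flipping the level without a restart would in practice mean passing a freshly read config mapping as `environ`, for example from a signal handler or an admin endpoint.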
The ELK Stack
ELK is one of the most widely used log aggregation setups. It stands for:
- Elasticsearch — a search engine that stores and indexes logs. We can query logs by any indexed field, and results usually come back in well under a second.
- Logstash — collects logs from various sources, parses them, transforms fields, and ships them to Elasticsearch.
- Kibana — the web UI. We search logs, build dashboards, and set up alerts.
A popular alternative is EFK — replacing Logstash with Fluentd. Fluentd is lighter, more cloud-native, and widely used in Kubernetes environments. Same idea, different collector.
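As a rough illustration of what "query by any field" looks like, the sketch below builds an Elasticsearch Query DSL body as a plain Python dict. The `build_log_query` helper is hypothetical, and it assumes the filtered fields are mapped as keyword (or numeric) types and that `timestamp` is mapped as a date. Sending the body to a cluster is left out.

```python
def build_log_query(filters, since="now-15m", size=50):
    """Build a search body that matches exact field values within a time window."""
    # One exact-match clause per field, e.g. {"term": {"level": "error"}}.
    clauses = [{"term": {field: value}} for field, value in filters.items()]
    # Restrict to recent entries using Elasticsearch date math.
    clauses.append({"range": {"timestamp": {"gte": since}}})
    return {
        "size": size,
        "sort": [{"timestamp": {"order": "desc"}}],
        "query": {"bool": {"filter": clauses}},
    }


# Recent errors from one service, newest first.
query = build_log_query({"level": "error", "service": "order-service"})
```

Filter clauses are used here rather than scored matches because log searches usually want exact, yes-or-no matching rather than relevance ranking.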
Correlation IDs
In a microservices world, a single user request might hit 5 different services. When something fails, how do we trace the request through all of them?
We generate a unique correlation ID (also called trace ID) at the entry point and pass it through every service call. Every log line includes this ID.
{"correlation_id": "abc-123", "service": "api-gateway", "message": "Received order request"}
{"correlation_id": "abc-123", "service": "order-service", "message": "Creating order"}
{"correlation_id": "abc-123", "service": "payment-service", "message": "Charging card"}
{"correlation_id": "abc-123", "service": "payment-service", "level": "error", "message": "Card declined"}
Now we search for correlation_id: abc-123 in Kibana and see the entire journey. This is the foundation of distributed tracing — tools like Jaeger and Zipkin take this further with visual timeline views.
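The propagation itself can be sketched in a few lines of Python. This is a minimal illustration, assuming the ID travels in an `X-Correlation-ID` HTTP header (a common convention, but an assumption here). The helper names are hypothetical.

```python
import contextvars
import json
import uuid

HEADER = "X-Correlation-ID"  # assumed header name

# Holds the ID for the request currently being handled.
_correlation_id = contextvars.ContextVar("correlation_id", default=None)


def start_request(headers):
    """Reuse the caller's ID if present, otherwise generate one at the entry point."""
    cid = headers.get(HEADER) or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid


def outgoing_headers():
    """Headers to attach to every downstream service call."""
    cid = _correlation_id.get()
    return {HEADER: cid} if cid else {}


def log_line(service, message, level="info"):
    """Format a JSON log line that always carries the current correlation ID."""
    return json.dumps({
        "correlation_id": _correlation_id.get(),
        "service": service,
        "level": level,
        "message": message,
    })
```

A context variable is used instead of a global so that concurrent requests handled in different threads or async tasks do not overwrite each other's ID.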
Log Retention and Rotation
Logs grow fast. We need a plan:
- Rotation — tools like logrotate on Linux automatically compress and archive old log files. Docker also supports log rotation via its logging drivers.
- Retention — keep hot logs (last 7-30 days) in Elasticsearch for fast searching. Move older logs to cold storage (S3, Glacier) for compliance. Delete after the retention period.
- Index lifecycle management — Elasticsearch has built-in ILM policies that automatically move, shrink, and delete indices based on age.
# Docker log rotation, set per container here (the same options can go under "log-opts" in daemon.json as defaults)
docker run -d \
  --log-opt max-size=50m \
  --log-opt max-file=3 \
  my-app:latest
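The retention rule from the list above boils down to a small decision on log age. The sketch below is only an illustration: the `storage_tier` helper is hypothetical, and the 30-day hot window and 365-day retention period are assumed values that would come from our own compliance requirements.

```python
from datetime import datetime, timedelta, timezone


def storage_tier(log_time, now, hot_days=30, retention_days=365):
    """Decide where a log entry belongs based on its age."""
    age = now - log_time
    if age <= timedelta(days=hot_days):
        return "hot"      # keep in Elasticsearch for fast searching
    if age <= timedelta(days=retention_days):
        return "cold"     # move to cheaper storage such as S3 or Glacier
    return "delete"       # past the retention period


now = datetime(2024, 3, 15, tzinfo=timezone.utc)
tier = storage_tier(datetime(2024, 3, 1, tzinfo=timezone.utc), now)
```

In an Elasticsearch deployment this logic would normally live in an ILM policy rather than in application code. The function just makes the thresholds explicit.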
In short, centralized logging is about getting all our logs into one searchable place. Structure them as JSON, slap a correlation ID on every request, pipe everything through ELK (or EFK), and we’ll be able to trace a failing request across any number of services.