Two ops-y topics that come up in basically every Elasticsearch interview: “what does cluster status mean?” and “how do you back up an Elasticsearch cluster?”. Both are simpler than they sound.
Cluster health: green, yellow, red
GET /_cluster/health returns a status. In simple language:
- Green: all primaries and all replicas are assigned. Cluster is happy.
- Yellow: all primaries are assigned, but some replicas are not. Reads and writes still work; HA is degraded.
- Red: at least one primary shard is unassigned. That data is unavailable. Action required.
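The three definitions reduce to a small decision rule. The Python sketch below restates it under a simplifying assumption: it only looks at unassigned shard counts. The function name and its two counters are made up for illustration, and in practice you would read status straight from the GET /_cluster/health response rather than recompute it.

```python
def health_color(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Restate the green/yellow/red rule from shard-assignment counts."""
    if unassigned_primaries > 0:
        return "red"      # some data cannot be read or written
    if unassigned_replicas > 0:
        return "yellow"   # all data is reachable, but redundancy is reduced
    return "green"        # every primary and replica is assigned


# A single-node dev cluster: one index with 1 primary and 1 replica.
print(health_color(unassigned_primaries=0, unassigned_replicas=1))  # yellow
```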
Common causes by color
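- Yellow on a single-node dev cluster is normal: a replica is never allocated to the same node as its primary, so replicas stay unassigned.
- Yellow on a multi-node cluster after a node leaves: replicas must be rebuilt on the remaining nodes. By default Elasticsearch waits one minute before starting, in case the node comes back. Wait for recovery to finish, or allocate the replicas manually with the allocate_replica command of POST /_cluster/reroute.
- Red after a disk fills up: a shard that runs out of space mid-write can fail, and primaries of new indices cannot be allocated to nodes above the high watermark. Free disk or expand storage.
- Red after a corrupted shard: restore the index from a snapshot.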
Drilling deeper
GET /_cluster/health?level=indices
GET /_cluster/allocation/explain
allocation/explain is the single best command for “why is this shard not where I want it?”. It tells you exactly which allocation decider blocked the assignment (disk watermark, awareness rules, filtering, total_shards_per_node, etc.).
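Called without a body, the API explains an arbitrary unassigned shard; to target one shard, send index, shard, and primary in the request body. The response carries one decision per node. As a rough sketch, the helper below pulls out the deciders that answered NO. The sample response is hand-written and heavily trimmed for illustration, not captured from a real cluster, and the helper name is hypothetical.

```python
def blocking_deciders(explain_response: dict) -> dict:
    """Map each node name to the deciders that returned NO for the shard."""
    blocked = {}
    for node in explain_response.get("node_allocation_decisions", []):
        reasons = [
            f'{d["decider"]}: {d["explanation"]}'
            for d in node.get("deciders", [])
            if d.get("decision") == "NO"
        ]
        if reasons:
            blocked[node["node_name"]] = reasons
    return blocked


# Hand-written, trimmed sample shaped like an allocation/explain response.
sample = {
    "index": "products",
    "shard": 0,
    "primary": False,
    "current_state": "unassigned",
    "node_allocation_decisions": [
        {
            "node_name": "node-1",
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "same_shard",
                    "decision": "NO",
                    "explanation": "a copy of this shard is already allocated to this node",
                }
            ],
        },
        {
            "node_name": "node-2",
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "disk_threshold",
                    "decision": "NO",
                    "explanation": "the node is above the low watermark",
                }
            ],
        },
    ],
}

for node, reasons in blocking_deciders(sample).items():
    print(node, "->", reasons)
```

Reading the output per node makes it obvious whether one fix (freeing disk, say) unblocks every candidate node or only some of them.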
Disk watermarks (a frequent red-cluster culprit)
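- Low watermark (default 85%): Elasticsearch stops allocating new shards to that node. Primaries of newly created indices are exempt.
- High watermark (default 90%): Elasticsearch tries to move shards off the node.
- Flood stage (default 95%): every index with a shard on that node gets a read-only block (index.blocks.read_only_allow_delete). Writes are rejected, but the block does not change the health color by itself; the status turns red only if primaries end up unassigned.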
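When this happens, free disk. Since 7.4, Elasticsearch lifts the block automatically once usage drops below the high watermark; on older versions, remove it yourself: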
PUT /_all/_settings
{ "index.blocks.read_only_allow_delete": null }
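To make the three thresholds above concrete, here is a small sketch that maps a node's disk usage to the stage it would be in. It hard-codes the default percentages; real clusters may override them through the cluster.routing.allocation.disk.watermark.* settings. The function name is made up for this example, and treating a value exactly at a threshold as "crossed" is a simplification.

```python
# Default thresholds, as percent of disk used.
LOW, HIGH, FLOOD = 85.0, 90.0, 95.0


def watermark_stage(used_percent: float) -> str:
    """Return which disk watermark, if any, a node has crossed."""
    if used_percent >= FLOOD:
        return "flood_stage"   # indices with shards here get the read-only block
    if used_percent >= HIGH:
        return "high"          # Elasticsearch tries to relocate shards away
    if used_percent >= LOW:
        return "low"           # no new shards are allocated to this node
    return "ok"


for used in (70, 87, 92, 96):
    print(used, watermark_stage(used))
```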
Snapshots — the only real backup mechanism
You can’t just tar an Elasticsearch data directory while it’s running. The proper way to back up is snapshots to a registered repository.
Step 1: register a repository
A repository is a storage backend — S3, GCS, Azure Blob, or a shared filesystem.
PUT /_snapshot/my_s3_repo
{
"type": "s3",
"settings": {
"bucket": "es-backups-prod",
"region": "us-east-1",
"base_path": "cluster-a"
}
}
On 7.x and earlier, the repository-s3 plugin must be installed on every node. From 8.0 onward S3 support ships built in, so there is nothing to install.
Step 2: take a snapshot
PUT /_snapshot/my_s3_repo/snap_2026_05_26?wait_for_completion=false
By default this snapshots all indices and data streams, plus the cluster state. To limit it, pass a body:
PUT /_snapshot/my_s3_repo/snap_products_only
{
"indices": "products,orders-*",
"ignore_unavailable": true,
"include_global_state": false
}
Snapshots are incremental at the segment level. If a Lucene segment hasn’t changed since the last snapshot, it’s not re-uploaded. The first snapshot is expensive; subsequent ones are cheap.
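The incremental behavior can be pictured as a set difference over segment files. What follows is a toy model, not how Elasticsearch actually tracks repository contents, and the file names are invented for the example.

```python
def files_to_upload(current_segments: set, already_in_repo: set) -> set:
    """Segment files present in the index but not yet stored in the repository."""
    return current_segments - already_in_repo


repo = set()

# First snapshot: nothing is in the repository yet, so everything is uploaded.
day1 = {"_0.cfs", "_1.cfs", "_2.cfs"}
upload1 = files_to_upload(day1, repo)
repo |= upload1

# Second snapshot: one new segment was written; the old ones are unchanged.
day2 = {"_0.cfs", "_1.cfs", "_2.cfs", "_3.cfs"}
upload2 = files_to_upload(day2, repo)
repo |= upload2

print(len(upload1), len(upload2))  # 3 1
```

One consequence of this model: a merge rewrites old segments into a new one, so a heavily merged index uploads more in its next snapshot even if few documents changed.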
Step 3: restore
POST /_snapshot/my_s3_repo/snap_2026_05_26/_restore
{
"indices": "products",
"rename_pattern": "products",
"rename_replacement": "products_restored"
}
The rename lets us restore alongside a live index without conflict — handy for “compare prod vs yesterday”.
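rename_pattern is a regular expression and rename_replacement can reference its capture groups, so one restore request can rename several indices at once. The sketch below imitates that with Python's re module as an approximation: Elasticsearch uses Java regex syntax, where a group reference is written $1 rather than \1. The helper name and the second index name are examples only.

```python
import re


def restored_name(index: str, pattern: str, replacement: str) -> str:
    """Approximate the restore rename by replacing every match of pattern."""
    return re.sub(pattern, replacement, index)


# The literal pattern from the restore request above.
print(restored_name("products", "products", "products_restored"))
# products_restored

# A capture group renames any index; in the restore body this would be
# "rename_pattern": "(.+)" with "rename_replacement": "$1_restored".
print(restored_name("orders-2026", r"(.+)", r"\1_restored"))
# orders-2026_restored
```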
Snapshot Lifecycle Management (SLM)
Don’t write cron jobs to call the snapshot API. Use SLM, which is built in:
PUT /_slm/policy/daily-snapshots
{
"schedule": "0 30 1 * * ?",
"name": "<daily-snap-{now/d}>",
"repository": "my_s3_repo",
"config": { "indices": ["*"] },
"retention": {
"expire_after": "30d",
"min_count": 5,
"max_count": 50
}
}
SLM handles scheduling, retention, and cleanup. Set it up once, then check GET /_slm/policy/daily-snapshots now and then to confirm that last_success is recent.
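How the three retention values interact is easier to see in code. The sketch below is a simplified model, assuming every snapshot succeeded and describing each one only by its age in days; the function is hypothetical and is not SLM's implementation. The newest min_count snapshots are always kept, anything beyond the newest max_count is deleted, and the rest are deleted once they are older than expire_after.

```python
def snapshots_to_delete(ages_days, expire_after=30, min_count=5, max_count=50):
    """Return the ages of snapshots a simplified retention pass would delete."""
    newest_first = sorted(ages_days)
    doomed = []
    for rank, age in enumerate(newest_first):
        if rank < min_count:
            continue                      # always keep the newest min_count
        if rank >= max_count or age > expire_after:
            doomed.append(age)            # over the cap, or expired
    return doomed


# 40 daily snapshots, aged 0 to 39 days: the ones older than 30 days go.
print(snapshots_to_delete(list(range(40))))
# [31, 32, 33, 34, 35, 36, 37, 38, 39]

# Snapshots stopped two months ago: all four are expired, but min_count keeps them.
print(snapshots_to_delete([60, 61, 62, 63]))
# []
```

The second case is why min_count matters: if snapshots stop being taken, retention will not quietly delete the last good backups.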
TL;DR
- Green = healthy, Yellow = HA degraded, Red = data unavailable.
- GET /_cluster/allocation/explain is your best debugging tool.
- Disk watermarks at 85/90/95% cause most prod incidents.
- Snapshots → S3 (or similar) repository → SLM for scheduling. Incremental, cheap, restorable.