Cluster Health & Snapshots


Two ops-y topics that come up in basically every Elasticsearch interview: “what does cluster status mean?” and “how do you back up an Elasticsearch cluster?”. Both are simpler than they sound.

Cluster health: green, yellow, red

GET /_cluster/health returns a status. In simple language:

GREEN
All primaries and all replicas are assigned. Cluster is happy.
YELLOW
All primaries assigned, but some replicas are not. Reads/writes still work; HA is degraded.
RED
At least one primary shard is unassigned. That data is unavailable. Action required.
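
As a rough illustration of the three states, the Python sketch below turns a _cluster/health response body into a one-line summary. The function name summarize_health and the message wording are invented for this example; status and unassigned_shards are fields the health API returns.

```python
def summarize_health(health: dict) -> str:
    """Map a _cluster/health response body to a short human-readable summary."""
    status = health["status"]
    unassigned = health.get("unassigned_shards", 0)
    if status == "green":
        return "green: all primaries and replicas assigned"
    if status == "yellow":
        # All primaries are assigned, so the unassigned shards are replicas.
        return f"yellow: {unassigned} replica shard(s) unassigned, HA degraded"
    if status == "red":
        return f"red: {unassigned} shard(s) unassigned, at least one primary missing"
    raise ValueError(f"unexpected status: {status!r}")


# Hand-written response, roughly what a single-node dev cluster might return.
sample = {"status": "yellow", "number_of_nodes": 1, "unassigned_shards": 3}
print(summarize_health(sample))
```

In a real check you would feed it the parsed JSON from GET /_cluster/health instead of the hand-written sample.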

Common causes by color

  • Yellow on a single-node dev cluster is normal — replicas can’t be assigned to the same node as the primary, so they stay unassigned.
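  • Yellow on a multi-node cluster after a node leaves: the lost replicas have to be rebuilt on the remaining nodes. Elasticsearch waits index.unassigned.node_left.delayed_timeout (default 1m) for the node to come back, then re-creates them on its own. Usually you just wait; if allocation failed repeatedly, retry it with POST /_cluster/reroute?retry_failed=true.
  • Red after disks fill up: shards on a full disk can fail, and primaries of new indices can’t be allocated to nodes above the high watermark. Free disk or expand storage.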
  • Red after a corrupted shard — restore from snapshot.

Drilling deeper

GET /_cluster/health?level=indices
GET /_cluster/allocation/explain

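allocation/explain is the single best command for “why is this shard not where I want it?”. It tells you which allocation decider blocked the assignment (disk watermark, awareness rules, allocation filtering, total_shards_per_node, etc.). Called without a body, it explains an arbitrary unassigned shard; pass index, shard, and primary in the body to ask about a specific one.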
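
To make the explain output easier to scan, here is a small Python sketch that builds the request body for one specific shard and pulls out every decider that answered NO. The helper names and the sample response are hand-written for illustration; the field names (node_allocation_decisions, deciders, decider, decision) follow the shape of the API response, but treat the sample as a trimmed approximation, not real output.

```python
def explain_request(index: str, shard: int, primary: bool) -> dict:
    """Body for GET /_cluster/allocation/explain targeting one specific shard."""
    return {"index": index, "shard": shard, "primary": primary}


def blocking_deciders(explain_response: dict) -> list[tuple[str, str]]:
    """Return (node_name, decider) pairs for every decider that answered NO."""
    blocked = []
    for node in explain_response.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blocked.append((node.get("node_name", "?"), decider["decider"]))
    return blocked


# Trimmed, hand-written response for illustration only.
sample = {
    "index": "products",
    "shard": 0,
    "primary": False,
    "node_allocation_decisions": [
        {"node_name": "node-1",
         "deciders": [{"decider": "same_shard", "decision": "NO",
                       "explanation": "a copy of this shard is already on this node"}]},
        {"node_name": "node-2",
         "deciders": [{"decider": "disk_threshold", "decision": "NO",
                       "explanation": "the node is above the low watermark"}]},
    ],
}
print(explain_request("products", 0, False))
print(blocking_deciders(sample))
```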

Disk watermarks (a frequent red-cluster culprit)

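  • Low watermark (default 85%): Elasticsearch stops allocating shards to that node. Primaries of newly created indices are exempt; their replicas are not.
  • High watermark (default 90%): Elasticsearch tries to relocate shards off the node and allocates nothing new to it.
  • Flood stage (default 95%): every index with a shard on that node gets the index.blocks.read_only_allow_delete block. Writes are rejected, but the block itself does not change the health color.

Since 7.4 the block is released automatically once disk usage drops back below the high watermark. On older versions, free disk and then lift the read-only block yourself: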

PUT /_all/_settings
{ "index.blocks.read_only_allow_delete": null }
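
The three thresholds can be sketched as a tiny classifier. This is only a mental model in Python, assuming the default percentages and a plain "more than X% used" comparison; the function name watermark_stage and its return labels are made up here, and real clusters may configure the watermarks as absolute byte values instead.

```python
def watermark_stage(used_percent: float, low: float = 85.0,
                    high: float = 90.0, flood: float = 95.0) -> str:
    """Classify a node's disk usage against the three watermark thresholds."""
    if not low <= high <= flood:
        raise ValueError("expected low <= high <= flood")
    if used_percent > flood:
        return "flood_stage"   # indices on the node get the read-only block
    if used_percent > high:
        return "high"          # shards are relocated away from the node
    if used_percent > low:
        return "low"           # no further shards are allocated to the node
    return "ok"


for usage in (80.0, 87.5, 92.0, 97.0):
    print(usage, watermark_stage(usage))
```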

Snapshots — the only real backup mechanism

You can’t just tar an Elasticsearch data directory while it’s running. The proper way to back up is snapshots to a registered repository.

Step 1: register a repository

A repository is a storage backend — S3, GCS, Azure Blob, or a shared filesystem.

PUT /_snapshot/my_s3_repo
{
  "type": "s3",
  "settings": {
    "bucket": "es-backups-prod",
    "base_path": "cluster-a"
  }
}

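Before 8.0, the repository-s3 plugin has to be installed on every node; from 8.0 on it ships built in. Credentials go in the Elasticsearch keystore (s3.client.default.access_key and s3.client.default.secret_key), not in the repository settings.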

Step 2: take a snapshot

PUT /_snapshot/my_s3_repo/snap_2026_05_26?wait_for_completion=false

By default this snapshots all open indices and data streams, plus the cluster state. To limit it:

PUT /_snapshot/my_s3_repo/snap_products_only
{
  "indices": "products,orders-*",
  "ignore_unavailable": true,
  "include_global_state": false
}
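
Because wait_for_completion defaults to false, the PUT returns before the snapshot is done, and you check progress with GET /_snapshot/my_s3_repo/<snapshot name>. The Python sketch below shows one way to poll for a terminal state. The names wait_for_snapshot and fetch are hypothetical: fetch stands in for whatever HTTP client call returns the parsed JSON of that GET request, and the demo uses canned responses instead of a live cluster.

```python
import time
from typing import Callable

# States after which the snapshot will not change any more.
TERMINAL_STATES = {"SUCCESS", "PARTIAL", "FAILED", "INCOMPATIBLE"}


def snapshot_state(response: dict) -> str:
    """Extract the state of the first snapshot in a GET /_snapshot/... response."""
    return response["snapshots"][0]["state"]


def wait_for_snapshot(fetch: Callable[[], dict], interval_s: float = 5.0,
                      max_polls: int = 120) -> str:
    """Call fetch() until the snapshot reaches a terminal state, then return it."""
    for _ in range(max_polls):
        state = snapshot_state(fetch())
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval_s)
    raise TimeoutError("snapshot still in progress after max_polls checks")


# Canned responses standing in for two consecutive GET calls.
responses = iter([
    {"snapshots": [{"snapshot": "snap_products_only", "state": "IN_PROGRESS"}]},
    {"snapshots": [{"snapshot": "snap_products_only", "state": "SUCCESS"}]},
])
print(wait_for_snapshot(lambda: next(responses), interval_s=0))
```

Treat anything other than SUCCESS as something to investigate: PARTIAL means at least one shard could not be snapshotted.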

Snapshots are incremental at the segment level. If a Lucene segment hasn’t changed since the last snapshot, it’s not re-uploaded. The first snapshot is expensive; subsequent ones are cheap.
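
A toy model of that incrementality, in Python: treat the repository and the shard as sets of segment file names and upload only the difference. This is a deliberate simplification for intuition, not how Elasticsearch implements it (the real comparison also involves file metadata), and the file names are made up.

```python
def files_to_upload(repo_files: set[str], shard_files: set[str]) -> list[str]:
    """Segment files present in the shard but not yet in the repository."""
    return sorted(shard_files - repo_files)


# First snapshot: nothing is in the repository yet, so everything is uploaded.
first = files_to_upload(set(), {"_0.cfs", "_1.cfs"})

# Later snapshot: only one new segment appeared since the first one.
second = files_to_upload({"_0.cfs", "_1.cfs"}, {"_0.cfs", "_1.cfs", "_2.cfs"})

# After a merge rewrote everything into one new segment, that file is new again.
after_merge = files_to_upload({"_0.cfs", "_1.cfs", "_2.cfs"}, {"_3.cfs"})

print(first, second, after_merge)
```

The last case is why a heavy merge or a force merge can make the next snapshot noticeably larger than usual.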

Step 3: restore

POST /_snapshot/my_s3_repo/snap_2026_05_26/_restore
{
  "indices": "products",
  "rename_pattern": "products",
  "rename_replacement": "products_restored"
}

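The rename lets us restore alongside a live index without conflict, which is handy for “compare prod vs yesterday”. Without it, the restore fails while an open index named products exists. Note that rename_pattern is a regular expression, so anchor it (^products$) if other restored index names could contain the same text.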
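
Since rename_pattern is a regular expression, it helps to try a pattern before restoring. The Python sketch below approximates the renaming with re.sub; Elasticsearch uses Java regex syntax (capture groups are referenced as $1 there, not \1), so this only mirrors simple patterns like the ones shown. The helper name renamed is made up for the example.

```python
import re


def renamed(index_name: str, rename_pattern: str, rename_replacement: str) -> str:
    """Approximate the restore rename: replace every match of the pattern."""
    return re.sub(rename_pattern, rename_replacement, index_name)


# The pattern from the restore request above.
print(renamed("products", "products", "products_restored"))       # products_restored

# An unanchored pattern also rewrites names that merely contain the text.
print(renamed("products-v2", "products", "products_restored"))    # products_restored-v2

# Anchoring limits the rename to the exact index name.
print(renamed("products-v2", "^products$", "products_restored"))  # products-v2
```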

Snapshot Lifecycle Management (SLM)

Don’t write cron jobs to call the snapshot API. Use SLM, which is built in:

PUT /_slm/policy/daily-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "my_s3_repo",
  "config": { "indices": ["*"] },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}

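SLM handles scheduling, retention, and cleanup. The schedule is a cron expression evaluated in UTC, so this policy runs daily at 01:30 UTC. Set it up once, then check GET /_slm/policy/daily-snapshots now and then to confirm the last run succeeded.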
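
How the three retention settings interact is easier to see in code. The following Python sketch is a simplified model, not SLM's actual implementation: it works on snapshot ages in whole days, treats a snapshot as expired when it is strictly older than expire_after, always keeps the newest min_count snapshots, and never keeps more than max_count. The function name and the day-based ages are assumptions made for the example.

```python
def snapshots_to_delete(ages_days: list[int], expire_after: int = 30,
                        min_count: int = 5, max_count: int = 50) -> list[int]:
    """Return the ages of snapshots a retention pass would remove (simplified)."""
    newest_first = sorted(ages_days)
    doomed = []
    for position, age in enumerate(newest_first):
        # Beyond max_count, a snapshot goes even if it has not expired.
        over_max = position >= max_count
        # Expired snapshots go only once min_count newer ones are kept.
        expired = age > expire_after and position >= min_count
        if over_max or expired:
            doomed.append(age)
    return doomed


# Seven snapshots, four of them past 30 days: min_count=5 protects two of those.
print(snapshots_to_delete([1, 2, 3, 40, 45, 50, 60]))  # [50, 60]
```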

TL;DR

  • Green = healthy, Yellow = HA degraded, Red = data unavailable.
  • GET /_cluster/allocation/explain is your best debugging tool.
  • Disk watermarks (85/90/95% by default) are a common cause of unassigned shards and write blocks in production.
  • Snapshots → S3 (or similar) repository → SLM for scheduling. Incremental, cheap, restorable.