Bulk API & Reindexing


Indexing one doc at a time is painfully slow: every request pays a network round trip plus per-request coordination overhead. The Bulk API lets us batch many writes into one request for serious throughput. Reindexing behind an alias lets us change mappings on a live index without downtime.

Bulk API — batching writes

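The bulk endpoint takes newline-delimited JSON. Each op starts with an action line (index, create, update, or delete). index and create are followed by the document source, update is followed by a partial doc or script, and delete has no second line at all.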

POST /_bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "name": "Macbook Pro", "price": 1999 }
{ "index": { "_index": "products", "_id": "2" } }
{ "name": "iPad", "price": 599 }
{ "delete": { "_index": "products", "_id": "old-99" } }

Note: each line ends with a newline, including the last one. This is NDJSON (typically sent with Content-Type: application/x-ndjson), not regular JSON. A common bug is sending a single big JSON array instead.
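A minimal sketch of building that body in Python, using only the standard library. The helper name to_ndjson is made up for illustration; the point is one compact JSON object per line and a trailing newline after the last one.

```python
import json

def to_ndjson(lines):
    """Serialize each dict as one JSON line, ending every line (including the last) with a newline."""
    return "".join(json.dumps(line, separators=(",", ":")) + "\n" for line in lines)

body = to_ndjson([
    {"index": {"_index": "products", "_id": "1"}},
    {"name": "Macbook Pro", "price": 1999},
    # delete has an action line only, no source line
    {"delete": {"_index": "products", "_id": "old-99"}},
])
print(body, end="")
```

The resulting string is what goes in the body of POST /_bulk.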

How big should a batch be?

Rule of thumb: 5–15 MB per request, or roughly 1000–5000 docs depending on doc size. Bigger isn’t always better: large requests put memory pressure on the coordinating node, and by default anything over 100 MB is rejected outright (http.max_content_length).

Practical tuning:

  • Start at 1000 docs/batch
  • Run a representative load and watch the indexing rate
  • Double the batch size until throughput stops improving or you see rejections (HTTP 429)
  • Pick the largest batch size that still improved throughput, just before the plateau
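To experiment with batch sizes, it helps to cap batches by both doc count and bytes. Here is a rough sketch; the function name and default caps are assumptions for illustration, and the byte estimate covers only the source lines, not the action lines.

```python
import json

def batches(docs, max_docs=1000, max_bytes=10 * 1024 * 1024):
    """Yield lists of docs, closing a batch before either cap would be exceeded."""
    batch, size = [], 0
    for doc in docs:
        # Approximate the wire size of this doc: its JSON bytes plus the newline
        doc_bytes = len(json.dumps(doc).encode("utf-8")) + 1
        if batch and (len(batch) >= max_docs or size + doc_bytes > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_bytes
    if batch:
        yield batch

sizes = [len(b) for b in batches(({"n": i} for i in range(2500)), max_docs=1000)]
print(sizes)
```

A single doc larger than max_bytes still goes out in a batch of its own rather than being dropped.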

Parallel bulk indexers

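A single indexing thread rarely saturates the cluster. Start with 4–16 parallel bulk workers and tune on your own hardware, backing off when you see 429 rejections. The official clients have helpers for this: parallel_bulk in the Python client, and BulkProcessor (High Level REST Client) or BulkIngester (newer Java API Client) in Java.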
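The idea behind those helpers can be sketched with a plain thread pool. This is not the client's implementation; send_batch is a hypothetical callable standing in for whatever actually posts one batch to /_bulk.

```python
from concurrent.futures import ThreadPoolExecutor

def index_in_parallel(batches, send_batch, workers=4):
    """Send batches concurrently on a thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map submits every batch up front, so keep the input bounded
        return list(pool.map(send_batch, batches))

# Stand-in sender for illustration: just reports how many docs it was given
results = index_in_parallel([[{"n": 1}, {"n": 2}], [{"n": 3}]], lambda b: len(b), workers=2)
print(results)
```

In real use we would probably reach for the client helper instead, since it also handles chunking and bounded queues.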

Handling partial failures

A bulk request can return HTTP 200 even when individual items failed. Always check the top-level errors flag in the response, then inspect each item’s status:

{
  "took": 30,
  "errors": true,
  "items": [
    { "index": { "_id": "1", "status": 201 } },
    { "index": { "_id": "2", "status": 400, "error": { ... } } }
  ]
}

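Retry only the failed items, with exponential backoff. 429 Too Many Requests (es_rejected_execution_exception) means a write queue was full and is worth retrying. A 400, such as a mapping error, will fail again until the document or the mapping is fixed.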
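A small sketch of sorting a bulk response into "retry" and "give up" buckets. It relies on items coming back in the same order as the actions we sent. Treating 503 as retryable alongside 429 is our assumption, and the function names and backoff numbers are illustrative.

```python
RETRYABLE = {429, 503}  # assumption: treat these statuses as transient

def split_failures(response):
    """Return (retry_positions, permanent_positions) for failed items in a bulk response."""
    retry, permanent = [], []
    if not response.get("errors"):
        return retry, permanent
    for position, item in enumerate(response["items"]):
        # Each item is a one-key dict such as {"index": {...}}
        _op, result = next(iter(item.items()))
        status = result.get("status", 0)
        if status < 300:
            continue
        (retry if status in RETRYABLE else permanent).append(position)
    return retry, permanent

def backoff_delays(attempts, base=0.5, cap=30.0):
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]

sample = {"errors": True, "items": [
    {"index": {"_id": "1", "status": 201}},
    {"index": {"_id": "2", "status": 400}},
    {"index": {"_id": "3", "status": 429}},
]}
print(split_failures(sample), backoff_delays(4))
```

The positions map back to the original action list, so we can rebuild a smaller bulk body from just the retryable ones.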

Reindexing — when mappings need to change

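Existing field mappings in Elasticsearch are mostly immutable. We can add new fields, but we can’t change a field from text to keyword or swap its analyzer in place. The number of primary shards is also fixed at index creation (the shrink and split APIs can change it, but only within constraints). In the general case, you create a new index and reindex into it.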

The Reindex API

POST /_reindex
{
  "source": { "index": "products_v1" },
  "dest":   { "index": "products_v2" }
}

Optionally transform docs inline:

POST /_reindex
{
  "source": { "index": "products_v1" },
  "dest":   { "index": "products_v2" },
  "script": {
    "source": "ctx._source.price_cents = Math.round(ctx._source.price * 100.0); ctx._source.remove('price')"
  }
}

For big reindexes, run it async with ?wait_for_completion=false. The call returns a task ID right away, which you can monitor with GET /_tasks/{task_id} and cancel with POST /_tasks/{task_id}/_cancel.
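When polling the task, a rough progress figure can be derived from the status counters in the response. A sketch, assuming the usual shape of a reindex task response (completed, plus task.status with total, created, updated, deleted); the helper name is made up.

```python
def reindex_progress(task_response):
    """Return (completed, fraction_done) from a GET /_tasks/{task_id} response for a reindex."""
    status = task_response["task"]["status"]
    done = status.get("created", 0) + status.get("updated", 0) + status.get("deleted", 0)
    total = status.get("total", 0)
    fraction = done / total if total else 0.0
    return task_response.get("completed", False), fraction

# Hand-written sample response, not real cluster output
sample = {"completed": False, "task": {"status": {"total": 1000, "created": 250, "updated": 0, "deleted": 0}}}
print(reindex_progress(sample))
```

Treat the fraction as an estimate: skipped docs and version conflicts are not counted here.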

Zero-downtime reindex via aliases

This is the killer pattern. We never expose raw index names to the app — we expose an alias.

Zero-downtime reindex flow
1. App reads/writes via alias products → pointing to products_v1.
2. Create products_v2 with new mapping/settings.
3. Reindex v1 → v2 (live). New writes still hit v1.
4. Catch up on writes that landed in v1 after the reindex started: dual-write to both indices during the copy, or replay the delta afterwards.
5. Atomically swap alias: remove from v1, add to v2.
6. Verify, then drop products_v1.
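One way to replay the delta in step 4 is a second, filtered reindex. The sketch below builds that request body, assuming every doc carries a timestamp field such as updated_at (a hypothetical field, not something Elasticsearch adds for us). It won't pick up deletes made on v1 during the copy.

```python
import json

def delta_reindex_body(source, dest, since_iso, timestamp_field="updated_at"):
    """Body for POST /_reindex that copies only docs changed at or after since_iso."""
    return {
        "source": {
            "index": source,
            "query": {"range": {timestamp_field: {"gte": since_iso}}},
        },
        # Default dest behavior overwrites docs that share an _id, which is what a replay needs
        "dest": {"index": dest},
    }

body = delta_reindex_body("products_v1", "products_v2", "2024-01-01T00:00:00Z")
print(json.dumps(body, indent=2))
```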

The atomic alias swap:

POST /_aliases
{
  "actions": [
    { "remove": { "index": "products_v1", "alias": "products" } },
    { "add":    { "index": "products_v2", "alias": "products" } }
  ]
}

Both actions happen in one cluster state update — there’s no moment where the alias points to nothing.
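If the swap is scripted, it's worth generating the body rather than hand-editing index names. A tiny sketch; the helper name is hypothetical, and the guard only catches the obvious mistake of swapping an index with itself.

```python
import json

def swap_alias_body(alias, old_index, new_index):
    """Body for POST /_aliases that moves alias from old_index to new_index in one request."""
    if old_index == new_index:
        raise ValueError("old and new index must differ")
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

swap = swap_alias_body("products", "products_v1", "products_v2")
print(json.dumps(swap))
```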

Tips for fast reindexing

  • Set number_of_replicas: 0 on the destination during the copy. Add replicas after.
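  • Set refresh_interval: -1 on the destination during the copy, then restore it afterwards.
  • Pass slices=auto as a query parameter on the reindex request (POST /_reindex?slices=auto) to parallelize across source shards.
  • If you’re moving to a new cluster, for example for a major-version upgrade, use reindex from remote: source.remote points _reindex at another cluster, which must be allowlisted via reindex.remote.whitelist. This reads from a live remote cluster, not from a snapshot, and it doesn’t support slicing.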
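Put together, the local-cluster tips amount to three requests in order. The sketch below only builds them as (method, path, body) tuples; the function name and the replica count restored at the end are assumptions. Setting refresh_interval to null resets it to the index default.

```python
import json

def fast_reindex_plan(source, dest, replicas_after=1):
    """Ordered (method, path, body) requests: relax settings, reindex, restore settings."""
    return [
        ("PUT", f"/{dest}/_settings",
         {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}),
        ("POST", "/_reindex?slices=auto&wait_for_completion=false",
         {"source": {"index": source}, "dest": {"index": dest}}),
        # Run this last step only after the reindex task has completed
        ("PUT", f"/{dest}/_settings",
         {"index": {"number_of_replicas": replicas_after, "refresh_interval": None}}),
    ]

plan = fast_reindex_plan("products_v1", "products_v2")
for method, path, body in plan:
    print(method, path, json.dumps(body))
```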

The interview-quality answer: “always front your indices with an alias from day one, so you can reindex without touching app code”.