Common Pitfalls


A loaded interview question: “What’s gone wrong on an Elasticsearch cluster you’ve worked with?” Here are the most common production landmines, what causes them, and how to dodge each.

Pitfall → Consequence → Fix

Pitfall           | Consequence                   | Fix
Mapping explosion | OOM, slow cluster state       | Disable dynamic mapping; use flattened
Deep pagination   | 10k window, OOM coordinator   | Use search_after + PIT
Hot shards        | Uneven CPU, slow tail latency | Better routing keys, more shards
Oversized docs    | Slow indexing, GC churn       | Strip blobs, split into child docs
Refresh misuse    | Throughput collapse           | Raise refresh_interval, never refresh=true

1. Mapping explosion

Dynamic mapping auto-creates fields when it sees new JSON keys. Index 100k docs where each has a unique key (think event_id_abc123: { ... }) and the mapping tries to grow to 100k fields. Every field mapping lives in the cluster state, which is held in memory on every node and republished whenever the mapping changes.

// Bad: a free-form JSON blob with dynamic mapping
{ "properties": { "user_attributes": { "type": "object" } } }

Fixes:

  • Disable dynamic mapping with "dynamic": "strict" so unexpected fields throw an error instead of being added.
  • For genuinely open-ended data, use the flattened field type — it treats the whole object as a single field.
{ "properties": { "user_attributes": { "type": "flattened" } } }

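Safety limit: index.mapping.total_fields.limit defaults to 1000 fields per index, and writes that would add fields beyond it are rejected. The setting is configurable, but hitting it means you’ve already lost; raising it only postpones the problem.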
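
One way to catch this before it reaches the cluster is to count the distinct field paths in a sample of documents. The sketch below is a rough estimate only: the helper names are made up for illustration, and it counts leaf paths, whereas Elasticsearch also counts object fields and multi-fields toward the limit, so treat the result as a lower bound.

```python
from typing import Any, Iterable


def field_paths(doc: dict[str, Any], prefix: str = "") -> set[str]:
    """Return the dotted leaf paths that dynamic mapping would see in one document."""
    paths: set[str] = set()
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested objects contribute their own leaves under this path.
            paths |= field_paths(value, path)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    paths |= field_paths(item, path)
                else:
                    paths.add(path)
        else:
            paths.add(path)
    return paths


def distinct_field_count(docs: Iterable[dict[str, Any]]) -> int:
    """Count distinct leaf paths across a sample of documents."""
    seen: set[str] = set()
    for doc in docs:
        seen |= field_paths(doc)
    return len(seen)


# Hypothetical sample: two docs with unique top-level keys, one well-behaved doc.
sample = [
    {"event_id_abc123": {"ts": 1}},
    {"event_id_def456": {"ts": 2}},
    {"user": {"name": "a", "tags": ["x", "y"]}},
]
estimated_fields = distinct_field_count(sample)
```

If the count keeps growing roughly linearly with the sample size, keys are probably carrying data (IDs, timestamps) and the object is a candidate for flattened.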

2. Deep pagination

Covered in detail in the pagination note. Short version: from + size > 10000 is rejected by default (index.max_result_window), and even below that, every shard has to collect its top from + size hits and send them to the coordinator, which sorts them all and throws most away. Use search_after with a PIT. Don’t raise max_result_window as a “fix”.
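
A minimal sketch of the search_after + PIT loop follows. It assumes the PIT has already been opened, and it takes the transport as a plain callable (search) that sends a request body to the search endpoint and returns the parsed JSON response; wiring that to a real client is left out. The function name, the sort field n, and the in-memory stand-in are all hypothetical.

```python
from typing import Any, Callable, Iterator

SearchFn = Callable[[dict[str, Any]], dict[str, Any]]


def iter_hits(
    search: SearchFn,
    pit_id: str,
    sort: list[dict[str, str]],
    page_size: int = 1000,
    keep_alive: str = "1m",
) -> Iterator[dict[str, Any]]:
    """Yield every hit by following search_after cursors inside one PIT."""
    search_after = None
    while True:
        body: dict[str, Any] = {
            "size": page_size,
            "sort": sort,
            "pit": {"id": pit_id, "keep_alive": keep_alive},
        }
        if search_after is not None:
            body["search_after"] = search_after
        response = search(body)
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # The sort values of the last hit are the cursor for the next page.
        search_after = hits[-1]["sort"]
        # The response may carry a refreshed PIT id; keep using the latest one.
        pit_id = response.get("pit_id", pit_id)


def fake_search(body: dict[str, Any]) -> dict[str, Any]:
    """In-memory stand-in for a cluster holding five docs sorted by n = 0..4."""
    after = body.get("search_after")
    start = 0 if after is None else after[0] + 1
    page = range(start, min(start + body["size"], 5))
    hits = [{"_id": str(n), "sort": [n]} for n in page]
    return {"pit_id": body["pit"]["id"], "hits": {"hits": hits}}


ids = [h["_id"] for h in iter_hits(fake_search, "hypothetical-pit-id", [{"n": "asc"}], page_size=2)]
```

Each request costs the shards only page_size hits regardless of how deep the scan is, which is the point of the technique. The PIT should be closed once the loop finishes.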

3. Hot shards

Custom routing or natural data skew (one tenant = 80% of traffic) leads to one shard being a CPU bottleneck while siblings idle. Symptoms: high p99 latency, uneven hot_threads output across nodes.

Diagnose:

GET /_cat/shards?v&s=store:desc
GET /_nodes/hot_threads

Fixes:

  • Compound the routing key (tenant_id + "_" + shard_bucket) so one large tenant maps to several routing values
  • Use index.routing_partition_size (an index setting fixed at creation time) to spread one tenant across multiple shards
  • For time-series, switch to data streams with rollover so writes always go to the newest index
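
The compound routing key from the first fix can be sketched as below. The bucket count and the choice of CRC32 over the document ID are assumptions for illustration; any stable hash works, as long as it is the same at index time and at query time.

```python
import zlib


def routing_key(tenant_id: str, doc_id: str, buckets: int = 8) -> str:
    """Build tenant_id + "_" + shard_bucket, with the bucket derived from the doc ID."""
    if buckets < 1:
        raise ValueError("buckets must be >= 1")
    # zlib.crc32 is stable across processes, unlike Python's built-in hash() for str.
    bucket = zlib.crc32(doc_id.encode("utf-8")) % buckets
    return f"{tenant_id}_{bucket}"


def all_routing_keys(tenant_id: str, buckets: int = 8) -> list[str]:
    """Every routing value a tenant's documents may have been indexed under."""
    return [f"{tenant_id}_{b}" for b in range(buckets)]
```

The trade-off: a search for the whole tenant now has to pass every value from all_routing_keys (the routing parameter accepts a comma-separated list), so it touches several shards instead of one. That is usually what you want for the 80% tenant and unnecessary for small ones.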

4. Oversized documents

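Stuffing a 5 MB PDF as base64 into a doc field is asking for pain. The whole body is parsed at index time, copied to every replica, stored in _source, and loaded and shipped again for every hit that returns it. Heap pressure and GC churn go up, and network transfer is slow.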

Rules of thumb:

  • Aim for < 100 KB per doc for search workloads
  • Strip binaries before indexing — put them in S3 and store the URL
  • For genuinely nested arrays (think comments: [...] with thousands of entries), consider parent-child or splitting into separate docs

The http.max_content_length setting caps request size at 100 MB, but you should never approach that.
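
As an illustration of the "split into separate docs" option, the sketch below turns one document with a large comments array into a slim parent plus one small document per comment. The ID scheme and the parent_id / position fields are hypothetical, and it assumes every array entry is a JSON object.

```python
from typing import Any, Iterator


def split_array_field(
    doc_id: str, doc: dict[str, Any], field: str = "comments"
) -> Iterator[tuple[str, dict[str, Any]]]:
    """Yield (id, body) pairs: the parent without the array, then one doc per entry."""
    parent = {k: v for k, v in doc.items() if k != field}
    yield doc_id, parent
    for position, entry in enumerate(doc.get(field, [])):
        # Each child carries a pointer back to its parent so results can be regrouped.
        child = {"parent_id": doc_id, "position": position, **entry}
        yield f"{doc_id}-{field}-{position}", child


post = {"title": "t", "comments": [{"text": "a"}, {"text": "b"}]}
pieces = list(split_array_field("p1", post))
```

Updating or adding one comment then rewrites a small document instead of reindexing the whole parent.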

5. Refresh interval misuse

POST /<index>/_doc?refresh=true forces a refresh on every single write. It usually starts in tests, where a document has to be searchable immediately, and then leaks into prod code. Each refresh creates a new tiny Lucene segment, kicks off merges, and tanks throughput.

Symptoms: writes work fine in dev (low volume), crawl in prod (high volume).

# Wrong (production)
POST /events/_doc?refresh=true
{ ... }

# Right
POST /events/_doc
{ ... }
# Trust the 1-second default, or use refresh=wait_for if you must

For bulk-load jobs, go further:

PUT /events/_settings
{ "index.refresh_interval": "-1", "index.number_of_replicas": 0 }

Then restore after the load.
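
Because the restore step is the one that gets forgotten, it can help to tie it to the load itself. The sketch below is one way to do that, assuming a put_settings callable that wraps PUT /<index>/_settings; that callable and the function name are placeholders, not a client API. Setting refresh_interval to null resets it to the index default.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator, Optional

PutSettings = Callable[[str, dict[str, Any]], None]


@contextmanager
def bulk_load_settings(
    put_settings: PutSettings,
    index: str,
    replicas: int = 1,
    refresh_interval: Optional[str] = None,
) -> Iterator[None]:
    """Disable refresh and replicas for a bulk load, and always restore them afterwards."""
    put_settings(index, {"index.refresh_interval": "-1", "index.number_of_replicas": 0})
    try:
        yield
    finally:
        # Runs even if the load raises, so replicas are never left at 0 by accident.
        put_settings(
            index,
            {"index.refresh_interval": refresh_interval, "index.number_of_replicas": replicas},
        )


# Stand-in that records calls instead of talking to a cluster.
calls: list[tuple[str, dict[str, Any]]] = []
with bulk_load_settings(lambda index, settings: calls.append((index, settings)), "events"):
    pass  # send the bulk requests here
```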

6. Bonus: replica = 0 in production

A zero-replica setup (primary shards only) means losing one node = losing data. We’ve seen teams disable replicas “for performance” and forget to re-enable them. Always run with number_of_replicas >= 1 in production. Use replicas for HA, not just read throughput.

7. Bonus: searching across hundreds of indices

GET /logs-*/_search looks innocent but can fan out to thousands of shards on a cluster with daily indices. Each shard adds coordinator overhead. Mitigations:

  • Search with a tight date range, and target only the index names that cover that range instead of the full logs-* wildcard
  • Consider rollover + ILM (Index Lifecycle Management) to consolidate old data
  • Use pre_filter_shard_size so the coordinator skips shards that can’t match
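
For daily indices, pruning by name can be as simple as generating the index list from the query's date range. The sketch below assumes a prefix-YYYY.MM.DD naming scheme, which is a common convention but not something this note specifies; adjust the format to whatever your indices actually use.

```python
from datetime import date, timedelta


def daily_indices(prefix: str, start: date, end: date) -> list[str]:
    """List the daily index names covering start..end inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days + 1
    return [
        f"{prefix}-{(start + timedelta(days=offset)).strftime('%Y.%m.%d')}"
        for offset in range(days)
    ]


# Three days spanning a leap-day boundary.
targets = daily_indices("logs", date(2024, 2, 28), date(2024, 3, 1))
search_path = "/" + ",".join(targets) + "/_search"
```

For long ranges the comma-separated list can get unwieldy in a URL, at which point a coarser wildcard (one month, say) is a reasonable middle ground.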

The meta-lesson: most Elasticsearch pitfalls aren’t bugs in Elasticsearch — they’re defaults that work great at small scale and bite at large scale. Know which knobs change with traffic, and turn them before the page.