A loaded interview question: “What’s gone wrong on an Elasticsearch cluster you’ve worked with?” Here are the most common production landmines, what causes them, and how to dodge each.
| Pitfall | Consequence | Fix |
|---|---|---|
| Mapping explosion | OOM, slow cluster state | Disable dynamic mapping; use flattened |
| Deep pagination | 10k window, OOM coordinator | Use search_after + PIT |
| Hot shards | Uneven CPU, slow tail latency | Better routing keys, more shards |
| Oversized docs | Slow indexing, GC churn | Strip blobs, split into child docs |
| Refresh misuse | Throughput collapse | Raise refresh_interval, never refresh=true |
1. Mapping explosion
Dynamic mapping auto-creates fields when it sees new JSON keys. Index 100k docs where each has a unique key (think event_id_abc123: { ... }) and you’ve got 100k fields. Each field uses memory in the cluster state, which is replicated to every node on every change.
// Bad: a free-form JSON blob with dynamic mapping
{ "properties": { "user_attributes": { "type": "object" } } }
Fixes:
- Disable dynamic mapping with "dynamic": "strict" so unexpected fields throw an error instead of being added.
- For genuinely open-ended data, use the flattened field type; it treats the whole object as a single field.
{ "properties": { "user_attributes": { "type": "flattened" } } }
Hard limit: index.mapping.total_fields.limit defaults to 1000. Hitting it means you’ve already lost.
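One way to catch an explosion before it reaches the cluster is to count the distinct field paths in a sample of documents. The sketch below is a rough lower bound only: the helper names are made up for illustration, and the real mapping will be larger because object fields and multi-fields (such as the .keyword sub-field that default dynamic mapping adds to strings) also count toward the limit.

```python
def field_paths(value, prefix=""):
    """Yield dotted paths of leaf values, roughly what dynamic mapping would create."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from field_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        # Arrays do not create extra fields; every item shares the same path.
        for item in value:
            yield from field_paths(item, prefix)
    elif value is not None:
        # Null values are skipped: they do not produce a mapping on their own.
        yield prefix


def distinct_field_count(docs):
    """Count unique leaf field paths across a sample of documents."""
    seen = set()
    for doc in docs:
        seen.update(field_paths(doc))
    return len(seen)


sample = [
    {"user": {"name": "a", "tags": ["x", "y"]}},
    {"event_id_abc": {"v": 1}},
    {"event_id_def": {"v": 2}},
]
field_count = distinct_field_count(sample)
```

If the count keeps growing as the sample grows, the keys are carrying data rather than structure, which is the case the flattened type is meant for.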
2. Deep pagination
Covered in detail in the pagination note. Short version: from + size > 10000 is forbidden by default, and even before that, deep pagination ships massive amounts of data across the network. Use search_after with a PIT. Don’t raise max_result_window as a “fix”.
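As a minimal sketch of the search_after + PIT loop, the helpers below only build request bodies and read the cursor from a response; sending them is left to whatever client you use. The function names and the @timestamp sort field are assumptions for illustration. The PIT id would come from opening a point in time on the index first (POST /events/_pit?keep_alive=1m), and the search itself is sent to /_search without an index in the path.

```python
def pit_page_body(pit_id, sort, size=1000, search_after=None, keep_alive="1m"):
    """Build the body for one page of a PIT + search_after scan."""
    body = {
        "size": size,
        "sort": sort,
        "pit": {"id": pit_id, "keep_alive": keep_alive},
    }
    if search_after is not None:
        # Sort values of the last hit on the previous page act as the cursor.
        body["search_after"] = search_after
    return body


def next_search_after(response):
    """Return the cursor for the next page, or None when there are no more hits."""
    hits = response["hits"]["hits"]
    return hits[-1]["sort"] if hits else None


first = pit_page_body("example-pit-id", [{"@timestamp": "asc"}], size=2)
# A hand-written response in the shape Elasticsearch returns for sorted hits.
fake_response = {"hits": {"hits": [{"sort": [1700000000000, 7]},
                                    {"sort": [1700000000500, 9]}]}}
second = pit_page_body("example-pit-id", [{"@timestamp": "asc"}], size=2,
                       search_after=next_search_after(fake_response))
```

Each page costs the same as the first, because no shard has to collect and discard the from + size hits that came before it.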
3. Hot shards
Custom routing or natural data skew (one tenant = 80% of traffic) leads to one shard being a CPU bottleneck while siblings idle. Symptoms: high p99 latency, uneven hot_threads output across nodes.
Diagnose:
GET /_cat/shards?v&s=store:desc
GET /_nodes/hot_threads
Fixes:
- Compound the routing key (tenant_id + "_" + shard_bucket)
- Use routing_partition_size to spread one tenant across multiple shards
- For time-series, switch to data streams with rollover so writes always go to the newest index
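The compound routing key can be sketched as follows. The function names, the bucket count of 4, and the use of CRC32 are illustrative assumptions; any stable hash of the document id would do. Note that two different routing values can still land on the same shard, so this spreads load on average and does not guarantee one shard per bucket.

```python
import zlib


def compound_routing(tenant_id, doc_id, buckets=4):
    """Derive a stable routing value that spreads one tenant over several buckets."""
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    bucket = zlib.crc32(doc_id.encode("utf-8")) % buckets
    return f"{tenant_id}_{bucket}"


def all_routings(tenant_id, buckets=4):
    """Comma-separated routing values to pass when searching a whole tenant."""
    return ",".join(f"{tenant_id}_{b}" for b in range(buckets))


write_routing = compound_routing("acme", "doc-1")
read_routing = all_routings("acme", buckets=2)
```

The trade-off is on the read side: a query for the whole tenant now has to pass every bucket's routing value instead of a single one.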
4. Oversized documents
Stuffing a 5 MB PDF as base64 into a doc field is asking for pain. The whole document is parsed at index time, the large _source is copied again on every segment merge and loaded whenever the doc is returned as a hit, and network transfer is slow.
Rules of thumb:
- Aim for < 100 KB per doc for search workloads
- Strip binaries before indexing — put them in S3 and store the URL
- For genuinely nested arrays (think comments: [...] with thousands of entries), consider parent-child or splitting into separate docs
The http.max_content_length setting caps request size at 100 MB by default, but you should never approach that.
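A small pre-indexing step along these lines might look like the sketch below. Everything here is hypothetical: store_blob stands in for whatever uploads the binary (for example to S3) and returns its URL, and the _url suffix is just a naming choice.

```python
import json

MAX_DOC_BYTES = 100 * 1024  # the 100 KB rule of thumb from above


def doc_size_bytes(doc):
    """Size of the document as compact UTF-8 JSON."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


def externalize_blobs(doc, blob_fields, store_blob):
    """Return a copy of doc with each blob field replaced by a URL reference."""
    slim = dict(doc)  # shallow copy so the caller's document is left untouched
    for field in blob_fields:
        if field in slim:
            slim[field + "_url"] = store_blob(slim.pop(field))
    return slim


doc = {"title": "Q3 report", "pdf_base64": "A" * 200_000}
# Stand-in uploader for the example; a real one would write to object storage.
slim = externalize_blobs(doc, ["pdf_base64"],
                         lambda blob: "s3://example-bucket/q3-report.pdf")
```

Checking doc_size_bytes(slim) against MAX_DOC_BYTES before indexing gives an early warning when a new field starts carrying payloads.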
5. Refresh interval misuse
POST /events/_doc?refresh=true forces a refresh on every single write. We see this in tests, where it makes a write searchable immediately, and it leaks into prod code. Each refresh creates a new tiny Lucene segment, kicks off merges, and tanks throughput.
Symptoms: writes work fine in dev (low volume), crawl in prod (high volume).
# Wrong (production)
POST /events/_doc?refresh=true
{ ... }
# Right
POST /events/_doc
{ ... }
# Trust the 1-second default, or use refresh=wait_for if you must
For bulk-load jobs, go further:
PUT /events/_settings
{ "index.refresh_interval": "-1", "index.number_of_replicas": 0 }
Then restore after the load.
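Because the restore step is the one that gets forgotten, it can help to tie both settings calls together so the restore runs even when the load fails. The sketch below assumes a hypothetical put_settings(index, body) callable that wraps PUT /<index>/_settings in your client of choice; the restore values of "1s" and 1 replica are assumptions and should match what the index had before.

```python
from contextlib import contextmanager


@contextmanager
def bulk_load_mode(put_settings, index, restore_refresh="1s", restore_replicas=1):
    """Disable refresh and replicas for a bulk load, then restore them afterwards."""
    put_settings(index, {"index.refresh_interval": "-1",
                         "index.number_of_replicas": 0})
    try:
        yield
    finally:
        # Runs on success and on failure, so replicas are never left at 0.
        put_settings(index, {"index.refresh_interval": restore_refresh,
                             "index.number_of_replicas": restore_replicas})


calls = []


def record(index, body):
    """Recording stub used in place of a real HTTP call."""
    calls.append((index, body))


with bulk_load_mode(record, "events"):
    pass  # the bulk indexing would happen here
```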
6. Bonus: replica = 0 in production
With number_of_replicas set to 0, every shard exists as a single copy, so losing one node = losing data. We’ve seen teams disable replicas “for performance” and forget to re-enable them. Always run with number_of_replicas >= 1 in production. Use replicas for HA, not just read throughput.
7. Bonus: searching across hundreds of indices
GET /logs-*/_search looks innocent but can fan out to thousands of shards on a cluster with daily indices. Each shard adds coordinator overhead. Mitigations:
- Use _search with a tight date range and index name patterns that prune by time
- Consider rollover + ILM (Index Lifecycle Management) to consolidate old data
- Use pre_filter_shard_size so the coordinator skips shards that can’t match
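Pruning by index name can be as simple as expanding the query's date range into concrete index names instead of sending logs-*. The sketch below assumes daily indices named logs-YYYY.MM.DD; both the prefix and the date format are assumptions about the naming scheme, not something Elasticsearch enforces.

```python
from datetime import date, timedelta


def daily_indices(start, end, prefix="logs-", fmt="%Y.%m.%d"):
    """List the daily index names covering start..end inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days
    return [prefix + (start + timedelta(days=i)).strftime(fmt)
            for i in range(days + 1)]


names = daily_indices(date(2024, 2, 28), date(2024, 3, 1))
search_path = "/" + ",".join(names) + "/_search"
```

If some days may have no index, the request will need ignore_unavailable=true so that missing names do not fail the whole search.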
The meta-lesson: most Elasticsearch pitfalls aren’t bugs in Elasticsearch — they’re defaults that work great at small scale and bite at large scale. Know which knobs change with traffic, and turn them before the page.