Common Pitfalls - Elasticsearch

A loaded interview question: “What’s gone wrong on an Elasticsearch cluster you’ve worked with?” Here are the most common production landmines, what causes them, and how to dodge each.

Pitfall → Consequence → Fix

Pitfall	Consequence	Fix
Mapping explosion	OOM, slow cluster state	Disable dynamic mapping; use `flattened`
Deep pagination	10k window, OOM coordinator	Use `search_after` + PIT
Hot shards	Uneven CPU, slow tail latency	Better routing keys, more shards
Oversized docs	Slow indexing, GC churn	Strip blobs, split into child docs
Refresh misuse	Throughput collapse	Raise `refresh_interval`, never `refresh=true`

1. Mapping explosion

Dynamic mapping auto-creates fields when it sees new JSON keys. Index 100k docs where each has a unique key (think event_id_abc123: { ... }) and you’ve got 100k fields. Each field uses memory in the cluster state, which is replicated to every node on every change.

// Bad: a free-form JSON blob with dynamic mapping
{ "properties": { "user_attributes": { "type": "object" } } }

Fixes:

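Lock down dynamic mapping with "dynamic": "strict" so a document with an unexpected field is rejected instead of silently growing the mapping. ("dynamic": false is the softer option: unmapped fields stay in _source but are not indexed.)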
For genuinely open-ended data, use the flattened field type — it treats the whole object as a single field.

{ "properties": { "user_attributes": { "type": "flattened" } } }

Safety limit: index.mapping.total_fields.limit defaults to 1000 fields per index. Hitting it means you’ve already lost; raising it only hides the problem.
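A minimal sketch of how you might keep an eye on field count before it becomes a problem. The mapping below is hypothetical (the field names are made up for illustration), and count_fields is only an approximation of how Elasticsearch counts towards the limit: it counts fields, object containers, and multi-fields.

```python
import json

FIELD_LIMIT = 1000  # default index.mapping.total_fields.limit, per the note above


def count_fields(properties: dict) -> int:
    """Count mapped fields, recursing into object sub-properties and multi-fields."""
    total = 0
    for spec in properties.values():
        total += 1  # the field (or object container) itself
        total += count_fields(spec.get("properties", {}))
        total += count_fields(spec.get("fields", {}))
    return total


# Hypothetical mapping: known fields are explicit, the open-ended blob is
# flattened, and "dynamic": "strict" rejects documents with unmapped keys.
events_mapping = {
    "dynamic": "strict",
    "properties": {
        "event_id": {"type": "keyword"},
        "created_at": {"type": "date"},
        "user": {
            "properties": {
                "id": {"type": "keyword"},
                "name": {"type": "text", "fields": {"raw": {"type": "keyword"}}},
            }
        },
        "user_attributes": {"type": "flattened"},
    },
}

used = count_fields(events_mapping["properties"])
print(f"{used} of {FIELD_LIMIT} fields used")
print(json.dumps(events_mapping, indent=2))
```

However many keys land inside user_attributes, it still counts as one field here, which is the whole point of flattened.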

2. Deep pagination

Covered in detail in the pagination note. Short version: from + size > 10000 is rejected by default (index.max_result_window), and even before that, every shard has to collect from + size hits and hand them to the coordinator to be merged, so cost grows with page depth. Use search_after with a PIT. Don’t raise max_result_window as a “fix”.
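A rough sketch of the search_after + PIT loop. The search argument is a hypothetical callable that sends the body to POST /_search and returns the parsed JSON response; the @timestamp sort field is also an assumption. _shard_doc is the tiebreaker sort available on PIT searches.

```python
from typing import Callable, Iterator


def iter_hits(search: Callable[[dict], dict], pit_id: str,
              page_size: int = 1000) -> Iterator[dict]:
    """Yield every hit by walking pages with search_after inside one PIT."""
    body = {
        "size": page_size,
        "pit": {"id": pit_id, "keep_alive": "1m"},
        # Assumed sort field, plus _shard_doc as a tiebreaker.
        "sort": [{"@timestamp": "asc"}, {"_shard_doc": "asc"}],
    }
    while True:
        response = search(body)
        hits = response["hits"]["hits"]
        if not hits:
            return  # an empty page means we have walked past the last hit
        yield from hits
        # Resume after the last hit's sort values; no "from" is ever sent.
        body["search_after"] = hits[-1]["sort"]
        # The response may carry a refreshed PIT id; prefer it when present.
        body["pit"]["id"] = response.get("pit_id", body["pit"]["id"])
```

Each page costs roughly the same no matter how deep you are, because every shard only has to find the next page_size hits after the given sort values.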

3. Hot shards

Custom routing or natural data skew (one tenant = 80% of traffic) leads to one shard being a CPU bottleneck while siblings idle. Symptoms: high p99 latency, uneven hot_threads output across nodes.

Diagnose:

GET /_cat/shards?v&s=store:desc
GET /_nodes/hot_threads

Fixes:

Compound the routing key (tenant_id + "_" + shard_bucket)
Set index.routing_partition_size (at index creation) so one routing value fans out to a subset of shards instead of a single shard
For time-series, switch to data streams with rollover so writes always go to the newest index
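One way the compound routing key could look, as a sketch. NUM_BUCKETS and the choice of hashing the document id are assumptions made for illustration; the only part taken from the list above is the tenant_id + "_" + shard_bucket shape.

```python
import zlib

NUM_BUCKETS = 4  # assumption: how many sub-buckets a large tenant is split into


def routing_key(tenant_id: str, doc_id: str, buckets: int = NUM_BUCKETS) -> str:
    """Build tenant_id + "_" + shard_bucket, with the bucket derived from the doc id."""
    # crc32 is stable across processes, unlike Python's built-in hash() on strings.
    bucket = zlib.crc32(doc_id.encode("utf-8")) % buckets
    return f"{tenant_id}_{bucket}"


def search_routing(tenant_id: str, buckets: int = NUM_BUCKETS) -> str:
    """Comma-separated routing values that cover every bucket of one tenant."""
    return ",".join(f"{tenant_id}_{b}" for b in range(buckets))
```

Writes would pass routing_key(...) as the routing parameter, and tenant-wide searches would pass search_routing(...), since the search routing parameter accepts several comma-separated values. The trade-off is that a tenant's search now touches up to NUM_BUCKETS shards instead of one.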

4. Oversized documents

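Stuffing a 5 MB PDF as base64 into a doc field is asking for pain. The whole body has to be parsed at index time, the bloated _source has to be loaded and returned every time the doc comes back as a hit, and network transfer is slow in both directions.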

Rules of thumb:

Aim for < 100 KB per doc for search workloads
Strip binaries before indexing — put them in S3 and store the URL
For genuinely large arrays (think comments: [...] with thousands of entries), consider a parent-child (join field) model or splitting the entries into separate docs

The http.max_content_length setting caps HTTP request size at 100 MB by default, but you should never approach that.
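A small sketch of the "strip binaries, store the URL" step as a pre-indexing check. The helper names, the <field>_url naming convention, and the example URL are all hypothetical; the size estimate is simply the UTF-8 length of the JSON body.

```python
import json

MAX_DOC_BYTES = 100 * 1024  # the ~100 KB rule of thumb from the list above


def strip_blobs(doc: dict, blob_urls: dict) -> dict:
    """Return a copy of doc where each blob field is replaced by a <field>_url pointer."""
    slim = {k: v for k, v in doc.items() if k not in blob_urls}
    for field, url in blob_urls.items():
        slim[f"{field}_url"] = url
    return slim


def doc_size_bytes(doc: dict) -> int:
    """Approximate the request size as the UTF-8 length of the JSON body."""
    return len(json.dumps(doc).encode("utf-8"))


def is_oversized(doc: dict, limit: int = MAX_DOC_BYTES) -> bool:
    """True when the serialized doc exceeds the rule-of-thumb limit."""
    return doc_size_bytes(doc) > limit


# Usage: the base64 payload is uploaded elsewhere first, then only the URL is indexed.
raw = {"title": "Q3 report", "pdf": "A" * 200_000}
slim = strip_blobs(raw, {"pdf": "s3://example-bucket/reports/q3.pdf"})
print(is_oversized(raw), is_oversized(slim))
```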

5. Refresh interval misuse

POST /events/_doc?refresh=true forces a refresh on every single write. It usually starts in tests, where you want a doc to be searchable immediately, and leaks into prod code. Each refresh creates a new tiny Lucene segment, kicks off merges, and tanks throughput.

Symptoms: writes work fine in dev (low volume), crawl in prod (high volume).

# Wrong (production)
POST /events/_doc?refresh=true
{ ... }

# Right
POST /events/_doc
{ ... }
# Trust the 1-second default, or use refresh=wait_for if you must

For bulk-load jobs, go further:

PUT /events/_settings
{ "index.refresh_interval": "-1", "index.number_of_replicas": 0 }

Then restore both settings after the load. Until you do, newly indexed docs are not searchable and the index has no redundancy.
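The "forget to restore" failure mode is easy to guard against with a context manager. This is only a sketch: put_settings is a hypothetical callable that would send the dict to PUT /events/_settings, and the restore values shown are assumptions, so pass whatever the index actually ran with before the load.

```python
from contextlib import contextmanager
from typing import Callable

# The bulk-load settings from the snippet above.
BULK_SETTINGS = {"index.refresh_interval": "-1", "index.number_of_replicas": 0}


@contextmanager
def bulk_load_settings(put_settings: Callable[[dict], None], restore: dict):
    """Apply bulk-load settings, then restore the given values even if the load fails."""
    put_settings(BULK_SETTINGS)
    try:
        yield
    finally:
        # Runs on success and on exception, so replicas never stay at 0 by accident.
        put_settings(restore)


# Usage with a stand-in that just records what would be sent.
sent = []
with bulk_load_settings(sent.append,
                        {"index.refresh_interval": "1s", "index.number_of_replicas": 1}):
    pass  # run the bulk job here
```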

6. Bonus: replica = 0 in production

A zero-replica setup (primaries only) means losing one node = losing data, because there is no second copy of any shard. We’ve seen teams disable replicas “for performance” and forget to re-enable them. Always run with number_of_replicas >= 1 in production. Use replicas for HA, not just read throughput.
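A quick audit can catch forgotten zero-replica indices. The sketch below assumes you already have the parsed JSON from GET /_all/_settings, which is keyed by index name with values under settings.index; the function name and the sample data are made up.

```python
def indices_without_replicas(settings_response: dict) -> list:
    """Return the names of indices whose number_of_replicas is 0, sorted."""
    flagged = []
    for name, body in settings_response.items():
        # Fall back to "1" if the key is absent from the response.
        replicas = body["settings"]["index"].get("number_of_replicas", "1")
        # The settings API returns values as strings, so normalise before comparing.
        if int(replicas) == 0:
            flagged.append(name)
    return sorted(flagged)


# Sample response shape with hypothetical index names.
sample = {
    "events-000001": {"settings": {"index": {"number_of_replicas": "1"}}},
    "events-backfill": {"settings": {"index": {"number_of_replicas": "0"}}},
}
print(indices_without_replicas(sample))
```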

7. Bonus: searching across hundreds of indices

GET /logs-*/_search looks innocent but can fan out to thousands of shards on a cluster with daily indices. Each shard adds coordinator overhead. Mitigations:

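Search a narrow index pattern (for example one month of daily indices rather than logs-*) together with a tight date range filter, so indices outside the window are never touched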
Consider rollover + ILM (Index Lifecycle Management) to consolidate old data
Tune pre_filter_shard_size, the shard-count threshold above which the coordinator runs a cheap pre-filter round trip and skips shards that can’t match
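To make "index name patterns that prune time" concrete, here is a sketch that turns a date range into an explicit list of daily indices. The logs-YYYY.MM.DD naming scheme is an assumption; adjust the prefix and format to whatever your indices actually use.

```python
from datetime import date, timedelta


def daily_indices(start: date, end: date, prefix: str = "logs-") -> str:
    """Comma-separated daily index names covering start..end inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days + 1
    return ",".join(
        f"{prefix}{(start + timedelta(days=i)).strftime('%Y.%m.%d')}"
        for i in range(days)
    )


# Three days of indices instead of every index matching logs-*.
target = daily_indices(date(2024, 2, 28), date(2024, 3, 1))
print(f"GET /{target}/_search?ignore_unavailable=true")
```

ignore_unavailable=true keeps the search from failing if one of the computed daily indices does not exist. For long ranges the URL gets unwieldy, so a coarser wildcard per month may be the better compromise.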

The meta-lesson: most Elasticsearch pitfalls aren’t bugs in Elasticsearch — they’re defaults that work great at small scale and bite at large scale. Know which knobs change with traffic, and turn them before the page.