Elasticsearch
Search, indexing, query DSL, aggregations, and scaling concepts for Elasticsearch interviews.
Fundamentals
What is Elasticsearch & When to use it
Elasticsearch is a distributed search and analytics engine built for fast full-text search and aggregations.
Inverted Index
The core data structure that makes Elasticsearch fast — a map from each term to the documents containing it.
Cluster, Node, Index, Document
The four nouns you need to talk about Elasticsearch — from a single JSON doc up to a whole cluster.
Shards & Replicas
How Elasticsearch splits an index across machines and keeps copies for fault tolerance.
Document Structure
What's inside a returned Elasticsearch document — _id, _source, _index, and the metadata fields you'll see in every response.
Indexing & Mapping
Index Creation & Settings
How to create an index with the right shards, replicas, analyzers, and mappings from day one.
Mapping: Dynamic vs Explicit
Letting ES guess your schema vs declaring it upfront — the difference between a prototype and a production index.
Field Data Types
text vs keyword (the most common ES interview question), plus numeric, date, object, nested, ip, and friends.
Analyzers, Tokenizers & Token Filters
How raw text becomes searchable tokens — character filters, tokenizer, token filters.
Index Templates & Aliases
Two production patterns: templates for auto-applying settings to new indices, aliases for zero-downtime reindexing and log rotation.
Query DSL
Match vs Term Query
The classic ES interview question — when does Elasticsearch analyze your search input, and when does it look for an exact byte-for-byte match?
Bool Query
The Swiss army knife of Elasticsearch: combining must, should, must_not, and filter clauses, and why filter is usually faster than must (it skips scoring and its results can be cached).
Range, Exists, Wildcard, Prefix & Regex Queries
The utility belt of Query DSL — querying numeric/date ranges, checking field existence, and doing pattern matching on keyword fields.
Fuzzy & Multi-match Queries
Handling typos with Levenshtein distance, and searching across multiple fields in a single query.
Compound Queries & Function Score
When default BM25 isn't enough — boosting, decay functions, and writing custom scoring logic on top of search results.
Full-text vs Term-level Queries — When to use which
The mental model for choosing between analyzed full-text queries and exact term-level queries. Picking the wrong kind is one of the most common mistakes in ES.
Aggregations
Bucket Aggregations
Grouping documents into buckets — like SQL's GROUP BY but more flexible. terms, date_histogram, range, and filters aggregations.
Metric Aggregations
Computing numbers across docs — avg, sum, min, max, stats, percentiles, cardinality. The SUM and COUNT of Elasticsearch.
Pipeline & Nested Aggregations
Aggregations that operate on other aggregations — moving averages, derivatives, bucket selectors. Plus aggs on nested fields.
Sub-aggregations
The killer feature of ES aggregations — nesting metrics inside buckets, and buckets inside buckets. The standard analytics pattern.
Search Features
Relevance & Scoring (TF-IDF, BM25)
How Elasticsearch computes _score and why BM25 replaced TF-IDF as the default.
Pagination: from/size vs scroll vs search_after
Why deep pagination with from/size strains a cluster, and how to page through large result sets the right way.
Highlighting & Suggesters
Bolding matched terms in results and powering autocomplete/typeahead.
Performance, Scaling & Ops
Sharding Strategy & Routing
How docs map to shards, why the primary shard count (number_of_shards) is fixed at index creation, and how to use custom routing.
Refresh, Flush & Near-Real-Time Search
Why ES is near-real-time, not real-time — the journey from in-memory buffer to durable disk.
Bulk API & Reindexing
High-throughput indexing and zero-downtime reindexing using aliases.
Cluster Health & Snapshots
Green/yellow/red cluster states and backup/restore with snapshot repositories.
Common Pitfalls
Mapping explosion, deep pagination, hot shards, oversized docs, refresh misuse — and how to avoid them.