If there’s one thing to memorize about Elasticsearch, it’s this: it uses an inverted index. In simple language, an inverted index is a map from each word to the list of documents that contain it. That’s it. That’s the secret sauce.
A forward layout goes “doc 1 has these words”, the way a table row in a database like Postgres holds its own column values. An inverted index flips it around: “this word lives in docs 1, 7, and 42.” Hence “inverted.”
Why does this matter?
When we search for “wireless headphones”, ES doesn’t scan every document. It looks up “wireless” in the inverted index and gets back a list of doc IDs directly, so the work depends on how many docs match rather than on the size of the whole collection. Same for “headphones”. Then it intersects or unions the lists (AND vs. OR), scores the matches, and returns the top hits.
Think of it like the index at the back of a textbook. You don’t read the whole book to find “mitochondria” — you flip to the index, see “page 247”, and jump straight there.
Let’s build one by hand
Say we index three documents:
Doc 1: { "title": "Sony wireless headphones with noise cancellation" }
Doc 2: { "title": "Bose noise cancelling headphones" }
Doc 3: { "title": "Apple wireless earbuds" }
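ES first runs each title through an analyzer. For this example, assume one that lowercases, splits on whitespace, and drops stopwords such as “with”. (The default standard analyzer lowercases and splits on word boundaries, but it keeps stopwords unless you configure a stop list.) Then it builds this: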
| Term | Doc IDs | Doc freq |
|---|---|---|
| apple | [3] | 1 |
| bose | [2] | 1 |
| cancellation | [1] | 1 |
| cancelling | [2] | 1 |
| earbuds | [3] | 1 |
| headphones | [1, 2] | 2 |
| noise | [1, 2] | 2 |
| sony | [1] | 1 |
| wireless | [1, 3] | 2 |
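To make the table less magical, here is a minimal Python sketch that rebuilds it from the three titles. The `analyze` function and the `STOPWORDS` set are made-up stand-ins for a real ES analyzer (which does a lot more), so treat this as an illustration of the idea, not of how Lucene stores things on disk.

```python
from collections import defaultdict

# Assumption: a tiny hand-picked stopword list, just enough for this example.
STOPWORDS = {"with", "the", "a", "an", "and", "of"}

DOCS = {
    1: "Sony wireless headphones with noise cancellation",
    2: "Bose noise cancelling headphones",
    3: "Apple wireless earbuds",
}

def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    # Sort terms alphabetically so the output lines up with the table.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

index = build_inverted_index(DOCS)
for term, doc_ids in index.items():
    print(f"{term:<13} {doc_ids}  doc_freq={len(doc_ids)}")
```

The doc frequency column is simply the length of each posting list, which is why it never needs to be counted separately here.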
Now when we search wireless headphones:
- Look up wireless → [1, 3]
- Look up headphones → [1, 2]
- Union (OR) → [1, 2, 3]. Intersection (AND) → [1].
- Score each hit using TF-IDF / BM25 (how rare is the term, how often does it appear in the doc).
Doc 1 matches both terms — it ranks first.
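The lookup and combine steps can be sketched in a few lines of Python. The `INDEX` dict is copied from the table above, and `search` is a hypothetical helper for this note (not an ES API); scoring is left out to keep the focus on the set operations.

```python
# Posting lists copied from the table: term -> sorted doc IDs.
INDEX = {
    "apple": [3], "bose": [2], "cancellation": [1], "cancelling": [2],
    "earbuds": [3], "headphones": [1, 2], "noise": [1, 2],
    "sony": [1], "wireless": [1, 3],
}

def search(index, query, operator="or"):
    """Return sorted doc IDs matching the query terms under OR or AND."""
    # One posting set per query term; unknown terms get an empty set.
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    if operator == "and":
        hits = set.intersection(*postings)
    else:
        hits = set.union(*postings)
    return sorted(hits)

or_hits = search(INDEX, "wireless headphones", "or")
and_hits = search(INDEX, "wireless headphones", "and")
print(or_hits)   # [1, 2, 3]
print(and_hits)  # [1]
```

Notice that nothing here ever reads a document body. The whole query is answered from the posting lists.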
What gets stored besides doc IDs
For each term, ES can also store:
- Term frequency (TF): how many times the term appears in that doc
- Document frequency (DF): how many docs contain the term overall
- Positions: where the term appears (needed for phrase queries like "noise cancelling")
- Offsets: character offsets (used for highlighting; only kept in the postings when the field's index_options is set to offsets)
These let ES compute a relevance score, not just a yes/no match.
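Here is a rough Python sketch of why the extra data matters: positions make phrase matching possible, and TF plus DF feed a BM25-style score. The `bm25` function uses the textbook formula with `k1=1.2` and `b=0.75` (the ES defaults); Lucene's real implementation differs in details, and positions here are counted after stopword removal, which is a simplification. All names are invented for this example.

```python
import math
from collections import defaultdict

STOPWORDS = {"with"}  # assumption: tiny stopword list for this example

DOCS = {
    1: "Sony wireless headphones with noise cancellation",
    2: "Bose noise cancelling headphones",
    3: "Apple wireless earbuds",
}

def analyze(text):
    return [t for t in text.lower().split() if t not in STOPWORDS]

# postings[term][doc_id] = list of token positions of the term in that doc
postings = defaultdict(dict)
doc_len = {}
for doc_id, text in DOCS.items():
    tokens = analyze(text)
    doc_len[doc_id] = len(tokens)
    for pos, term in enumerate(tokens):
        postings[term].setdefault(doc_id, []).append(pos)

def phrase_match(phrase):
    """Doc IDs where the phrase terms occur at consecutive positions."""
    terms = analyze(phrase)
    if not terms or any(t not in postings for t in terms):
        return []
    candidates = set.intersection(*(set(postings[t]) for t in terms))
    hits = []
    for doc_id in sorted(candidates):
        starts = postings[terms[0]][doc_id]
        if any(all(s + i in postings[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return hits

def bm25(query, doc_id, k1=1.2, b=0.75):
    """Textbook BM25: sum over query terms of idf * saturated tf."""
    n_docs = len(DOCS)
    avgdl = sum(doc_len.values()) / n_docs
    score = 0.0
    for term in analyze(query):
        tf = len(postings.get(term, {}).get(doc_id, []))  # term frequency
        if tf == 0:
            continue
        df = len(postings[term])                          # document frequency
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = k1 * (1 - b + b * doc_len[doc_id] / avgdl)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score

ranking = sorted(DOCS, key=lambda d: bm25("wireless headphones", d), reverse=True)
print(phrase_match("noise cancelling"))  # [2]
print(ranking)                           # [1, 3, 2]
```

Doc 1 wins because it matches both terms. Doc 3 edges out doc 2 only because its title is shorter, which is the length normalization part of BM25 at work.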
Why “keyword” fields skip analysis
Important nuance: only text fields get analyzed and tokenized into individual words before they go into the inverted index. keyword fields are still indexed, but the whole string goes in as a single term. That’s why a search for “wireless headphones” can match a text field that contains those words among others, while a keyword field only matches when its entire value is exactly that string. We’ll dig into that in the field types note.
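A toy Python comparison shows the difference. `index_text` and `index_keyword` are hypothetical helpers that only mimic which terms end up in the index for each field type; they are not ES functions.

```python
def index_text(value):
    """text-style: one term per word (lowercased, split on whitespace)."""
    return set(value.lower().split())

def index_keyword(value):
    """keyword-style: the whole string, unchanged, is the only term."""
    return {value}

def matches_all(indexed_terms, query_terms):
    """True if every query term is present among the indexed terms."""
    return all(q in indexed_terms for q in query_terms)

title = "Sony wireless headphones"
text_terms = index_text(title)        # {'sony', 'wireless', 'headphones'}
keyword_terms = index_keyword(title)  # {'Sony wireless headphones'}

# The query is broken into terms the same way the field was.
text_hit = matches_all(text_terms, index_text("wireless headphones"))
keyword_partial = matches_all(keyword_terms, index_keyword("wireless headphones"))
keyword_exact = matches_all(keyword_terms, index_keyword("Sony wireless headphones"))

print(text_hit)         # True
print(keyword_partial)  # False
print(keyword_exact)    # True
```

So keyword fields suit exact-value work like filtering on a status or a SKU, while free-text search needs a text field.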
Trade-off
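Building the inverted index takes work at write time: every document has to be analyzed and its terms added to the index before it becomes searchable. That’s one reason ES isn’t a great fit for high-velocity OLTP writes. Reads are the payoff: a query turns into a handful of term lookups instead of a scan over every document.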