Inverted Index - Elasticsearch

If there’s one thing to memorize about Elasticsearch, it’s this: it uses an inverted index. In simple language, an inverted index is a map from each word to the list of documents that contain it. That’s it. That’s the secret sauce.

A forward layout (like a row in a Postgres table) goes “doc 1 has these words.” An inverted index flips it around: “this word lives in docs 1, 7, and 42.” Hence “inverted.”

Why does this matter?

When we search for “wireless headphones”, ES doesn’t scan every document. It looks up “wireless” in the inverted index and gets back a list of doc IDs without touching the documents themselves. Same for “headphones”. Then it intersects (AND) or unions (OR) the lists, scores the matches, and returns the top hits.

Think of it like the index at the back of a textbook. You don’t read the whole book to find “mitochondria” — you flip to the index, see “page 247”, and jump straight there.

Let’s build one by hand

Say we index three documents:

Doc 1: { "title": "Sony wireless headphones with noise cancellation" }
Doc 2: { "title": "Bose noise cancelling headphones" }
Doc 3: { "title": "Apple wireless earbuds" }

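ES first runs each title through an analyzer. In this example the analyzer lowercases the text, splits it into words, and drops stopwords such as “with”. (The default standard analyzer lowercases and splits on word boundaries, but it keeps stopwords unless you configure a stop filter.) Then it builds this: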

Inverted Index for the "title" field

Term	Doc IDs	Doc freq
apple	[3]	1
bose	[2]	1
cancellation	[1]	1
cancelling	[2]	1
earbuds	[3]	1
headphones	[1, 2]	2
noise	[1, 2]	2
sony	[1]	1
wireless	[1, 3]	2
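
To make the table less magical, here is a minimal Python sketch that rebuilds it. This is not how Lucene does it internally. The `analyze` function and the `STOPWORDS` set are made up for this note, and they only imitate the lowercase / split / drop-stopwords steps described above.

```python
from collections import defaultdict

# Hypothetical stopword list for this sketch; real analyzers ship their own.
STOPWORDS = {"with", "the", "a", "an", "and", "of"}

def analyze(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    # Sort terms alphabetically, like the table in this note.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

docs = {
    1: "Sony wireless headphones with noise cancellation",
    2: "Bose noise cancelling headphones",
    3: "Apple wireless earbuds",
}

index = build_inverted_index(docs)
for term, doc_ids in index.items():
    # Columns: term, doc IDs, document frequency
    print(term, doc_ids, len(doc_ids), sep="\t")
```

Each row printed is one term, its posting list, and the length of that list (the document frequency).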

Now when we search wireless headphones:

Look up wireless → [1, 3]
Look up headphones → [1, 2]
Union (OR) → [1, 2, 3]. Intersection (AND) → [1].
Score each hit with BM25, the default similarity in current ES versions and a refinement of TF-IDF (how rare is the term, how often does it appear in the doc, how long is the doc).

Doc 1 is the only doc that matches both terms, so it ranks first.
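
The lookup, combine, and score steps can be sketched in a few lines of Python. Treat this as an illustration under assumptions: the postings are hard-coded from the table, the scoring uses the textbook BM25 formula with the common defaults k1 = 1.2 and b = 0.75, and the names `POSTINGS`, `DOC_LENGTHS`, `candidates`, `bm25`, and `search` are invented for this note. Lucene's real scorer differs in its details, so the absolute numbers will not match what ES reports.

```python
import math

# Postings copied from the table: term -> {doc_id: term frequency}.
POSTINGS = {
    "wireless": {1: 1, 3: 1},
    "headphones": {1: 1, 2: 1},
}
# Token counts per doc after analysis ("with" was dropped from doc 1).
DOC_LENGTHS = {1: 5, 2: 4, 3: 3}

K1, B = 1.2, 0.75  # common BM25 defaults (assumed here)

def candidates(terms, operator="or"):
    """Combine posting lists: union for OR, intersection for AND."""
    doc_sets = [set(POSTINGS.get(term, {})) for term in terms]
    if not doc_sets:
        return set()
    if operator == "and":
        return set.intersection(*doc_sets)
    return set.union(*doc_sets)

def bm25(term, doc_id):
    """Textbook BM25 contribution of one term to one doc's score."""
    tf = POSTINGS.get(term, {}).get(doc_id, 0)
    if tf == 0:
        return 0.0
    n_docs = len(DOC_LENGTHS)
    avg_len = sum(DOC_LENGTHS.values()) / n_docs
    df = len(POSTINGS[term])
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    length_norm = 1 - B + B * DOC_LENGTHS[doc_id] / avg_len
    return idf * tf * (K1 + 1) / (tf + K1 * length_norm)

def search(query, operator="or"):
    """Return (doc_id, score) pairs, best match first."""
    terms = query.lower().split()
    hits = candidates(terms, operator)
    scored = [(doc_id, sum(bm25(t, doc_id) for t in terms)) for doc_id in hits]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for doc_id, score in search("wireless headphones"):
    print(doc_id, round(score, 3))
```

With OR semantics all three docs come back, and doc 1 lands on top because it collects a score contribution from both terms.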

What gets stored besides doc IDs

Alongside the doc IDs, ES can also store:

Term frequency (TF): how many times the term appears in a given doc
Document frequency (DF): how many docs contain the term overall
Positions: where the term appears in the token stream (needed for phrase queries like "noise cancelling")
Offsets: start and end character offsets (used for highlighting; not stored in the postings unless the field's index_options is set to offsets)

These let ES compute a relevance score, not just a yes/no match.
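
Here is a rough Python sketch of what "positions and offsets" buy you. It is a toy, not Lucene's on-disk format: the helper names are invented for this note, and the tokenizer only splits on whitespace and keeps stopwords so that positions stay easy to follow.

```python
import re
from collections import defaultdict

def tokenize_with_positions(text):
    """Yield (term, position, start_offset, end_offset) for each token."""
    for position, match in enumerate(re.finditer(r"\S+", text)):
        yield match.group().lower(), position, match.start(), match.end()

def build_positional_index(docs):
    """term -> {doc_id: [(position, start_offset, end_offset), ...]}"""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, pos, start, end in tokenize_with_positions(text):
            index[term].setdefault(doc_id, []).append((pos, start, end))
    return index

def phrase_match(index, phrase):
    """Return doc IDs where the phrase terms appear at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return []
    hits = []
    for doc_id, entries in index.get(terms[0], {}).items():
        for pos, _, _ in entries:
            # Every following term must sit exactly i positions later.
            if all(
                any(p == pos + i for p, _, _ in index.get(t, {}).get(doc_id, []))
                for i, t in enumerate(terms[1:], start=1)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)

docs = {
    1: "Sony wireless headphones with noise cancellation",
    2: "Bose noise cancelling headphones",
    3: "Apple wireless earbuds",
}
index = build_positional_index(docs)

tf = len(index["noise"][1])        # term frequency of "noise" in doc 1
df = len(index["noise"])           # document frequency of "noise"
_, start, end = index["noise"][2][0]
print(tf, df, docs[2][start:end])  # offsets point back into the original text
print(phrase_match(index, "noise cancelling"))
```

The phrase check only passes when the terms are adjacent, which is why doc IDs alone cannot answer a phrase query. The offsets slice the original string, which is the information a highlighter needs.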

Why “keyword” fields skip this

Important nuance: only text fields get analyzed and tokenized into individual words before indexing. keyword fields still go into an inverted index, but the whole string is stored as a single, unanalyzed term. That’s why a search for “wireless” matches inside a text field, while a keyword field only matches when the query equals the entire stored value. We’ll dig into that in the field types note.
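
A small Python sketch makes the difference concrete. The two analyzer functions are stand-ins invented for this note: one imitates a text field by splitting into lowercase words, the other imitates a keyword field by emitting the whole string as one term.

```python
from collections import defaultdict

def analyze_text(value):
    """Imitates a text field: lowercase and split into words."""
    return value.lower().split()

def analyze_keyword(value):
    """Imitates a keyword field: the whole string is one unchanged term."""
    return [value]

def build_index(values, analyzer):
    """Map each term produced by the analyzer to sorted doc IDs."""
    postings = defaultdict(set)
    for doc_id, value in values.items():
        for term in analyzer(value):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def lookup(index, term):
    """Exact term lookup; no match returns an empty list."""
    return index.get(term, [])

titles = {
    1: "Sony wireless headphones with noise cancellation",
    2: "Bose noise cancelling headphones",
    3: "Apple wireless earbuds",
}

text_index = build_index(titles, analyze_text)
keyword_index = build_index(titles, analyze_keyword)

print(lookup(text_index, "wireless"))                   # word-level match
print(lookup(keyword_index, "wireless"))                # no such whole-string term
print(lookup(keyword_index, "Apple wireless earbuds"))  # exact value match
```

The keyword index has exactly one term per document, so only the full, case-exact string finds anything.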

Trade-off

Building the inverted index takes work at write time: every document is analyzed, and new terms land in immutable segments that get merged in the background. That’s one reason ES isn’t a great fit for high-velocity OLTP writes. But reads? Blazing fast.
