Search and Indexing - High-Level Design

We’ve all used search. Type a few words, get results in milliseconds. But how does it work under the hood? General-purpose databases are surprisingly bad at free-text search. Let’s see why, and what we use instead.

Why Databases Are Bad at Search

If we want to find all products containing the word “wireless” in their description, a SQL LIKE '%wireless%' query seems fine. But here’s the problem:

It does a full table scan: every single row is checked
A standard B-tree index cannot help with a leading wildcard (%wireless%)
Case-insensitive matching, typos, synonyms, and relevance ranking are not handled at all
With millions of rows, this can take seconds. Users expect milliseconds.
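
To make the scan visible, here is a small sketch that asks a database how it plans to run each query. It uses Python's built-in sqlite3 module purely as a convenient stand-in for "a SQL database"; the products table, the idx_description index, and the sample rows are made up for illustration, and other databases word their plans differently.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for "a SQL database".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, description TEXT)")
conn.execute("CREATE INDEX idx_description ON products (description)")
conn.executemany(
    "INSERT INTO products (description) VALUES (?)",
    [("wireless bluetooth headphones",), ("wired mouse",), ("USB wireless charger",)],
)

def query_plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable step description is the last column of each row.
    return " | ".join(row[-1] for row in rows)

# Exact match: the planner can jump straight to the matching index entry.
exact_plan = query_plan("SELECT id FROM products WHERE description = 'wired mouse'")
# Leading wildcard: the planner has to walk every entry and test each one.
wildcard_plan = query_plan("SELECT id FROM products WHERE description LIKE '%wireless%'")

print(exact_plan)     # typically mentions SEARCH ... USING ... INDEX
print(wildcard_plan)  # typically mentions SCAN
```

The exact wording of the plan depends on the SQLite version, but the distinction between a SEARCH (index lookup) and a SCAN (visit everything) is the point.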

Databases are built for structured queries (find user where id = 5). Search engines are built for unstructured text queries (find documents that best match “wireless bluetooth headphones”).

What Is an Inverted Index?

This is the core data structure behind full-text search engines. Think of it like the index at the back of a textbook: instead of reading every page to find “photosynthesis,” we look it up in the index and go straight to pages 47, 112, and 203.

An inverted index maps every word to the list of documents that contain it.

Documents:

Doc 1: "fast red car"
Doc 2: "fast blue bike"
Doc 3: "red bike sale"

Inverted Index:

"fast" → [Doc 1, Doc 2]
"red" → [Doc 1, Doc 3]
"car" → [Doc 1]
"blue" → [Doc 2]
"bike" → [Doc 2, Doc 3]
"sale" → [Doc 3]

Search "fast red" → intersect [Doc 1, Doc 2] ∩ [Doc 1, Doc 3] = [Doc 1]

When someone searches “fast red,” we look up both words in the index, find the intersection of document lists, and return the results. No scanning required. This is why search engines are fast.
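
The same idea fits in a few lines of Python. This is a minimal sketch rather than how a real engine stores its index: build_inverted_index and search are names chosen for this example, splitting on whitespace is the only "analysis" done, and a query is treated as "all words must match".

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def search(index, query):
    """Return IDs of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    # One posting list per query word; a missing word yields an empty set.
    postings = [set(index.get(word, [])) for word in words]
    return sorted(set.intersection(*postings))

docs = {1: "fast red car", 2: "fast blue bike", 3: "red bike sale"}
index = build_inverted_index(docs)

print(index["fast"])               # [1, 2]
print(search(index, "fast red"))   # [1]
```

Note that search never looks at the document text. It only touches the posting lists for the words in the query, which is why the cost depends on how many documents match rather than on how many exist.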

How Elasticsearch Works

Elasticsearch (ES) is one of the most widely used search engines for backend systems. It’s built on Apache Lucene, which does the actual indexing and searching. ES wraps Lucene with a distributed system, a REST API, and cluster management.

Key Concepts

Index — A collection of related documents. Think of it like a database table. An e-commerce site might have a products index and a reviews index.

Document — A single record in an index. It’s a JSON object. Each product is one document.

Shard — An index is split into shards for horizontal scaling. Each shard is a self-contained Lucene index. A products index with 5 shards distributes data across the cluster.

Replica — A copy of a shard on a different node. Replicas give us fault tolerance and allow read scaling (searches can hit replicas too).
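
As a rough illustration, shards and replicas are set when an index is created, and each document is then routed to one shard by hashing. The sketch below shows a create-index settings body as a Python dict, plus a toy routing function. number_of_shards and number_of_replicas are real ES settings; the replica count of 1, the pick_shard helper, and the sku-… IDs are assumptions for this example, and crc32 only stands in for the murmur3 hash ES uses for routing.

```python
import json
import zlib

# Settings section of a create-index request body for a products index.
products_index_body = {
    "settings": {
        "number_of_shards": 5,    # primary shards the index is split into
        "number_of_replicas": 1,  # copies kept of each primary shard (assumed value)
    }
}

def pick_shard(doc_id, number_of_shards):
    """Toy routing: hash the document ID, then take it modulo the shard count.

    crc32 is a deterministic stand-in from the standard library, not the
    hash function Elasticsearch actually uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards

shard_count = products_index_body["settings"]["number_of_shards"]
for doc_id in ["sku-1", "sku-2", "sku-3"]:
    print(doc_id, "-> shard", pick_shard(doc_id, shard_count))

print(json.dumps(products_index_body))
```

Because the shard is derived from the ID modulo the shard count, the number of primary shards cannot simply be changed later without moving documents, whereas the replica count can be adjusted at any time.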

How Data Gets Indexed

When we add a document to ES:

The text is passed through an analyzer
The analyzer tokenizes it — breaks text into individual terms (“Wireless Bluetooth Headphones” becomes [“wireless”, “bluetooth”, “headphones”])
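Tokens are normalized: lowercased and, depending on the analyzer, stemmed (“running” becomes “run”) and stripped of stop words (“the”, “is”, “a”)
The resulting terms go into the inverted index

This is why ES can find “running shoes” when we search for “run shoe”, as long as the field uses a stemming analyzer such as the built-in english analyzer. The analyzer reduces both the document and the query to the same root forms. The default standard analyzer only tokenizes and lowercases, so it would not match the two.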
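
Here is a toy version of that pipeline, to show why the document text and the query end up as the same terms. It is a sketch only: analyze and toy_stem are invented for this example, the stop-word list is the three words mentioned above, and the suffix stripping is far cruder than the stemmers Lucene ships.

```python
import re

STOP_WORDS = {"the", "is", "a"}

def toy_stem(token):
    """Very rough stemmer: strips "ing" and a plural "s". Not a real algorithm."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        # Collapse a doubled final consonant: "runn" -> "run".
        if len(token) >= 2 and token[-1] == token[-2]:
            token = token[:-1]
    elif token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        token = token[:-1]
    return token

def analyze(text):
    """Tokenize, lowercase, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("Running shoes"))  # ['run', 'shoe']
print(analyze("run shoe"))       # ['run', 'shoe']
```

The important design point is that the same analyzer runs at index time and at query time. If the two sides were analyzed differently, the terms would not line up in the inverted index.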

Relevance Scoring

Not all results are equal. ES ranks them by relevance using a scoring algorithm (BM25 by default). The score depends on:

Term Frequency (TF): How often does the term appear in this document? More occurrences raise the score, but with diminishing returns: the tenth occurrence adds far less than the second.
Inverse Document Frequency (IDF): How rare is this term across all documents? Rare terms are more meaningful. “the” appears everywhere and tells us nothing. “Elasticsearch” is specific.
Field length: A match in a short title is more significant than a match in a long description.
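
To see how the three factors combine, here is a sketch of the textbook BM25 formula for a single query term. The function name and parameter names are chosen for this example; k1 = 1.2 and b = 0.75 are the usual defaults. Real implementations may differ in constant factors, so treat the absolute numbers as illustrative and only the ordering as meaningful.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Score one query term against one document with the classic BM25 formula."""
    # IDF: the fewer documents contain the term, the larger its weight.
    idf = math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: fields longer than average are penalized.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # Term frequency saturates: each extra occurrence adds less than the last.
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Same term, same frequency: a short title beats a long description.
short_field = bm25_term_score(tf=1, doc_len=5, avg_doc_len=50, num_docs=1000, docs_with_term=10)
long_field = bm25_term_score(tf=1, doc_len=200, avg_doc_len=50, num_docs=1000, docs_with_term=10)
print(short_field > long_field)  # True

# Same document shape: a rare term beats a common one.
rare = bm25_term_score(tf=1, doc_len=50, avg_doc_len=50, num_docs=1000, docs_with_term=3)
common = bm25_term_score(tf=1, doc_len=50, avg_doc_len=50, num_docs=1000, docs_with_term=900)
print(rare > common)  # True
```

For a multi-word query, the per-term scores are summed, so a document matching one rare term can outrank a document matching several common ones.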

When to Use a Search Engine

Use a search engine when:

We need full-text search across large amounts of text
Users expect typo tolerance, synonyms, and relevance ranking
We’re building product search, log analysis, or autocomplete
We need aggregations on text data (faceted search, like filtering by category)

Stick with the database when:

We’re doing exact lookups (get user by ID)
The dataset is small (a few thousand rows)
We only need simple prefix matching (PostgreSQL can serve LIKE 'prefix%' from a B-tree index, provided the index uses the C collation or the text_pattern_ops operator class)
We don’t want to maintain another system

Common Use Cases

Product search — The classic. Users search “blue running shoes size 10” and we need to match across multiple fields (name, description, category, size) with relevance ranking. Elasticsearch shines here.
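
A multi-field search like this is usually expressed as a multi_match query. The sketch below only builds the request body as a Python dict; it does not talk to a cluster. The field names (name, description, category) and the boost on name are assumptions about a hypothetical products index, while multi_match, the ^ boost syntax, and fuzziness are part of the ES query DSL.

```python
import json

def build_product_query(user_text):
    """Build a search request body that matches the text across several fields."""
    return {
        "query": {
            "multi_match": {
                "query": user_text,
                # "name^3" weights a match in the name three times as heavily.
                "fields": ["name^3", "description", "category"],
                # Allow small typos, with the edit distance chosen by term length.
                "fuzziness": "AUTO",
            }
        }
    }

body = build_product_query("blue running shoes")
print(json.dumps(body, indent=2))
```

Exact-value constraints such as size are usually better handled as filters on keyword or numeric fields than as part of the free-text match.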

Log analysis — Centralize logs from hundreds of servers. The ELK stack (Elasticsearch, Logstash, Kibana) is the standard for log search and visualization.

Autocomplete — As the user types “wire…”, we suggest “wireless headphones”, “wireless charger”, etc. ES has dedicated completion and search_as_you_type field types for this.
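
As a sketch of the search_as_you_type approach, the mapping and query below are request bodies written as Python dicts. The field name "name" is an assumption; the ._2gram and ._3gram subfields and the bool_prefix query type are what ES provides for a search_as_you_type field.

```python
import json

# Mapping: declare the field as search_as_you_type when creating the index.
autocomplete_mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "search_as_you_type"}
        }
    }
}

def build_autocomplete_query(typed_so_far):
    """Build a query that treats the last word typed as a prefix."""
    return {
        "query": {
            "multi_match": {
                "query": typed_so_far,
                "type": "bool_prefix",
                # The n-gram subfields are generated automatically by the field type.
                "fields": ["name", "name._2gram", "name._3gram"],
            }
        }
    }

print(json.dumps(build_autocomplete_query("wire")))
```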

Geosearch — Find restaurants within 5km. ES supports geo queries natively with geo_point fields and distance filters.
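
A "within 5km" search could look roughly like this. Again these are request bodies built as Python dicts; the field name location and the coordinates are placeholders, while geo_point and geo_distance are the ES mapping type and query named above.

```python
import json

# Mapping: store each restaurant's coordinates in a geo_point field.
restaurant_mapping = {
    "mappings": {
        "properties": {
            "location": {"type": "geo_point"}
        }
    }
}

def build_nearby_query(lat, lon, distance="5km"):
    """Build a query that keeps only documents within `distance` of a point."""
    return {
        "query": {
            "bool": {
                # A filter does not affect relevance scores; it only includes or excludes.
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        "location": {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }

print(json.dumps(build_nearby_query(52.52, 13.405)))
```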

Keeping Search in Sync with the Database

The database is still the source of truth. ES is a secondary index. We need to keep them in sync.

Dual write — Write to both DB and ES. Simple but risky — if one write fails, they’re out of sync.

Change Data Capture (CDC) — Listen to database changes (like Debezium reading the Postgres WAL) and stream them to ES. More reliable. The database doesn’t even know ES exists.

Periodic sync — A background job re-indexes data from DB to ES every few minutes. Simple but introduces lag.

CDC is usually the best fit for production systems: it is reliable, near real-time, and decoupled from application code. The trade-off is one more pipeline to operate.
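
To make the CDC flow concrete, here is a minimal sketch of the consumer side. It assumes events shaped like Debezium's change envelope (an op code plus before and after row images) and uses a plain dict as a stand-in for the search index; apply_change_event, the id and name columns, and the sample events are all invented for this example. A real consumer would also handle batching, retries, and the other op codes.

```python
def apply_change_event(search_index, event):
    """Apply one Debezium-style change event to the search index.

    search_index is a plain dict (document ID -> document) standing in for ES.
    """
    op = event["op"]
    if op in ("c", "u", "r"):           # create, update, snapshot read
        row = event["after"]
        search_index[row["id"]] = row   # upsert, so replaying an event is harmless
    elif op == "d":                     # delete
        search_index.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unhandled op: {op!r}")

search_index = {}
events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "name": "wireless headphones"}},
    {"op": "u", "before": {"id": 1, "name": "wireless headphones"},
     "after": {"id": 1, "name": "wireless headphones v2"}},
    {"op": "c", "before": None,
     "after": {"id": 2, "name": "wired mouse"}},
    {"op": "d", "before": {"id": 2, "name": "wired mouse"},
     "after": None},
]
for event in events:
    apply_change_event(search_index, event)

print(search_index)  # {1: {'id': 1, 'name': 'wireless headphones v2'}}
```

Writing every change as an upsert or an idempotent delete matters here: change streams are typically delivered at least once, so the consumer has to tolerate seeing the same event twice.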

In short, search engines use inverted indexes (maps from words to the documents that contain them) to find things fast. For free-text queries, a regular database ends up scanning rows; a search engine looks up words directly. Elasticsearch gives us full-text search, relevance ranking, and fuzzy matching for typos. We use it alongside our database, not instead of it.