Relevance & Scoring (TF-IDF, BM25) - Elasticsearch

When we search for “fast laptop”, Elasticsearch returns matching docs sorted by a _score. That score is a number telling us how relevant the doc is. The higher the score, the better the match.

In simple language: scoring is “how strongly does this document match my query, given the words it contains and how common those words are across the whole index”.

TF-IDF (the old way)

Before ES 5, the default scoring used TF-IDF:

TF (Term Frequency) — the more times a term appears in a doc, the higher the score.
IDF (Inverse Document Frequency) — rare terms across the index count more. “laptop” matters more than “the”.
Field length norm — shorter fields score higher (a match in a title beats a match in a long description).

The problem: TF grows without bound. Lucene's classic similarity dampened it with a square root, but a doc that mentions “laptop” 100 times still scores several times higher than one that mentions it 5 times, even though both are obviously about laptops.
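To make the unbounded growth concrete, here is a minimal sketch of a TF-IDF style score. It is modeled on the square-root TF and 1/sqrt(length) norm of Lucene's classic similarity, but it is a simplification: query normalization, coordination factors and boosts are left out, and the function names and the exact IDF form are our own assumptions for illustration.

```python
import math

def classic_tf(freq: int) -> float:
    # Square-root dampening: grows slowly, but never levels off.
    return math.sqrt(freq)

def classic_idf(doc_count: int, doc_freq: int) -> float:
    # Rare terms (small doc_freq) get a larger weight. Assumed form.
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))

def classic_length_norm(field_length: int) -> float:
    # Shorter fields score higher.
    return 1.0 / math.sqrt(field_length)

def classic_score(freq: int, doc_count: int, doc_freq: int, field_length: int) -> float:
    # Simplified single-term score: tf * idf * norm.
    return (classic_tf(freq)
            * classic_idf(doc_count, doc_freq)
            * classic_length_norm(field_length))

# Same index, same field length, only the term frequency differs.
few = classic_score(freq=5, doc_count=1000, doc_freq=50, field_length=400)
many = classic_score(freq=100, doc_count=1000, doc_freq=50, field_length=400)
print(many / few)  # ratio is sqrt(100 / 5), roughly 4.47
```

Under these assumptions the score keeps climbing as the term is repeated, which is exactly the behaviour BM25 was designed to cap.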

BM25 (the current default, since ES 5.0)

BM25 stands for “Best Matching 25”. Think of it like TF-IDF with two important fixes:

TF saturation — repeating a term gives diminishing returns. After 5–10 occurrences, more mentions barely move the needle.
Length normalization is tunable — controlled by a parameter b.
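A quick way to see the saturation is to evaluate only the TF part of BM25 for a doc of exactly average length, where the length term drops out. This is a sketch, not Elasticsearch code: `bm25_tf` is a hypothetical helper name, and `k1` is the saturation parameter (default 1.2).

```python
def bm25_tf(tf: float, k1: float = 1.2) -> float:
    # TF component of BM25 when doc length == average doc length.
    # It approaches, but never reaches, k1 + 1.
    return tf * (k1 + 1) / (tf + k1)

for tf in (1, 2, 5, 10, 100):
    print(tf, round(bm25_tf(tf), 3))
# 1   -> 1.0
# 2   -> 1.375
# 5   -> 1.774
# 10  -> 1.964
# 100 -> 2.174
```

Going from 1 to 2 occurrences adds about 0.375, while going from 10 to 100 adds only about 0.21, and the value can never exceed k1 + 1 = 2.2.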

BM25 formula (simplified)

score = IDF(term) × (tf × (k1 + 1)) / (tf + k1 × (1 - b + b × (dl / avgdl)))

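k1 (default 1.2): controls TF saturation. Higher = TF saturates more slowly, so repeated terms keep counting for longer.
b (default 0.75): controls length norm. 0 = ignore length, 1 = full normalization.
dl = length of the field in this doc (in terms), avgdl = average length of that field across the index.

The formula gives the score for a single query term. The doc's score for a multi-term query like “fast laptop” is the sum over the matching terms.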
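The formula above can be sketched in a few lines of Python to see how b and doc length interact. The function names and the sample statistics are made up for illustration, and the IDF shown is the smoothed form commonly used with BM25, ln(1 + (N - n + 0.5) / (n + 0.5)), which we assume here since the simplified formula does not spell it out.

```python
import math

def bm25_idf(doc_count: int, doc_freq: int) -> float:
    # Smoothed IDF: always positive, larger for rare terms.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(tf: float, dl: float, avgdl: float,
                    doc_count: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    # Length factor is 1 for an average-length doc, >1 for longer ones.
    length_factor = 1 - b + b * (dl / avgdl)
    return bm25_idf(doc_count, doc_freq) * (tf * (k1 + 1)) / (tf + k1 * length_factor)

# Hypothetical index statistics: 1000 docs, term appears in 50 of them.
stats = dict(avgdl=100, doc_count=1000, doc_freq=50)

short_doc = bm25_term_score(tf=3, dl=20, **stats)
long_doc = bm25_term_score(tf=3, dl=500, **stats)
print(short_doc > long_doc)  # True: same tf, the shorter doc wins

# With b = 0 the length term disappears and both docs score the same.
short_b0 = bm25_term_score(tf=3, dl=20, b=0.0, **stats)
long_b0 = bm25_term_score(tf=3, dl=500, b=0.0, **stats)
print(abs(short_b0 - long_b0) < 1e-12)  # True
```

For a multi-term query we would call `bm25_term_score` once per matching term and add the results.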

Tuning BM25 per field

We can override k1 and b by defining a custom similarity in the index settings and then assigning it to individual fields in the mapping:

PUT /products
{
  "settings": {
    "index": {
      "similarity": {
        "custom_bm25": {
          "type": "BM25",
          "k1": 1.5,
          "b": 0.5
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "similarity": "custom_bm25"
      }
    }
  }
}

Debugging scores with `explain`

When relevance feels off, use explain to see why a doc scored what it did:

GET /products/_search
{
  "explain": true,
  "query": { "match": { "description": "fast laptop" } }
}

The response includes a breakdown: IDF value, TF value, field length, and the final BM25 product for each term.
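Each hit in an explain response carries an `_explanation` object: a tree of nodes with `value`, `description` and `details`. A small helper can flatten that tree into indented lines that are easier to read than the raw JSON. This is a sketch: `flatten_explanation` is a hypothetical helper, and the sample below is hand-written with invented numbers and shortened descriptions, not real Elasticsearch output.

```python
def flatten_explanation(node: dict, depth: int = 0) -> list:
    # One line per node, indented by its depth in the explanation tree.
    lines = [f"{'  ' * depth}{node['value']:.4f}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(flatten_explanation(child, depth + 1))
    return lines

# Hand-written sample shaped like hits.hits[i]._explanation (values invented).
sample = {
    "value": 1.1968,
    "description": "weight for term 'laptop' in field 'description'",
    "details": [
        {
            "value": 1.1968,
            "description": "score, computed as boost * idf * tf",
            "details": [
                {"value": 2.2, "description": "boost", "details": []},
                {"value": 0.8, "description": "idf", "details": []},
                {"value": 0.68, "description": "tf", "details": []},
            ],
        }
    ],
}

for line in flatten_explanation(sample):
    print(line)
```

In practice we would pass each hit's `_explanation` from the parsed search response instead of the sample dict.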

When BM25 isn’t enough

BM25 only looks at lexical matches: it has no idea “laptop” and “notebook” mean the same thing, and it knows nothing about freshness or popularity. To go beyond plain term matching, we layer on:

Synonyms at analyzer time (cheap, fast)
Function score / script_score to boost recent or popular docs
Dense vector search (kNN) for true semantic matching

For most CRUD-y search interview questions though, “BM25 with TF saturation and length normalization” is the right answer.
