Analyzers, Tokenizers & Token Filters - Elasticsearch

When we index a text field, ES doesn’t just store the raw string: it runs it through an analyzer that chops it into tokens (the terms that go into the inverted index). By default the same analyzer also runs on the query string at search time, so query terms and indexed terms match up.

In simple language: an analyzer is a pipeline that turns “Sony WH-1000XM5 Headphones!” into [sony, wh, 1000xm5, headphones].

The pipeline: three stages

Analyzer pipeline:

Raw text ("Sony WH-1000XM5!") → Char filters (strip HTML, replace chars) → Tokenizer (split into tokens) → Token filters (lowercase, stem, dedupe) → Tokens ([sony, wh, 1000xm5])
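
The three stages can be tried without creating an index by passing them ad hoc to the _analyze API (covered in more detail further down). A minimal sketch; the sample text with a <b> tag is made up for illustration:

```json
POST /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>Sony</b> WH-1000XM5!"
}
```

This should return the tokens sony, wh and 1000xm5: html_strip removes the tag, the standard tokenizer splits on the space and the hyphen and drops the "!", and lowercase normalizes case.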

1. Character filters

Run on the raw string before tokenization. Strip HTML tags, replace patterns, map characters.

html_strip — removes <p> and friends
mapping — custom string replacements (“&” → “ and ”)
pattern_replace — regex find/replace
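
To see what char filters do on their own, one option is to pair them with the keyword tokenizer, which emits the filtered string as a single token. A sketch with an inline mapping filter; the sample text is an assumption:

```json
POST /_analyze
{
  "char_filter": [
    "html_strip",
    { "type": "mapping", "mappings": ["& => and"] }
  ],
  "tokenizer": "keyword",
  "text": "<b>Tom &amp; Jerry</b>"
}
```

The single token should read Tom and Jerry. Char filters run in the order listed: html_strip removes the tags and decodes &amp; to &, and only then does the mapping filter see a plain & to replace.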

2. Tokenizer (exactly one)

Splits the string into tokens. Examples:

standard — splits on word boundaries (Unicode-aware). Most common.
whitespace — splits on whitespace only. Keeps punctuation in tokens.
keyword — doesn’t split. Whole string as one token.
ngram / edge_ngram — generates substrings (for autocomplete).
path_hierarchy — splits /usr/local/bin into /usr, /usr/local, /usr/local/bin.
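
A tokenizer can also be tested in isolation by sending only "tokenizer" and "text" to _analyze. A small sketch of two of the tokenizers listed above:

```json
POST /_analyze
{
  "tokenizer": "whitespace",
  "text": "Sony WH-1000XM5 Headphones!"
}

POST /_analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/bin"
}
```

The first request should return Sony, WH-1000XM5 and Headphones! (case and punctuation untouched). The second should return /usr, /usr/local and /usr/local/bin.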

3. Token filters (any number, in order)

Operate on the token stream. Each one modifies, adds, or removes tokens.

lowercase — almost always wanted
stop — removes “a”, “the”, “is” etc.
stemmer — reduces “running” → “run”, “buying” → “buy”
synonym — “tv” ↔ “television”
asciifolding — “café” → “cafe”
edge_ngram — generates prefixes for autocomplete
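
Filter order matters, which is easiest to see by chaining a few of them in one _analyze call. A sketch, assuming an inline stemmer definition and a made-up sample sentence:

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "asciifolding",
    "stop",
    { "type": "stemmer", "language": "english" }
  ],
  "text": "The café is running"
}
```

The result should be roughly cafe and run. lowercase sits before stop on purpose: the stop filter is case-sensitive by default, so a capitalized “The” would survive if the two were swapped.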

Built-in analyzers (don’t reinvent the wheel)

ES ships with several ready-made analyzers:

standard (default) — standard tokenizer + lowercase. Fine for most use cases.
simple — splits on anything that is not a letter + lowercase. Digits are discarded, so “1000XM5” yields only “xm”.
whitespace — just splits on whitespace, no lowercasing.
keyword — treats the whole input as one token (similar to a keyword field).
english (and other language analyzers) — adds stemming, stopwords, possessives.
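
A quick way to choose between the built-ins is to run the same text through several of them. A sketch comparing two:

```json
POST /_analyze
{
  "analyzer": "simple",
  "text": "Sony WH-1000XM5"
}

POST /_analyze
{
  "analyzer": "whitespace",
  "text": "Sony WH-1000XM5"
}
```

simple should return sony, wh and xm (the digits are gone). whitespace should return Sony and WH-1000XM5 unchanged.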

Testing analyzers with `_analyze`

This is the killer debugging tool. Run any analyzer against any text:

POST /_analyze
{
  "analyzer": "english",
  "text": "The running headphones are amazing"
}

Response (abbreviated; each token in the real response also carries start_offset, end_offset and type):

{
  "tokens": [
    { "token": "run",      "position": 1 },
    { "token": "headphon", "position": 2 },
    { "token": "amaz",     "position": 4 }
  ]
}

Notice: “The” and “are” were dropped as stopwords (which is why positions 0 and 3 are missing), and “running” was stemmed to “run”. This is why a match query for "run" finds documents containing “running” when the field uses the english analyzer.
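
The effect on search can be checked end to end with a throwaway index. A minimal sketch, assuming ES 7 or later (typeless APIs); the index name analyzer_demo is hypothetical:

```json
PUT /analyzer_demo
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "english" }
    }
  }
}

PUT /analyzer_demo/_doc/1?refresh=true
{
  "title": "The running headphones are amazing"
}

GET /analyzer_demo/_search
{
  "query": {
    "match": { "title": "run" }
  }
}
```

The search should return document 1, because both the indexed text and the query are reduced to the term run.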

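Defining a custom analyzer

Char filters run in the order listed, so html_strip comes first here: it decodes entities such as &amp; to & before the mapping filter rewrites & to “and”.

PUT /products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "ampersand_to_and": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "analyzer": {
        "product_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "ampersand_to_and"],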
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "stop", "english_stemmer"]
        }
      },
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "product_analyzer" }
    }
  }
}
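
Once the index exists, the custom analyzer can be exercised by name through the index-scoped _analyze endpoint. A sketch; the sample text is made up:

```json
POST /products/_analyze
{
  "analyzer": "product_analyzer",
  "text": "<b>Café</b> & Bar: the running shoes"
}
```

The tokens should come back as roughly cafe, bar, run and shoe: the tag is stripped, “&” becomes “and” and is then removed as a stopword along with “the”, the accent is folded, and the remaining words are stemmed.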

Index-time vs search-time analyzers

By default, the same analyzer runs at both. You CAN specify different ones:

"title": {
  "type": "text",
  "analyzer": "product_analyzer",         // at index time
  "search_analyzer": "standard"           // at query time
}

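When does this matter? Autocomplete. Use an edge_ngram filter at index time (it expands “sony” into “s”, “so”, “son”, “sony”) and standard at search time (the query “son” stays the single token “son”). Otherwise the search side would explode the query into prefixes too, and a query for “son” would also match every term that merely starts with “s” or “so”.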
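
Put together, an autocomplete mapping could look like the sketch below. The names autocomplete_demo, autocomplete_filter and autocomplete_index are hypothetical, and the gram sizes are an assumption to tune for your data:

```json
PUT /autocomplete_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete_index": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete_index",
        "search_analyzer": "standard"
      }
    }
  }
}
```

max_gram is set explicitly because the edge_ngram filter defaults to a max_gram of 2, which would only produce “s” and “so”.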

The golden rule

If your search returns nothing, run _analyze on both the indexed text AND the query string. The tokens must match. 90% of “ES search isn’t working” issues are analyzer mismatches.
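
In practice that check is two _analyze calls, one per side. A sketch against the products index defined earlier, assuming the query side is analyzed with standard (as in the search_analyzer fragment above):

```json
POST /products/_analyze
{
  "analyzer": "product_analyzer",
  "text": "Running shoes"
}

POST /products/_analyze
{
  "analyzer": "standard",
  "text": "running"
}
```

The first call should return run and shoe, the second running. The two sides share no token, so a match query for “running” would find nothing here; pairing a stemming index analyzer with a non-stemming search analyzer is exactly the kind of mismatch this rule catches.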