Analyzers, Tokenizers & Token Filters


When we index a text field, ES doesn’t just store the raw string: it runs it through an analyzer that chops it into tokens (the terms that go into the inverted index). By default, the same analyzer runs at search time on the query string, so the query terms line up with the indexed terms.

In simple language: an analyzer is a pipeline that turns “Sony WH-1000XM5 Headphones!” into [sony, wh, 1000xm5, headphones].

The pipeline: three stages

Analyzer pipeline:

  Raw text         "Sony WH-1000XM5!"
  → Char filters   strip HTML, replace chars
  → Tokenizer      split into tokens
  → Token filters  lowercase, stem, dedupe
  → Tokens         [sony, wh, 1000xm5]
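
As a rough mental model, the three stages can be sketched in plain Python. This is not how ES is implemented: the names analyze, strip_html, word_tokenizer and lowercase are made up for illustration, and the regex tokenizer only approximates the standard tokenizer for simple ASCII text.

```python
import re
from typing import Callable, Iterable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: Iterable[CharFilter],
            tokenizer: Tokenizer,
            token_filters: Iterable[TokenFilter]) -> List[str]:
    """Run char filters, then exactly one tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


# Toy stand-ins for the real components (assumptions, not ES code).
def strip_html(text: str) -> str:
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text: str) -> List[str]:
    # Runs of ASCII letters and digits; punctuation and hyphens split tokens.
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


tokens = analyze("<b>Sony</b> WH-1000XM5 Headphones!",
                 [strip_html], word_tokenizer, [lowercase])
print(tokens)
```

The point of the sketch is the shape: the string is transformed first, split once, and only then does the token list get filtered.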

1. Character filters

Run on the raw string before tokenization. Strip HTML tags, replace patterns, map characters.

  • html_strip — removes <p> and friends
  • mapping — custom string replacements (“&” → “ and ”)
  • pattern_replace — regex find/replace
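
To make the three character filters concrete, here is a minimal sketch using Python’s standard library. The functions are named after the ES filters but are toy approximations (for example, this mapping applies rules one after another, which is a simplification), and the sample string is invented.

```python
import html
import re
from typing import Dict


def html_strip(text: str) -> str:
    """Replace tags with a space and decode entities such as &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def mapping(text: str, rules: Dict[str, str]) -> str:
    """Apply literal string replacements, longest source string first."""
    for source in sorted(rules, key=len, reverse=True):
        text = text.replace(source, rules[source])
    return text


def pattern_replace(text: str, pattern: str, replacement: str) -> str:
    """Regex find/replace on the raw string."""
    return re.sub(pattern, replacement, text)


raw = "<p>Salt &amp; Pepper, SKU_12345</p>"
step1 = html_strip(raw)                             # tags gone, &amp; decoded to &
step2 = mapping(step1, {"&": " and "})              # & spelled out
step3 = pattern_replace(step2, r"_(\d+)", r" \1")   # underscore before digits becomes a space
print(step3.split())
```

All of this happens on the raw string, before any token exists, which is why char filters can change where the tokenizer later splits.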

2. Tokenizer (exactly one)

Splits the string into tokens. Examples:

  • standard — splits on word boundaries (Unicode-aware). Most common.
  • whitespace — splits on whitespace only. Keeps punctuation in tokens.
  • keyword — doesn’t split. Whole string as one token.
  • ngram / edge_ngram — generates substrings (for autocomplete).
  • path_hierarchy — splits /usr/local/bin into /usr, /usr/local, /usr/local/bin.
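
The differences between tokenizers are easiest to see side by side. Below is a small sketch of four of them in Python. These are simplified imitations, and the min_gram / max_gram defaults are chosen for the example rather than copied from ES.

```python
from typing import List


def whitespace_tokenizer(text: str) -> List[str]:
    """Split on whitespace only, so punctuation stays attached."""
    return text.split()


def keyword_tokenizer(text: str) -> List[str]:
    """No splitting: the whole input is a single token."""
    return [text]


def edge_ngram_tokenizer(text: str, min_gram: int = 1, max_gram: int = 4) -> List[str]:
    """Prefixes of the input, from min_gram up to max_gram characters."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def path_hierarchy_tokenizer(path: str, delimiter: str = "/") -> List[str]:
    """Every ancestor path, shortest first."""
    parts = [p for p in path.split(delimiter) if p]
    prefix = delimiter if path.startswith(delimiter) else ""
    return [prefix + delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


print(whitespace_tokenizer("WH-1000XM5 Headphones!"))
print(keyword_tokenizer("New York"))
print(edge_ngram_tokenizer("sony"))
print(path_hierarchy_tokenizer("/usr/local/bin"))
```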

3. Token filters (any number, in order)

Operate on the token stream. Each one modifies, adds, or removes tokens.

  • lowercase — almost always wanted
  • stop — removes “a”, “the”, “is” etc.
  • stemmer — reduces “running” → “run”, “buying” → “buy”
  • synonym — “tv” ↔ “television”
  • asciifolding — “café” → “cafe”
  • edge_ngram — generates prefixes for autocomplete
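
“In order” matters more than it looks. The sketch below uses toy Python versions of three filters (a three-word stopword list and an accent-stripping stand-in for asciifolding, both assumptions) to show that putting stop before lowercase lets a capitalized “The” slip through.

```python
import unicodedata
from typing import Callable, List

STOPWORDS = {"a", "the", "is"}  # tiny illustrative list


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def stop(tokens: List[str]) -> List[str]:
    # Case-sensitive comparison: "The" is not the same as "the".
    return [t for t in tokens if t not in STOPWORDS]


def asciifolding(tokens: List[str]) -> List[str]:
    """Strip combining accents; covers only part of what real ASCII folding does."""
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def apply_filters(tokens: List[str],
                  filters: List[Callable[[List[str]], List[str]]]) -> List[str]:
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = ["The", "Café", "is", "open"]
good = apply_filters(tokens, [lowercase, stop, asciifolding])
bad = apply_filters(tokens, [stop, lowercase, asciifolding])
print(good)
print(bad)
```

In the second ordering, “The” survives the stop filter because it has not been lowercased yet.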

Built-in analyzers (don’t reinvent the wheel)

ES ships with several ready-made analyzers:

  • standard (default) — standard tokenizer + lowercase. Fine for most use cases.
  • simple — splits on any non-letter character + lowercase. Digits are discarded.
  • whitespace — just splits on whitespace, no lowercasing.
  • keyword — treats the whole input as one token (similar to a keyword field).
  • english (and other language analyzers) — adds stemming, stopwords, possessives.
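
The difference between standard and simple shows up as soon as the text contains digits. A rough Python imitation follows; standard_like and simple_like are hypothetical names, and both are regex approximations that only hold for plain ASCII input.

```python
import re
from typing import List

TEXT = "Sony WH-1000XM5 Headphones!"


def standard_like(text: str) -> List[str]:
    """Approximates standard: runs of letters and digits, lowercased."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def simple_like(text: str) -> List[str]:
    """Approximates simple: runs of letters only, lowercased."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


standard_tokens = standard_like(TEXT)
simple_tokens = simple_like(TEXT)
print(standard_tokens)
print(simple_tokens)
```

With simple, the model number falls apart into “wh” and “xm”, so a search for “1000xm5” has nothing to match.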

Testing analyzers with _analyze

This is the killer debugging tool. Run any analyzer against any text:

POST /_analyze
{
  "analyzer": "english",
  "text": "The running headphones are amazing"
}

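Response (trimmed to token and position; the real response also includes start_offset, end_offset and type for each token):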

{
  "tokens": [
    { "token": "run",      "position": 1 },
    { "token": "headphon", "position": 2 },
    { "token": "amaz",     "position": 4 }
  ]
}

Notice: “The” and “are” are dropped (stopwords), which is why positions 0 and 3 are missing, and “running” is stemmed to “run”. The query string goes through the same analyzer, so match: "run" finds documents containing “running” with the english analyzer.
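
Why the match works can be sketched with a toy inverted index in Python. The stopwords and stems below are hard-coded from the response above and stand in for the real english analyzer; the second document and the function names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hard-coded stand-in for the english analyzer (assumption: only these words matter).
STOPWORDS = {"the", "are"}
STEMS = {"running": "run", "headphones": "headphon", "amazing": "amaz"}


def toy_english(text: str) -> List[str]:
    tokens = []
    for word in text.lower().split():
        if word in STOPWORDS:
            continue
        tokens.append(STEMS.get(word, word))
    return tokens


docs = {
    1: "The running headphones are amazing",
    2: "The wired headphones are cheap",
}

# Inverted index: term -> ids of the documents that contain it.
index: Dict[str, Set[int]] = defaultdict(set)
for doc_id, text in docs.items():
    for term in toy_english(text):
        index[term].add(doc_id)


def search(query: str) -> Set[int]:
    """Analyze the query the same way, then look the terms up."""
    hits: Set[int] = set()
    for term in toy_english(query):
        hits |= index.get(term, set())
    return hits


print(search("run"))
print(search("Running"))
print(search("headphones"))
```

The index never contains the word “running”, only “run”. The query is reduced to the same term, and that is the whole trick.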

Defining a custom analyzer

PUT /products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "ampersand_to_and": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "analyzer": {
        "product_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "ampersand_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "stop", "english_stemmer"]
        }
      },
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "product_analyzer" }
    }
  }
}
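
Once the index exists, the custom analyzer can be checked through the index-scoped _analyze endpoint. Here is a minimal Python sketch that only builds the request. The host http://localhost:9200, the helper name build_analyze_request and the sample text are assumptions, and nothing is sent until urlopen is called.

```python
import json
import urllib.request


def build_analyze_request(base_url: str, index: str,
                          analyzer: str, text: str) -> urllib.request.Request:
    """Build (but do not send) POST /<index>/_analyze for a named analyzer."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request(
    "http://localhost:9200",      # assumed local cluster without auth
    "products",
    "product_analyzer",
    "Sony & Bose <b>Headphones</b>",
)
print(request.get_method(), request.full_url)

# To actually run it against a live cluster:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["tokens"])
```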

Index-time vs search-time analyzers

By default, the same analyzer runs at both index time and search time. You can specify different ones:

"title": {
  "type": "text",
  "analyzer": "product_analyzer",         // at index time
  "search_analyzer": "standard"           // at query time
}

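When does this matter? Autocomplete. Use edge_ngram at index time (“sony” is indexed as “s”, “so”, “son”, “sony”) and standard at search time (the query “son” stays “son”). Otherwise the search side would explode the query into prefixes too: a query for “son” would also search for “s” and “so”, and match every document that starts with “s”.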
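
The effect is easy to reproduce with a toy index in Python. The three product names, the function names and the max_gram of 10 are made up for the example, and the analyzers are simplified stand-ins.

```python
from typing import Callable, Dict, List, Set


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 10) -> List[str]:
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_analyzer(text: str) -> List[str]:
    """Lowercase, split on whitespace, then expand each word into prefixes."""
    return [gram for word in text.lower().split() for gram in edge_ngrams(word)]


def search_analyzer(text: str) -> List[str]:
    """Lowercase and split only: no prefix expansion on the query side."""
    return text.lower().split()


docs = {1: "Sony", 2: "Samsung", 3: "Sennheiser"}

index: Dict[str, Set[int]] = {}
for doc_id, title in docs.items():
    for term in index_analyzer(title):
        index.setdefault(term, set()).add(doc_id)


def search(query: str, analyzer: Callable[[str], List[str]]) -> Set[int]:
    hits: Set[int] = set()
    for term in analyzer(query):
        hits |= index.get(term, set())
    return hits


print(search("son", search_analyzer))  # plain query term
print(search("son", index_analyzer))   # query exploded into s, so, son
```

With the plain search analyzer only the Sony document comes back. With the n-gram analyzer on both sides, the single-letter prefix “s” drags in all three.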

The golden rule

If your search returns nothing, run _analyze on both the indexed text (with the field’s index analyzer) and the query string (with its search analyzer). The tokens must match. Many “ES search isn’t working” issues turn out to be analyzer mismatches.
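
The comparison itself is trivial once you have the two token lists from _analyze. Here is a small Python sketch. The helper name unmatched_query_terms is hypothetical, and the token lists are typed in by hand from the english example earlier rather than fetched from a cluster.

```python
from typing import List


def unmatched_query_terms(indexed_tokens: List[str], query_tokens: List[str]) -> List[str]:
    """Return the query tokens that do not appear among the indexed tokens."""
    indexed = set(indexed_tokens)
    return [token for token in query_tokens if token not in indexed]


# "The running headphones are amazing" indexed with the english analyzer.
indexed_tokens = ["run", "headphon", "amaz"]

# The query "running" analyzed two different ways.
query_with_standard = ["running"]  # standard: lowercased, not stemmed
query_with_english = ["run"]       # english: stemmed like the index

print(unmatched_query_terms(indexed_tokens, query_with_standard))
print(unmatched_query_terms(indexed_tokens, query_with_english))
```

A non-empty result for the first call is exactly the kind of mismatch that makes a search come back empty.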