When we index a text field, ES doesn’t just store the raw string. It runs it through an analyzer that chops it into tokens (the terms that go into the inverted index). By default the same analyzer also runs on the query string at search time, so query tokens and indexed tokens match up.
In simple language: an analyzer is a pipeline that turns “Sony WH-1000XM5 Headphones!” into [sony, wh, 1000xm5, headphones].
The pipeline: three stages
1. Character filters
Run on the raw string before tokenization. Strip HTML tags, replace patterns, map characters.
- html_strip: removes <p> and friends
- mapping: custom character replacements (“&” → “ and ”)
- pattern_replace: regex find/replace
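To make the character-filter stage concrete, here is a rough Python sketch of the three filters acting on a raw string. It is a toy stand-in, not how ES implements them: the function names mirror the filter names, the sample string and the digit pattern are made up for illustration, and the tag-stripping regex is far cruder than the real html_strip.

```python
import html
import re


def html_strip(text: str) -> str:
    """Crude stand-in for html_strip: drop tags, then decode entities."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(without_tags)


def mapping(text: str, rules: dict[str, str]) -> str:
    """Stand-in for the mapping char filter: literal string replacements."""
    for source, target in rules.items():
        text = text.replace(source, target)
    return text


def pattern_replace(text: str, pattern: str, replacement: str) -> str:
    """Stand-in for pattern_replace: regex find/replace."""
    return re.sub(pattern, replacement, text)


# Hypothetical input; each step works on the raw string, before any tokens exist.
raw = "<p>Salt &amp; Pepper, model 123-456</p>"
step1 = html_strip(raw)
step2 = mapping(step1, {"&": " and "})
step3 = pattern_replace(step2, r"(\d+)-(\d+)", r"\1_\2")
print(step3.split())
```

The point is only that this stage sees one string and hands one string to the tokenizer.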
2. Tokenizer (exactly one)
Splits the string into tokens. Examples:
- standard: splits on word boundaries (Unicode-aware). Most common.
- whitespace: splits on whitespace only. Keeps punctuation in tokens.
- keyword: doesn’t split. Whole string as one token.
- ngram / edge_ngram: generates substrings (for autocomplete).
- path_hierarchy: splits /usr/local/bin into /usr, /usr/local, /usr/local/bin.
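The differences between these tokenizers are easiest to see side by side. The following Python sketch imitates them with plain string operations. It is an approximation under stated assumptions: `standard_like` uses a simple `\w+` regex rather than real Unicode word-boundary rules, and `edge_ngrams` works on a single word, with `min_gram` and `max_gram` named after the ES settings.

```python
import re


def standard_like(text):
    """Rough approximation of the standard tokenizer (not real UAX #29 rules)."""
    return re.findall(r"\w+", text)


def whitespace(text):
    """Split on whitespace only; punctuation stays attached."""
    return text.split()


def keyword(text):
    """No splitting: the whole input is one token."""
    return [text]


def edge_ngrams(word, min_gram=1, max_gram=4):
    """Prefixes of a single word, from min_gram to max_gram characters."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]


def path_hierarchy(path, delimiter="/"):
    """Emit every ancestor path, shortest first."""
    leading = delimiter if path.startswith(delimiter) else ""
    parts = [p for p in path.split(delimiter) if p]
    return [leading + delimiter.join(parts[: i + 1]) for i in range(len(parts))]


text = "Sony WH-1000XM5 Headphones!"
print(standard_like(text))   # hyphen and "!" are dropped as boundaries
print(whitespace(text))      # "WH-1000XM5" and "Headphones!" stay intact
print(keyword(text))
print(edge_ngrams("sony"))
print(path_hierarchy("/usr/local/bin"))
```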
3. Token filters (any number, in order)
Operate on the token stream. Each one modifies, adds, or removes tokens.
- lowercase: almost always wanted
- stop: removes “a”, “the”, “is” etc.
- stemmer: reduces “running” → “run”, “buying” → “buy”
- synonym: “tv” ↔ “television”
- asciifolding: “café” → “cafe”
- edge_ngram: generates prefixes for autocomplete
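Because token filters run in the order listed, the same filters in a different order can produce different tokens. The sketch below shows that with toy versions of three filters. The stopword set is a tiny illustrative list, not ES’s, and stemming and synonyms are left out because they need real dictionaries or algorithms.

```python
import unicodedata

STOPWORDS = {"a", "the", "is"}  # tiny illustrative list, lowercase only


def lowercase(tokens):
    return [t.lower() for t in tokens]


def stop(tokens):
    # Case-sensitive on purpose: "The" is not in a lowercase stopword list.
    return [t for t in tokens if t not in STOPWORDS]


def asciifolding(tokens):
    folded = []
    for t in tokens:
        decomposed = unicodedata.normalize("NFKD", t)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def apply_filters(tokens, filters):
    """Run each filter over the token stream, in order."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = ["The", "Café", "is", "Open"]
lower_first = apply_filters(tokens, [lowercase, stop, asciifolding])
stop_first = apply_filters(tokens, [stop, lowercase, asciifolding])
print(lower_first)  # "the" was lowercased before the stop filter saw it
print(stop_first)   # "The" slipped past the stop filter
```

This is the reason lowercase usually sits at the front of the filter list.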
Built-in analyzers (don’t reinvent the wheel)
ES ships with several ready-made analyzers:
- standard (default): standard tokenizer + lowercase. Fine for most use cases.
- simple: splits on non-letters + lowercase. Digits are not preserved.
- whitespace: just splits on whitespace, no lowercasing.
- keyword: treats the whole input as one token (similar to a keyword field).
- english (and other language analyzers): adds stemming, stopwords, possessives.
Testing analyzers with _analyze
This is the killer debugging tool. Run any analyzer against any text:
POST /_analyze
{
  "analyzer": "english",
  "text": "The running headphones are amazing"
}
Response (abridged; each token in the real response also carries start_offset, end_offset and type):
{
  "tokens": [
    { "token": "run", "position": 1 },
    { "token": "headphon", "position": 2 },
    { "token": "amaz", "position": 4 }
  ]
}
Notice: “The” and “are” dropped (stopwords), “running” stemmed to “run”. This is why match: "run" finds documents containing “running” with the english analyzer.
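The positions in that response jump from 2 to 4 and start at 1 because removed stopwords still occupy their slots. A small Python sketch of that bookkeeping, assuming a two-word stopword list and no stemming (so the tokens stay unstemmed, unlike the real english analyzer):

```python
import re

STOPWORDS = {"the", "are"}  # just enough for this sentence


def analyze_with_positions(text):
    """Lowercase, drop stopwords, but keep each word's original position."""
    tokens = []
    for position, word in enumerate(re.findall(r"\w+", text)):
        term = word.lower()
        if term in STOPWORDS:
            continue  # the slot is skipped, not renumbered
        tokens.append({"token": term, "position": position})
    return tokens


result = analyze_with_positions("The running headphones are amazing")
for entry in result:
    print(entry)
```

The gaps are what let phrase queries know that words were originally separated by something.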
Defining a custom analyzer
PUT /products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "ampersand_to_and": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "analyzer": {
        "product_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "ampersand_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "stop", "english_stemmer"]
        }
      },
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "product_analyzer" }
    }
  }
}
Char filters run in the listed order. html_strip goes first so that entities such as &amp; are decoded to & before ampersand_to_and sees them.
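Once the index exists, the custom analyzer can be checked by name through the index-scoped _analyze endpoint. Here is a minimal Python sketch that builds such a request with the standard library. The base URL is an assumption (a local, unsecured node), the helper name `build_analyze_request` is made up, and the sample text is only an illustration. The request is constructed but not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, no auth


def build_analyze_request(index, analyzer, text):
    """Build (but do not send) POST /<index>/_analyze for a named analyzer."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request("products", "product_analyzer", "<b>Café</b> & Bar")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# With a running cluster, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```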
Index-time vs search-time analyzers
By default, the same analyzer runs at both. You CAN specify different ones:
"title": {
  "type": "text",
  "analyzer": "product_analyzer",   // at index time
  "search_analyzer": "standard"     // at query time
}
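When does this matter? Autocomplete. Index with an edge_ngram analyzer, so “sony” is stored as “s”, “so”, “son”, “sony”, and search with standard, so the query “son” stays the single token “son”. Otherwise the search side would explode the query into prefixes too, and a query would match any document that merely shares its first letters.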
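The effect is easy to reproduce with a toy model. In the sketch below, a document matches when the query and the index share at least one term, which is a simplification of how a match query behaves by default. The function names and the `max_gram` value are illustrative assumptions.

```python
def edge_ngrams(word, min_gram=1, max_gram=10):
    """Prefixes of a word, from min_gram to max_gram characters."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]


def index_terms(text):
    """Index-time analysis: lowercase, then edge n-grams of every word."""
    terms = set()
    for word in text.lower().split():
        terms.update(edge_ngrams(word))
    return terms


def search_plain(query):
    """Search-time analysis without n-grams: whole lowercase words."""
    return set(query.lower().split())


def search_ngram(query):
    """What happens if the n-gram analyzer also runs on the query."""
    return index_terms(query)


indexed = index_terms("Sony")

print(search_plain("son") & indexed)    # prefix typed so far: matches
print(search_plain("sonic") & indexed)  # different word: no shared term
print(search_ngram("sonic") & indexed)  # n-grams on both sides: false match
```

With n-grams on both sides, “sonic” and “sony” overlap on their shared prefixes, so the unrelated query would still hit.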
The golden rule
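If your search returns nothing, run _analyze on both the indexed text AND the query string, each with the analyzer that side actually uses. The query tokens must appear among the indexed tokens. A large share of “ES search isn’t working” issues turn out to be analyzer mismatches.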
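That comparison can be automated once both _analyze responses are in hand. The following sketch takes two response dictionaries in the shape shown earlier and lists the query tokens that have no counterpart in the index. The helper names are made up, and the two sample responses are hand-written for illustration rather than captured from a cluster.

```python
def terms(analyze_response):
    """Pull the token strings out of an _analyze-style response."""
    return [entry["token"] for entry in analyze_response["tokens"]]


def unmatched_query_terms(index_response, query_response):
    """Query tokens that do not appear among the indexed tokens."""
    indexed = set(terms(index_response))
    return [t for t in terms(query_response) if t not in indexed]


# Indexed text analyzed with english (tokens taken from the example above).
index_side = {
    "tokens": [
        {"token": "run", "position": 1},
        {"token": "headphon", "position": 2},
        {"token": "amaz", "position": 4},
    ]
}
# Query "running" analyzed with standard, which does not stem.
query_side = {"tokens": [{"token": "running", "position": 0}]}

missing = unmatched_query_terms(index_side, query_side)
print(missing)  # a non-empty list points at an analyzer mismatch
```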