Document Structure - Elasticsearch

When we get a document back from ES, it’s not just our JSON — it’s wrapped in metadata. Knowing what each field means saves a lot of confusion.

Here’s a real response:

{
  "_index": "products",
  "_id": "abc123",
  "_version": 3,
  "_seq_no": 42,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "title": "Sony WH-1000XM5",
    "price": 399,
    "category": "audio"
  }
}

Let’s break it down.

The metadata fields (prefixed with `_`)

Field	What it is
_index	Which index this doc lives in
_id	Unique identifier within the index
_source	The actual JSON we sent in
_version	Counter that increments on every write to the doc
_seq_no	Per-shard sequence number, used for safer concurrency control
_primary_term	Counter that bumps when a new primary is elected
_score	Relevance score (only on search results)
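To make the split between metadata and payload concrete, here is a minimal Python sketch that parses the response shown earlier. The helper name `split_response` and the `METADATA_KEYS` set are our own for illustration, not part of any Elasticsearch client.

```python
import json

# Metadata keys taken from the table above ("found" is a status flag, not metadata).
METADATA_KEYS = {"_index", "_id", "_version", "_seq_no", "_primary_term", "_score"}


def split_response(raw: str):
    """Return (metadata, source) from a raw GET response body.

    source is None when the document was not found.
    """
    body = json.loads(raw)
    metadata = {k: v for k, v in body.items() if k in METADATA_KEYS}
    source = body.get("_source") if body.get("found") else None
    return metadata, source


raw_response = """
{
  "_index": "products",
  "_id": "abc123",
  "_version": 3,
  "_seq_no": 42,
  "_primary_term": 1,
  "found": true,
  "_source": {"title": "Sony WH-1000XM5", "price": 399, "category": "audio"}
}
"""

metadata, source = split_response(raw_response)
print(metadata["_id"], metadata["_seq_no"])  # abc123 42
print(source["title"])                       # Sony WH-1000XM5
```

Application code usually only cares about `source`; the metadata matters when we need to write the doc back safely.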

`_id` — the document ID

We can provide our own (PUT /products/_doc/abc123) or let ES auto-generate one (POST /products/_doc). Auto-generated IDs are URL-safe Base64 strings like Z6X3kYwBq8....

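The _id must be unique within the index. By default it is also the routing value, so ES picks the shard with hash(_id) % number_of_primary_shards. If we supply a custom routing value, that value replaces _id in the formula.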
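The routing formula above can be sketched in a few lines of Python. This is a toy model: `pick_shard` is a hypothetical helper, and `zlib.crc32` is only a stand-in hash chosen because it ships with the standard library. It is not the hash function ES uses internally.

```python
import zlib


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Toy version of hash(routing) % shards.

    zlib.crc32 is a stand-in hash for illustration only; it is not the
    hash function Elasticsearch uses internally.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# The same ID always lands on the same shard for a fixed shard count.
first = pick_shard("abc123", 5)
second = pick_shard("abc123", 5)
print(first == second)  # True

# Spread a batch of hypothetical IDs over 5 shards.
ids = [f"doc-{i}" for i in range(1000)]
counts = [0] * 5
for doc_id in ids:
    counts[pick_shard(doc_id, 5)] += 1
print(counts)

# Changing the shard count changes where many IDs would land.
moved = sum(1 for d in ids if pick_shard(d, 5) != pick_shard(d, 6))
print(f"{moved} of {len(ids)} IDs map to a different shard with 6 shards")
```

The last loop hints at why the number of primary shards is fixed when an index is created: change the divisor and existing docs would no longer be found on the shard the formula points to.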

`_source` — the field that matters most

This is our original JSON, stored verbatim. ES keeps it by default so we can retrieve the doc exactly as we sent it. You can disable _source to save disk, but then you can’t:

Reindex into a new index or mapping
Use the update or update-by-query APIs
Use on-the-fly highlighting

In plain terms: don’t disable _source unless you really know what you’re doing.

`_version` and concurrency control

Every write bumps _version and also gives the doc a new _seq_no. To prevent lost updates, we make a write conditional on the _seq_no and _primary_term we last read:

PUT /products/_doc/abc123?if_seq_no=42&if_primary_term=1
{
  "title": "Sony WH-1000XM5 — Updated"
}

If the doc was modified by someone else in the meantime (so its _seq_no no longer matches), the request fails with a 409 version conflict. That’s optimistic concurrency control, like SQL’s WHERE version = ?.

Note: ES used to use _version for this directly. The modern way is _seq_no + _primary_term because it’s safer across primary failovers.
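To see the conditional write in action without a cluster, here is a small in-memory sketch. `ToyIndex` and `VersionConflict` are hypothetical names for illustration. The class models a single shard and only mimics the if_seq_no / if_primary_term check; it is not a real Elasticsearch client.

```python
class VersionConflict(Exception):
    """Raised when the caller's seq_no/primary_term are stale."""


class ToyIndex:
    """In-memory model of one shard; not a real Elasticsearch client."""

    def __init__(self):
        self.docs = {}
        self.next_seq_no = 0
        self.primary_term = 1

    def put(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self.docs.get(doc_id)
        # Conditional write: reject unless the caller saw the latest state.
        if if_seq_no is not None:
            if (current is None
                    or current["_seq_no"] != if_seq_no
                    or current["_primary_term"] != if_primary_term):
                raise VersionConflict(doc_id)
        version = current["_version"] + 1 if current else 1
        self.docs[doc_id] = {
            "_version": version,
            "_seq_no": self.next_seq_no,
            "_primary_term": self.primary_term,
            "_source": source,
        }
        self.next_seq_no += 1
        return self.docs[doc_id]


index = ToyIndex()
created = index.put("abc123", {"title": "Sony WH-1000XM5"})

# Two writers read the same state, then both try a conditional write.
seen_seq_no, seen_term = created["_seq_no"], created["_primary_term"]
index.put("abc123", {"title": "Writer A"}, seen_seq_no, seen_term)

try:
    index.put("abc123", {"title": "Writer B"}, seen_seq_no, seen_term)
    outcome = "written"
except VersionConflict:
    outcome = "conflict"

print(outcome)  # conflict
```

Writer B loses because Writer A’s write already moved the doc to a new _seq_no. In a real application, the usual response is to re-read the doc, reapply the change, and retry with the fresh values.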

`_type` — the ghost of versions past

You might see old tutorials with URLs like /products/product/abc123. That product was the type, a sub-grouping within an index (think tables within a database).

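Types are dead. Indices created in 6.x were limited to a single type, types in request URLs were deprecated in 7.x, and they were removed in 8.x. Now every index holds one kind of document, addressed through the _doc endpoint: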

# Old (don't do this)
PUT /products/product/abc123

# New
PUT /products/_doc/abc123

Why did they kill it? Lucene stores all fields from all types in the same underlying index, so a field such as name shared by two types had to have the same data type in both. Types looked like independent tables but were not, and it was easier to just say “one index, one schema.”
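The shared-namespace problem can be shown with a toy model. The function `merge_type_mappings` and the type names below are hypothetical. This is not how ES is implemented, only a sketch of the constraint that one underlying index allows one data type per field name.

```python
def merge_type_mappings(types: dict) -> dict:
    """Merge per-type field mappings into one shared field namespace.

    Toy model of the constraint described above, not Elasticsearch code.
    """
    merged = {}
    for type_name, fields in types.items():
        for field, data_type in fields.items():
            if field in merged and merged[field] != data_type:
                raise ValueError(
                    f"field '{field}' is {merged[field]} elsewhere "
                    f"but {data_type} in type '{type_name}'"
                )
            merged[field] = data_type
    return merged


# Hypothetical types that happen to agree on shared field names.
ok = merge_type_mappings({
    "product": {"name": "text", "price": "integer"},
    "review": {"name": "text", "stars": "integer"},
})
print(ok)

# Same field name, different data types: the merge cannot succeed.
try:
    merge_type_mappings({
        "product": {"name": "text"},
        "user": {"name": "keyword"},
    })
    conflict = None
except ValueError as err:
    conflict = str(err)
print(conflict)
```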

Putting it together

# Index a doc with our own ID
curl -X PUT "localhost:9200/products/_doc/sony-xm5" -H "Content-Type: application/json" -d '
{
  "title": "Sony WH-1000XM5",
  "price": 399
}'

# Get it back
curl "localhost:9200/products/_doc/sony-xm5"

{
  "_index": "products",
  "_id": "sony-xm5",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": { "title": "Sony WH-1000XM5", "price": 399 }
}

When you see found: true and your data in _source, you’ve got it. Everything else is plumbing.
