Bucket Aggregations


Aggregations are how ES does analytics. They come in three flavors — bucket (group docs), metric (compute numbers), and pipeline (operate on other aggs). This note covers bucket aggs.

In simple language — bucket aggs are like SQL’s GROUP BY. They split docs into groups based on some criterion, and we can then run metrics on each group.

Terms aggregation — the workhorse

Group docs by the unique values of a field. Like GROUP BY category.

GET /products/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category.keyword",
        "size": 10
      }
    }
  }
}

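size: 0 at the top means “don’t return docs, just aggregations”, which saves bandwidth. The agg result looks like this (trimmed; the real response also carries took, hits, and the terms-level doc_count_error_upper_bound and sum_other_doc_count fields):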

{
  "aggregations": {
    "by_category": {
      "buckets": [
        { "key": "laptops", "doc_count": 142 },
        { "key": "phones",  "doc_count": 98  },
        { "key": "tablets", "doc_count": 47  }
      ]
    }
  }
}
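
To make the GROUP BY analogy concrete, here is a rough Python sketch of what a terms agg computes on a single node. The terms_agg helper and the sample docs are made up for illustration; this is not how ES implements it internally.

```python
from collections import Counter

def terms_agg(docs, field, size=10):
    """Count docs per unique value of `field` and return the top `size` buckets."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    # Highest doc_count first; ties broken by key so the output is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": n} for key, n in ordered[:size]]

# Hypothetical sample documents.
docs = [
    {"category": "laptops"}, {"category": "phones"}, {"category": "laptops"},
    {"category": "tablets"}, {"category": "phones"}, {"category": "laptops"},
]
buckets = terms_agg(docs, "category", size=2)
print(buckets)
```

With size=2 the tablets bucket is dropped, the same way size: 10 drops everything past the tenth bucket.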

The size + accuracy gotcha

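size: 10 returns the top 10 buckets. But ES is distributed: each shard computes its own top terms, then the coordinating node merges them. By default each shard returns shard_size = size * 1.5 + 10 terms (25 here), not just 10. Even so, a term that is common overall but never ranks high on any single shard can be undercounted or missed, so the global top-10 and its doc_counts can be off for skewed data. The doc_count_error_upper_bound field in the response reports the worst-case error.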

To improve accuracy at a cost, bump shard_size:

{
  "terms": {
    "field": "category.keyword",
    "size": 10,
    "shard_size": 100
  }
}

Each shard returns top 100, we keep top 10. Trade more network/CPU for accuracy.
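
A toy simulation shows why per-shard truncation hurts and why a bigger shard_size helps. This is a simplified model (two hand-built shards, a hypothetical merged_terms helper), not the real coordinator logic.

```python
from collections import Counter

def top(counter, n):
    """Top n (key, count) pairs, highest count first, ties broken by key."""
    return sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def merged_terms(shards, size, shard_size):
    """Each shard reports only its top `shard_size` terms; the coordinator sums them."""
    merged = Counter()
    for shard in shards:
        for key, count in top(shard, shard_size):
            merged[key] += count
    return top(merged, size)

# "b" is the true winner (18 docs) but ranks second on every shard.
shards = [Counter({"a": 10, "b": 9}), Counter({"c": 10, "b": 9})]

print(merged_terms(shards, size=1, shard_size=1))  # [('a', 10)]  wrong winner
print(merged_terms(shards, size=1, shard_size=2))  # [('b', 18)]  correct
```

With shard_size=1 neither shard ever reports "b", so the coordinator cannot know it exists. Letting each shard send one more term fixes the result.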

Date histogram — time-series bucketing

Group docs by time buckets. The bread and butter of dashboards.

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "orders_per_day": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "day"
      }
    }
  }
}

Calendar intervals — minute, hour, day, week, month, quarter, year. These respect calendar boundaries (e.g., months have variable lengths).

For fixed intervals (always the same number of milliseconds), use fixed_interval:

{ "date_histogram": { "field": "created_at", "fixed_interval": "30m" } }

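Use fixed_interval for sub-day buckets (15m, 30m, 1h) and calendar_interval for day/week/month. calendar_interval accepts only a single unit (day or 1d, not 2d); multiples such as 7d need fixed_interval, which takes ms, s, m, h, and d.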
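
The difference between the two is easiest to see as rounding rules. The sketch below uses hypothetical helper names, works in UTC only, and ignores time zones and offsets; it floors a timestamp to a fixed-width bucket and to a calendar-month bucket.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def fixed_bucket(ts, interval):
    """Fixed interval: every bucket has the same width, counted from the epoch."""
    return EPOCH + ((ts - EPOCH) // interval) * interval

def calendar_month_bucket(ts):
    """Calendar interval: snap to the start of the month, whatever its length."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

ts = datetime(2024, 2, 29, 13, 47, tzinfo=timezone.utc)
print(fixed_bucket(ts, timedelta(minutes=30)))  # 2024-02-29 13:30:00+00:00
print(calendar_month_bucket(ts))                # 2024-02-01 00:00:00+00:00
```

A month bucket cannot be written as a fixed number of milliseconds, which is why it only exists as a calendar interval.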

Range aggregation — custom numeric buckets

GET /products/_search
{
  "size": 0,
  "aggs": {
    "price_brackets": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 100 },
          { "from": 100, "to": 500 },
          { "from": 500, "to": 1000 },
          { "from": 1000 }
        ]
      }
    }
  }
}

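Result groups: under $100, $100 up to (but not including) $500, $500 up to $1000, and $1000 or more. from is inclusive and to is exclusive, so a $500 product lands in the third bucket, not the second. Perfect for e-commerce price filters.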
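
The boundary behavior is worth checking by hand. Here is a small sketch of the bucketing rule, assuming from is inclusive and to is exclusive as in the ES range agg; range_agg and the price list are illustrative only.

```python
def range_agg(values, ranges):
    """Count values per range; `from` is inclusive, `to` is exclusive."""
    buckets = []
    for r in ranges:
        lo, hi = r.get("from"), r.get("to")
        count = sum(
            1 for v in values
            if (lo is None or v >= lo) and (hi is None or v < hi)
        )
        buckets.append({"from": lo, "to": hi, "doc_count": count})
    return buckets

# Hypothetical prices sitting right on the boundaries.
prices = [99.99, 100, 499, 500, 1000, 2500]
ranges = [{"to": 100}, {"from": 100, "to": 500}, {"from": 500, "to": 1000}, {"from": 1000}]

counts = [b["doc_count"] for b in range_agg(prices, ranges)]
print(counts)  # [1, 2, 1, 2]
```

Exactly 100 goes to the second bucket and exactly 1000 to the last one, so every price ends up in exactly one bracket.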

Filters aggregation — named arbitrary buckets

When buckets don’t follow a single rule, define them as named filters:

GET /logs/_search
{
  "size": 0,
  "aggs": {
    "by_status": {
      "filters": {
        "filters": {
          "errors":   { "range": { "status_code": { "gte": 500 } } },
          "warnings": { "range": { "status_code": { "gte": 400, "lt": 500 } } },
          "success":  { "range": { "status_code": { "gte": 200, "lt": 300 } } }
        }
      }
    }
  }
}

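This gives us 3 named buckets (errors, warnings, success), each defined by its own filter. More flexible than range/terms because each filter can be any query, including queries on different fields. Docs that match no filter (a 301 here) are not counted anywhere, and a doc that matches several filters is counted in each.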
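
In plain Python terms, a filters agg is just a dict of named predicates, each counted independently. The sketch below mirrors the three filters above; filters_agg and the sample status codes are assumptions for illustration.

```python
# Named predicates mirroring the errors / warnings / success filters.
named_filters = {
    "errors": lambda d: d["status_code"] >= 500,
    "warnings": lambda d: 400 <= d["status_code"] < 500,
    "success": lambda d: 200 <= d["status_code"] < 300,
}

def filters_agg(docs, filters):
    """Evaluate every filter against every doc; buckets are independent of each other."""
    return {name: sum(1 for d in docs if pred(d)) for name, pred in filters.items()}

# Hypothetical log docs; the 301 matches none of the filters.
logs = [{"status_code": code} for code in (200, 201, 301, 404, 500, 503)]
result = filters_agg(logs, named_filters)
print(result)  # {'errors': 2, 'warnings': 1, 'success': 2}
```

The counts add up to 5 for 6 docs because the 301 falls through. In ES, the filters agg has an other_bucket option for collecting such leftovers.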

Histogram — numeric bucketing

Like date_histogram but for numbers. Useful for distribution charts.

{
  "aggs": {
    "rating_distribution": {
      "histogram": {
        "field": "rating",
        "interval": 1
      }
    }
  }
}

Buckets at intervals of 1: 1.0, 2.0, 3.0, 4.0, 5.0. Each value is rounded down to its bucket key, so a 4.5 rating lands in the 4.0 bucket. Plot it as a bar chart and we have a star-rating histogram.
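
The rounding rule can be sketched as floor((value - offset) / interval) * interval + offset. The helper below is a simplified illustration with made-up ratings; it returns only non-empty buckets.

```python
import math
from collections import Counter

def histogram_agg(values, interval, offset=0.0):
    """Round each value down to its bucket key and count values per key."""
    keys = (math.floor((v - offset) / interval) * interval + offset for v in values)
    return sorted(Counter(keys).items())

# Hypothetical ratings, including fractional ones.
ratings = [1.0, 2.5, 3.0, 3.9, 4.5, 5.0]
distribution = histogram_agg(ratings, interval=1)
print(distribution)  # [(1.0, 1), (2.0, 1), (3.0, 2), (4.0, 1), (5.0, 1)]
```

Both 3.0 and 3.9 share the 3.0 bucket, while 5.0 gets a bucket of its own.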

Visualizing the structure

Documents → Buckets

[doc1, doc2, doc3, doc4, doc5, doc6, doc7, doc8]
        ↓ terms agg on "category"
laptops: doc1, doc4, doc7   (doc_count: 3)
phones:  doc2, doc5, doc8   (doc_count: 3)
tablets: doc3, doc6         (doc_count: 2)

Combining with queries

Aggs run on the query result set. So:

GET /orders/_search
{
  "size": 0,
  "query": {
    "range": { "created_at": { "gte": "now-30d/d" } }
  },
  "aggs": {
    "orders_per_day": {
      "date_histogram": { "field": "created_at", "calendar_interval": "day" }
    }
  }
}

This gives us “orders per day, last 30 days”. The query filters first, the agg buckets the survivors.
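
The same two-step flow in plain Python, as a rough sketch: filter first, then bucket what is left. The orders_per_day helper, the fixed now value, and the sample orders are assumptions; the cutoff imitates now-30d/d by subtracting 30 days and rounding down to the start of the day (UTC only).

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def orders_per_day(orders, now, days=30):
    """Query step: keep recent orders. Agg step: count survivors per calendar day."""
    cutoff = (now - timedelta(days=days)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    survivors = [o for o in orders if o["created_at"] >= cutoff]
    return Counter(o["created_at"].date() for o in survivors)

utc = timezone.utc
now = datetime(2024, 3, 31, 12, 0, tzinfo=utc)  # fixed "now" for a repeatable example
orders = [
    {"created_at": datetime(2024, 2, 29, 23, 59, tzinfo=utc)},  # before the cutoff
    {"created_at": datetime(2024, 3, 1, 0, 0, tzinfo=utc)},     # exactly at the cutoff
    {"created_at": datetime(2024, 3, 30, 8, 0, tzinfo=utc)},
    {"created_at": datetime(2024, 3, 30, 19, 0, tzinfo=utc)},
]
per_day = orders_per_day(orders, now)
print(dict(per_day))
```

The February order never reaches the bucketing step, so it cannot show up in any bucket.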

Quick rules

  • size: 0 saves bandwidth when we only want aggs.
  • terms on a text field requires a .keyword subfield (or fielddata: true on the text field, which is memory-heavy).
  • Top-N from terms is approximate across shards. Increase shard_size if accuracy matters.
  • date_histogram for time, histogram for numbers, range for custom numeric brackets, filters for arbitrary named buckets.
  • Aggs operate on the queried subset — combine query + agg for “stats about my filtered data”.