Pipeline & Nested Aggregations

advanced elasticsearch aggregations pipeline nested

These are the two “advanced” aggregation topics that show up in senior interviews — pipeline aggs (computing on top of other aggs) and nested aggs (handling the nested field type).

Pipeline aggregations — “aggs of aggs”

In simple language — pipeline aggs don’t look at documents. They look at the output of other aggs and compute new values from it. Think SQL window functions.

Common pipeline aggs — avg_bucket, sum_bucket, max_bucket, min_bucket, derivative, moving_avg (deprecated → moving_fn), cumulative_sum, bucket_selector.

Example — daily revenue + moving average

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "daily": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "day"
      },
      "aggs": {
        "revenue": {
          "sum": { "field": "amount" }
        },
        "7d_moving_avg": {
          "moving_fn": {
            "buckets_path": "revenue",
            "window": 7,
            "script": "MovingFunctions.unweightedAvg(values)"
          }
        }
      }
    }
  }
}

Two nested levels here:

  1. daily — date_histogram, one bucket per day.
  2. Inside each day — revenue (a metric) + 7d_moving_avg (a pipeline agg that looks at the last 7 revenue values).

The magic is buckets_path: "revenue" — that’s how pipeline aggs reference other aggs.

Cumulative sum — running totals

{
  "aggs": {
    "daily": {
      "date_histogram": { "field": "created_at", "calendar_interval": "day" },
      "aggs": {
        "revenue": { "sum": { "field": "amount" } },
        "cumulative": {
          "cumulative_sum": { "buckets_path": "revenue" }
        }
      }
    }
  }
}

Each day’s cumulative bucket = sum of all previous days’ revenue. Classic for “total sales to date” charts.

Stats across buckets — avg_bucket, max_bucket

“What’s the average daily revenue?” — that’s avg_bucket:

{
  "aggs": {
    "daily": {
      "date_histogram": { "field": "created_at", "calendar_interval": "day" },
      "aggs": {
        "revenue": { "sum": { "field": "amount" } }
      }
    },
    "avg_daily_revenue": {
      "avg_bucket": { "buckets_path": "daily>revenue" }
    }
  }
}

Notice daily>revenue — the > is “into the sub-agg”. This is sibling position — avg_daily_revenue is a sibling of daily, not a child.

bucket_selector — filtering buckets

Drop buckets that don’t meet a condition.

{
  "aggs": {
    "by_category": {
      "terms": { "field": "category.keyword", "size": 50 },
      "aggs": {
        "revenue": { "sum": { "field": "amount" } },
        "min_revenue_filter": {
          "bucket_selector": {
            "buckets_path": { "rev": "revenue" },
            "script": "params.rev > 10000"
          }
        }
      }
    }
  }
}

This keeps only categories with > $10k revenue. The bucket_selector doesn’t compute a value — it just decides whether each bucket is kept or dropped.

Pipeline agg positions
PARENT (inside the bucket)
moving_fn, cumulative_sum, derivative
Works across consecutive buckets — needs ordered data (e.g., date_histogram)
SIBLING (next to the bucket)
avg_bucket, sum_bucket, max_bucket, min_bucket, stats_bucket
Computes one number from all buckets — no ordering needed

Nested aggregations — the nested field type

Different concept entirely. ES flattens arrays of objects by default, which causes a famous bug — fields in the same object lose their relationship.

Say we index:

{
  "name": "MacBook Pro",
  "reviews": [
    { "author": "alice", "rating": 5 },
    { "author": "bob",   "rating": 1 }
  ]
}

Without nested mapping, ES stores this as reviews.author: ["alice", "bob"] and reviews.rating: [5, 1]. A query like “author is alice AND rating is 1” would match this doc (because alice exists AND rating=1 exists), even though they’re from different objects.

The fix — declare reviews as nested in the mapping. Then each review object is indexed as a hidden child document, preserving relationships.

PUT /products
{
  "mappings": {
    "properties": {
      "reviews": {
        "type": "nested",
        "properties": {
          "author": { "type": "keyword" },
          "rating": { "type": "integer" }
        }
      }
    }
  }
}

Aggregating nested fields

To aggregate over nested objects, we need the nested aggregation as an intermediate step:

GET /products/_search
{
  "size": 0,
  "aggs": {
    "reviews_agg": {
      "nested": { "path": "reviews" },
      "aggs": {
        "avg_rating": {
          "avg": { "field": "reviews.rating" }
        },
        "by_author": {
          "terms": { "field": "reviews.author" }
        }
      }
    }
  }
}

The nested agg “enters” the nested context. Now reviews.rating and reviews.author work correctly — relationships preserved.

reverse_nested — climbing back up

What if inside a nested agg we want to count parent docs, not child reviews?

{
  "aggs": {
    "reviews_agg": {
      "nested": { "path": "reviews" },
      "aggs": {
        "by_rating": {
          "terms": { "field": "reviews.rating" },
          "aggs": {
            "parent_products": {
              "reverse_nested": {}
            }
          }
        }
      }
    }
  }
}

For each rating bucket, reverse_nested tells us how many products (parent docs) contain a review at that rating — not how many review objects.

Quick rules

  • Pipeline aggs reference others via buckets_pathname for direct, name>sub for nested.
  • moving_fn, cumulative_sum, derivative are parent pipelines (live inside the bucket).
  • avg_bucket, max_bucket, etc. are sibling pipelines (next to the bucket).
  • bucket_selector filters buckets post-aggregation — like SQL HAVING.
  • For arrays of objects with field relationships that matter — use nested mapping + nested agg.
  • reverse_nested lets us count parent docs from inside a nested agg.