Resource Management and Scaling

If we don’t tell Kubernetes how much CPU and memory our Pods need, it’s flying blind: the scheduler can’t tell which node has room, a single Pod can hog a node’s resources and starve other workloads, and Pods can be evicted or OOM-killed with little warning. Resource management is how we keep things predictable.

Requests vs Limits

Every container can have two resource settings:

  • Requests: the minimum guaranteed resources. The scheduler uses these values to decide which node has enough room.
  • Limits: the maximum a container can use. It’s the ceiling.

spec:
  containers:
    - name: app
      image: my-app:latest
      resources:
        requests:
          cpu: "250m"           # 250 millicores = 0.25 CPU
          memory: "128Mi"       # 128 mebibytes
        limits:
          cpu: "500m"           # can burst up to 0.5 CPU
          memory: "256Mi"       # hard ceiling

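A quick note on units: 1 CPU = 1 vCPU/core, so 250m (millicores) = 0.25 cores. Memory uses binary suffixes such as Mi (mebibytes, 2^20 bytes) or Gi (gibibytes, 2^30 bytes). These are not the same as the decimal M and G, so 128Mi is slightly more than 128M.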

What Happens When Limits Are Exceeded

This is a common interview question, and the answer is different for CPU vs memory:

  • CPU limit exceeded: the container gets throttled. It won’t crash, but it’ll run slower, because the kernel won’t give it more CPU time than the limit allows.
  • Memory limit exceeded: the container gets OOMKilled (Out Of Memory Killed). The kernel’s OOM killer terminates the process immediately (exit code 137), and the kubelet then restarts the container according to the Pod’s restartPolicy. This is harsh but necessary to protect the node.

CPU vs Memory: Over-Limit Behavior

  Resource   What happens              Effect
  CPU        Throttled (slowed down)   Pod stays alive, just slower
  Memory     OOMKilled (terminated)    Container is killed and restarted
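
To confirm that a restart was caused by the memory limit, we can look at the container's last state in the Pod status. The fragment below is a trimmed, illustrative sketch of the shape that `kubectl get pod <pod-name> -o yaml` returns. The container name comes from the example above, and the restart count is made up.

```yaml
# Illustrative excerpt of a Pod's status after an OOM kill (not real output).
status:
  containerStatuses:
    - name: app
      restartCount: 1            # example value
      lastState:
        terminated:
          reason: OOMKilled      # the memory limit was hit
          exitCode: 137          # 128 + 9 (SIGKILL)
```

There is no equivalent marker for CPU throttling in the Pod status, because a throttled container never terminates.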

QoS Classes

Kubernetes assigns a Quality of Service class to every Pod based on its resource settings. When a node runs out of memory, K8s uses QoS to decide which Pods to evict first.

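  • Guaranteed: every container sets both CPU and memory requests and limits, and each request equals its limit. Last to be evicted. Use this for critical workloads.
  • Burstable: at least one container sets a request or limit, but the Pod doesn’t meet the Guaranteed criteria (for example, requests are lower than limits, or limits aren’t set). Evicted after BestEffort.
  • BestEffort: no container sets any requests or limits. First to be evicted. Avoid this in production.
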
# Guaranteed — requests == limits
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"

# Burstable — requests < limits
resources:
  requests:
    cpu: "250m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"
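
The class is decided per Pod, not per container, which is easy to miss with sidecars. Here is a minimal sketch, assuming a hypothetical two-container Pod (the names `qos-demo` and `my-sidecar` are made up for illustration):

```yaml
# "app" on its own would qualify as Guaranteed, but "sidecar" sets only
# requests, so the Pod as a whole is classified as Burstable.
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo                  # illustrative name
spec:
  containers:
    - name: app
      image: my-app:latest
      resources:
        requests:
          cpu: "500m"
          memory: "256Mi"
        limits:
          cpu: "500m"
          memory: "256Mi"
    - name: sidecar
      image: my-sidecar:latest    # hypothetical image
      resources:
        requests:                 # no limits here, so the Pod is not Guaranteed
          cpu: "100m"
          memory: "64Mi"
```

Kubernetes records the result in the Pod's `status.qosClass` field, so we can check which class a running Pod actually received.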

LimitRanges and ResourceQuotas

Cluster admins use these to enforce guardrails.

LimitRange — sets default and max/min resources per container in a namespace. If a developer forgets to set requests, the LimitRange fills in defaults.

ResourceQuota — sets total resource caps per namespace. For example, “the dev namespace can’t use more than 20 CPU cores and 64Gi memory total.”

apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "64Gi"
    limits.cpu: "40"
    limits.memory: "128Gi"
    pods: "50"                  # max 50 Pods in this namespace
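
For comparison, a LimitRange for the same namespace might look like the sketch below. The name `dev-limits` and all the numbers are illustrative assumptions, not recommendations. The two pair well: once a quota covers requests and limits, Pods that omit those fields are rejected, and the LimitRange defaults keep that from happening.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: dev-limits              # hypothetical name
  namespace: dev
spec:
  limits:
    - type: Container
      defaultRequest:           # filled in when a container sets no requests
        cpu: "250m"
        memory: "128Mi"
      default:                  # filled in when a container sets no limits
        cpu: "500m"
        memory: "256Mi"
      min:                      # smallest values a container may ask for
        cpu: "100m"
        memory: "64Mi"
      max:                      # largest values a container may ask for
        cpu: "2"
        memory: "1Gi"
```

Defaults are applied when a Pod is created, so Pods that were already running before the LimitRange existed are not changed.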

Horizontal Pod Autoscaler (HPA)

HPA automatically scales the number of Pod replicas based on metrics like CPU or memory usage. This is the most common autoscaling approach.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # target: 70% of requested CPU, averaged across Pods

# Quick way to create an HPA
kubectl autoscale deployment my-app --min=2 --max=10 --cpu-percent=70

HPA checks metrics every 15 seconds by default. It scales up quickly but scales down slowly (5-minute stabilization window) to avoid flapping.
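
The scale-up and scale-down timing can be tuned through the `behavior` field of an `autoscaling/v2` HPA. The sketch below repeats the earlier manifest and spells out the default windows. The scale-down policy (one Pod per minute) is an illustrative assumption, not a default.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0      # react to load right away (the default)
    scaleDown:
      stabilizationWindowSeconds: 300    # the default 5-minute window, made explicit
      policies:
        - type: Pods
          value: 1                       # illustrative: remove at most 1 Pod...
          periodSeconds: 60              # ...per 60 seconds
```

As a rough guide to what HPA computes on each check: desired replicas = ceil(current replicas × current utilization / target utilization). With 4 replicas averaging 90% CPU against a 70% target, that gives ceil(4 × 90 / 70) = 6 replicas.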

Vertical Pod Autoscaler (VPA)

Instead of adding more Pods, VPA adjusts the CPU and memory requests of existing Pods. Useful when we don’t know the right resource values upfront — VPA watches actual usage and recommends (or automatically applies) better values.

The catch: in its standard update modes, VPA has to evict and recreate Pods to apply new resource values, so it’s often used in “recommend-only” mode (updateMode: "Off"), where it suggests values and we apply them ourselves.
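
A recommend-only setup could look like the sketch below. It assumes the VPA add-on and its custom resource definitions are installed in the cluster, since VPA is not part of core Kubernetes. The name `my-app-vpa` is made up for illustration.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa              # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"           # only compute recommendations, never evict Pods
```

In this mode the suggested values show up under the object's `status.recommendation`, and we copy them into the Deployment's requests when we are ready to roll out.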

Cluster Autoscaler

Operates at the infrastructure level. When Pods can’t be scheduled because there aren’t enough nodes, the Cluster Autoscaler adds more nodes to the cluster. When nodes are underutilized, it removes them.

  • HPA scales Pods (horizontal)
  • VPA right-sizes Pods (vertical)
  • Cluster Autoscaler scales nodes (infrastructure)

They work together: HPA creates more Pods → Pods become unschedulable → Cluster Autoscaler adds nodes.

In simple language, requests tell the scheduler what we need, limits protect the node from greedy containers, and autoscalers keep everything right-sized based on actual traffic. Always set requests and limits in production — a Pod without them is a ticking time bomb.