
Resource Requests, Limits, QoS Classes, and the HPA

What This Concept Is

Every container in a Pod can declare two numbers per resource (cpu, memory):

  • requests -- the amount the scheduler reserves on a node for this container. Used for placement and, for CPU, as a cgroup cpu.weight baseline.
  • limits -- the maximum the container is allowed to use. Enforced by the kernel's cgroup: cpu.max (throttling) and memory.max (OOMKill when exceeded).

From the requests/limits pairs across a Pod's containers, Kubernetes computes the Pod's QoS class, which drives eviction order when a node is under pressure:

QoS class  | Condition                                                              | Eviction priority
Guaranteed | every container has requests == limits for both cpu and memory         | evicted last
Burstable  | at least one container has a request or limit set, but not Guaranteed  | middle
BestEffort | no requests or limits anywhere                                         | evicted first

The Horizontal Pod Autoscaler (autoscaling/v2) is a controller that scales a Deployment/StatefulSet replica count based on metrics. The classic CPU rule, for a target utilization T and current utilization U:

desiredReplicas = ceil(currentReplicas * (U / T))

with a stabilization window and configurable scale-up / scale-down policies to prevent flapping. Metrics come from the metrics-server (for CPU/memory) or from custom/external adapters.
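
The stabilization window and scaling policies live under the `behavior` field of an autoscaling/v2 HPA. The sketch below shows one way to set them; the object name `web`, the replica bounds, and every number are illustrative assumptions, not recommendations.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web                              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                            # hypothetical target Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0      # react to rising load immediately
      policies:
        - type: Percent
          value: 100                     # at most double the replica count...
          periodSeconds: 60              # ...within any 60-second period
    scaleDown:
      stabilizationWindowSeconds: 300    # act on the highest recommendation of the last 5 minutes
      policies:
        - type: Pods
          value: 1                       # remove at most one pod...
          periodSeconds: 60              # ...within any 60-second period
```

A long scale-down window with a short (or zero) scale-up window is a common way to damp flapping: capacity arrives quickly and is released slowly.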

Why It Matters Here

This is where the scheduler, the kernel, and the cluster autoscaler all interact. Getting it wrong produces the most common production symptoms:

  • Pending pods because requests don't fit on any node
  • OOMKilled containers because memory limits are below real usage
  • CPU-throttled services during bursts because CPU limits are unrealistically low, or CPU-starved services because noisy neighbors run with no limits at all
  • pods evicted first on a noisy node because they are BestEffort
  • the HPA never scales because the metrics-server is not installed, or because requests are unset and it cannot compute utilization

Concrete Example

A Deployment plus an HPA:

apiVersion: apps/v1
kind: Deployment
metadata: { name: web }
spec:
  replicas: 2
  selector: { matchLabels: { app: web } }
  template:
    metadata: { labels: { app: web } }
    spec:
      containers:
        - name: app
          image: nginx:1.27
          resources:
            requests: { cpu: "250m", memory: "256Mi" }
            limits: { cpu: "500m", memory: "512Mi" }
          livenessProbe:
            # Stock nginx serves / but has no /healthz route; probing /healthz
            # would return 404 and restart the container.
            httpGet: { path: /, port: 80 }
            periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: web }
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }

Walkthrough:

  1. The scheduler places each Pod on a node whose allocatable resources (not the raw node size), minus the requests of the Pods already running there, still leave at least 250m CPU and 256Mi memory.
  2. Because requests != limits for both CPU and memory, the Pod is QoS class Burstable.
  3. The kernel enforces cpu.max (throttle at 500m) and memory.max (OOMKill at 512Mi).
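  4. The HPA controller re-evaluates metrics every 15s by default, reading CPU usage through the metrics-server. Utilization is measured against the request, so the 70% target corresponds to an average of 175m per pod. If average usage rises to 350m per pod (140% of the 250m request, still under the 500m limit), the HPA computes ceil(2 * 140 / 70) = 4 replicas.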

Common Confusion / Misconception

"Requests are soft; limits are hard."

The scheduler treats requests as hard for placement: if a Pod's requests don't fit, it does not get scheduled. The kernel treats limits as hard for enforcement. What is soft is the relationship between request and actual usage: a container is free to use less than its request (and free to use more, up to the limit, if the node has spare capacity -- "bursting").

A second confusion: "Setting limits equal to requests is always best." It gives you Guaranteed QoS (evicted last) and predictable performance, at the cost of no bursting: a spike beyond the limit is throttled rather than absorbed by spare node capacity, so tail latency during spikes gets worse. For critical services with steady load this is often a good trade; for bursty workloads it is not.
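
As a minimal sketch, this is the shape a container's `resources` block needs for the Pod to be classified Guaranteed. The values are illustrative assumptions; what matters is that requests and limits match for both cpu and memory, in every container of the Pod.

```yaml
# Container-level fragment; repeat the same pattern for every container in the Pod.
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"       # equal to the cpu request
    memory: "512Mi"   # equal to the memory request
```

Kubernetes records the class it computed in the Pod's `status.qosClass` field, which is a quick way to confirm the result.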

A third confusion: "CPU limit = number of CPUs." cpu: "1" is one core-second per second. At cpu: "500m", the container gets 500 millicores, i.e. half a core's worth of time, which may be served across all cores on the node. Memory, by contrast, is a hard byte count.

A fourth confusion: "HPA scales based on actual CPU." It scales based on usage / request. If requests.cpu is unset, there is nothing to divide by, and HPA cannot compute utilization.
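
For contrast, autoscaling/v2 also accepts a raw-value target for resource metrics. A sketch, with an illustrative threshold: the `AverageValue` type compares average usage per pod against an absolute quantity instead of a percentage of the request.

```yaml
# Fragment of an HPA spec; the 200m threshold is an illustrative assumption.
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: AverageValue
        averageValue: 200m   # average raw CPU usage per pod, not a percentage of requests.cpu
```

This removes the division by the request, but it is not a reason to leave requests unset: the scheduler still needs them for placement.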

How To Use It

Starting defaults:

  1. Always set requests for CPU and memory. Without them, the scheduler has no guidance and HPA cannot work.
  2. Set limits for memory almost always. Memory is not compressible; OOMKilling a runaway container is safer than swapping or starving its neighbors.
  3. Be deliberate about CPU limits. For latency-sensitive paths, experiment without CPU limits once memory is bounded.
  4. Pick QoS class by eviction risk: Guaranteed for critical, Burstable for typical, BestEffort only for dev.
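
Applied to a single container, the defaults above might look like the fragment below. The quantities are assumptions for illustration; real values should come from observed usage.

```yaml
# Container-level fragment following the starting defaults.
resources:
  requests:
    cpu: "250m"        # always set: used for placement and for HPA utilization
    memory: "256Mi"    # always set
  limits:
    memory: "512Mi"    # memory is not compressible, so bound it
    # No cpu limit here: the container may burst into spare node CPU.
    # Requests and limits differ, so the Pod is Burstable.
```

If the namespace has a LimitRange that defines a default CPU limit, that default is applied to containers that omit one, so check for it before assuming the container is unthrottled.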

Check Yourself

  1. What is the exact difference between requests and limits from the scheduler's and the kernel's points of view?
  2. How do requests, limits, and QoS class relate?
  3. Why does an HPA fail silently when requests.cpu is unset?

Mini Drill or Application

Deploy a CPU-stress container (image polinux/stress) with requests.cpu: 100m and limits.cpu: 200m. Hook an HPA with a 50% utilization target. Generate load. Run:

kubectl top pod
kubectl describe hpa
kubectl get events --sort-by=.lastTimestamp

Record how many replicas the HPA settles on at steady state. Change limits to 100m (equal to request) and repeat. Write a paragraph comparing behavior.
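
A starting manifest for the drill could look like the sketch below. The names, the `--cpu 1` argument, and the replica bounds are assumptions chosen for illustration; the stress worker itself generates the load.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-stress               # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpu-stress
  template:
    metadata:
      labels:
        app: cpu-stress
    spec:
      containers:
        - name: stress
          image: polinux/stress
          command: ["stress"]
          args: ["--cpu", "1"]   # assumption: one busy-loop CPU worker
          resources:
            requests:
              cpu: "100m"
            limits:
              cpu: "200m"        # change to 100m for the second run
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-stress
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-stress
  minReplicas: 1
  maxReplicas: 5                 # assumption: keeps the experiment small
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```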

Beyond HPA: VPA, KEDA, and the Cluster Autoscaler

HPA scales pods horizontally against a metric. Three siblings cover the other axes:

  • Vertical Pod Autoscaler (VPA). Adjusts requests and limits based on observed usage. Useful for workloads you cannot easily scale horizontally (single-writer DBs, legacy apps). Caution: by default it recreates pods to apply the new values.
  • KEDA (Kubernetes Event-Driven Autoscaling). Scales based on queue depth, pub/sub backlog, HTTP concurrency, Prometheus queries, etc. Handles scale-to-zero (useful for bursty workloads), which HPA alone does not.
  • Cluster Autoscaler / Karpenter. Scales nodes up when Pods are Pending because no node has room, and scales down when nodes are empty or underutilized enough that their Pods fit elsewhere. HPA adds pods; the cluster autoscaler adds the nodes those pods need.

For a full autoscaling story you usually combine HPA (pods) with the Cluster Autoscaler (nodes) and occasionally VPA (rightsizing). Treat it as a single architectural layer with three knobs, not three independent features.
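
One low-risk way to try the rightsizing knob is a VPA in recommendation-only mode, sketched below. It assumes the VPA components and CRDs are installed in the cluster (they are not part of core Kubernetes) and targets a hypothetical Deployment named `web`.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa          # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web            # hypothetical target
  updatePolicy:
    updateMode: "Off"    # compute recommendations only; never recreate pods
```

In this mode the recommendations appear in the object's status and can be copied into the Deployment's requests by hand, which avoids the pod recreation the default mode performs.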

Read This Only If Stuck