Resource Requests, Limits, QoS Classes, and the HPA
What This Concept Is
Every container in a Pod can declare two numbers per resource (cpu, memory):
- requests -- the amount the scheduler reserves on a node for this container. Used for placement and, for CPU, as the cgroup cpu.weight baseline.
- limits -- the maximum the container is allowed to use. Enforced by the kernel's cgroup: cpu.max (throttling) and memory.max (OOMKill when exceeded).
From the requests/limits pairs across a Pod's containers, Kubernetes computes the Pod's QoS class, which drives eviction order when a node is under pressure:
| QoS class | Condition | Eviction priority |
|---|---|---|
| Guaranteed | every container has requests == limits for both cpu and memory | evicted last |
| Burstable | at least one container has a request or limit set, but the Pod is not Guaranteed | middle |
| BestEffort | no requests or limits anywhere | evicted first |
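As a minimal sketch of the Guaranteed row, the Pod below sets requests equal to limits for both cpu and memory in its only container. The Pod name, container name, and resource values are hypothetical placeholders. Once a Pod like this is created, its computed class can be read back from the status.qosClass field.

```yaml
# Hypothetical Pod used only to illustrate the Guaranteed QoS condition.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed-demo   # assumed name, not from the original text
spec:
  containers:
    - name: app
      image: nginx:1.27
      resources:
        # requests == limits for BOTH cpu and memory in EVERY container
        requests:
          cpu: "500m"
          memory: "256Mi"
        limits:
          cpu: "500m"
          memory: "256Mi"
```

Removing either limit, or raising one above its request, would move the Pod to Burstable; removing the whole resources stanza would make it BestEffort.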
The Horizontal Pod Autoscaler (autoscaling/v2) is a controller that scales a Deployment/StatefulSet replica count based on metrics. The classic CPU rule, for a target utilization T and current utilization U:
desiredReplicas = ceil(currentReplicas * (U / T))
with a stabilization window and configurable scale-up / scale-down policies to prevent flapping. Metrics come from the metrics-server (for CPU/memory) or from custom/external adapters.
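The stabilization window and scaling policies mentioned above live under spec.behavior in autoscaling/v2. The fragment below is a sketch with illustrative numbers, not tuned recommendations: it lets the HPA scale up quickly while scaling down slowly, which is one common way to damp flapping.

```yaml
# Fragment of an autoscaling/v2 HorizontalPodAutoscaler spec.
# All numbers are illustrative assumptions, not tuned recommendations.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0     # react to load increases immediately
    policies:
      - type: Percent
        value: 100                    # at most double the replica count...
        periodSeconds: 60             # ...within any 60-second period
  scaleDown:
    stabilizationWindowSeconds: 300   # use the highest recommendation from the last 5 minutes
    policies:
      - type: Pods
        value: 1                      # remove at most one Pod...
        periodSeconds: 60             # ...per 60-second period
```

The asymmetry is deliberate in this sketch: a late scale-up costs latency, while a premature scale-down followed by a scale-up costs both latency and churn.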
Why It Matters Here
This is where the scheduler, the kernel, and the cluster autoscaler all interact. Getting it wrong produces the most common production symptoms:
- Pending pods because requests don't fit on any node
- OOMKilled containers because memory limits are below real usage
- CPU-starved services during bursts because limits are missing or unrealistically low
- pods evicted first on a noisy node because they are BestEffort
- the HPA never scales because the metrics-server is not installed, or because requests are unset and it cannot compute utilization
Concrete Example
A Deployment plus an HPA:
    apiVersion: apps/v1
    kind: Deployment
    metadata: { name: web }
    spec:
      replicas: 2
      selector: { matchLabels: { app: web } }
      template:
        metadata: { labels: { app: web } }
        spec:
          containers:
            - name: app
              image: nginx:1.27
              resources:
                # requests drive scheduling and HPA utilization; limits are enforced by the kernel
                requests: { cpu: "250m", memory: "256Mi" }
                limits: { cpu: "500m", memory: "512Mi" }
              livenessProbe:
                # stock nginx serves / but has no /healthz route, so probe /
                httpGet: { path: /, port: 80 }
                periodSeconds: 10
    ---
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata: { name: web }
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            # target is a percentage of requests.cpu, averaged across Pods
            target: { type: Utilization, averageUtilization: 70 }
Walkthrough:
- The scheduler places each Pod on a node whose allocatable resources (not the raw node size) still have at least 250m CPU and 256Mi memory not claimed by other Pods' requests.
- Because requests != limits for both CPU and memory, the Pod is QoS class Burstable.
- The kernel enforces cpu.max (throttle at 500m) and memory.max (OOMKill at 512Mi).
- The HPA controller evaluates metrics from the metrics-server every 15s by default. When average CPU usage rises to 350m per pod against the 250m request (140% utilization, because the target is expressed as a percentage of the request and is set to 70%), it computes a new desired replica count. Usage cannot stay above 500m per pod, since the CPU limit throttles it there.
Common Confusion / Misconception
"Requests are soft; limits are hard."
The scheduler treats requests as hard for placement: if a Pod's requests don't fit, it does not get scheduled. The kernel treats limits as hard for enforcement. What is soft is the relationship between request and actual usage: a container is free to use less than its request (and free to use more, up to the limit, if the node has spare capacity -- "bursting").
A second confusion: "Setting limits equal to requests is always best." It gives you Guaranteed QoS (evicted last) and predictable performance, at the cost of no bursting, so your tail latency during spikes is worse. For critical services that must survive node pressure this is often a good trade; for bursty workloads it is not.
A third confusion: "CPU limit = number of CPUs." cpu: "1" is one core-second per second. At cpu: "500m", the container gets 500 millicores, i.e. half a core's worth of time, which may be served across all cores on the node. Memory, by contrast, is a hard byte count.
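To make the "time, not cores" point concrete, here is the arithmetic behind a 500m limit. It assumes the default CFS period of 100 ms, which is an assumption about node configuration rather than something stated in the text above.

```latex
\[
\text{quota} = \text{limit} \times \text{period}
             = 0.5 \times 100\,\text{ms}
             = 50\,\text{ms of CPU time per } 100\,\text{ms period}
\]
\[
\text{with 4 busy threads: } \frac{50\,\text{ms}}{4} = 12.5\,\text{ms of wall-clock time before throttling, then } 87.5\,\text{ms stalled}
\]
```

On a cgroup v2 node with that period, this limit would appear in cpu.max as 50000 100000. The second line shows why a multi-threaded container can be throttled even when its average usage looks modest.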
A fourth confusion: "HPA scales based on actual CPU." It scales based on usage / request. If requests.cpu is unset, there is nothing to divide by, and HPA cannot compute utilization.
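A worked instance of the usage / request rule, using the 250m request, 70% target, and 2 replicas from the example above. The average usage of 350m per Pod is an assumed figure for illustration, not a measurement.

```latex
\[
U = \frac{\text{usage}}{\text{request}} = \frac{350\text{m}}{250\text{m}} = 140\%
\]
\[
\text{desiredReplicas} = \left\lceil 2 \times \frac{140}{70} \right\rceil = \lceil 4 \rceil = 4
\]
```

With requests.cpu unset, the denominator in the first line is undefined, which is why the HPA cannot produce a value for U at all.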
How To Use It
Starting defaults:
- Always set requests for CPU and memory. Without them, the scheduler has no guidance and HPA cannot work.
- Set limits for memory almost always. Memory is not compressible; OOMKilling a runaway container is safer than swapping or starving its neighbors.
- Be deliberate about CPU limits. For latency-sensitive paths, experiment without CPU limits once memory is bounded.
- Pick QoS class by eviction risk: Guaranteed for critical, Burstable for typical, BestEffort only for dev.
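Applied to a single container, the starting defaults above might look like the following fragment. The values are placeholders to be replaced with figures from your own measurements, and leaving out the CPU limit reflects the "be deliberate about CPU limits" experiment, not a universal recommendation.

```yaml
# Container-level fragment; all values are placeholder assumptions.
resources:
  requests:
    cpu: "250m"       # always set: used for scheduling and HPA utilization
    memory: "256Mi"   # always set: used for scheduling
  limits:
    memory: "512Mi"   # bound memory, because it is not compressible
    # no cpu limit: the container may burst when the node has spare CPU
```

Because requests and limits differ here, a Pod using this stanza is Burstable.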
Check Yourself
- What is the exact difference between requests and limits from the scheduler's and the kernel's points of view?
- How do requests, limits, and QoS class relate?
- Why does an HPA fail silently when requests.cpu is unset?
Mini Drill or Application
Deploy a CPU-stress container (image polinux/stress) with requests.cpu: 100m and limits.cpu: 200m. Hook an HPA with a 50% utilization target. Generate load. Run:
kubectl top pod
kubectl describe hpa
kubectl get events --sort-by=.lastTimestamp
Record how many replicas the HPA settles on at steady state. Change limits to 100m (equal to request) and repeat. Write a paragraph comparing behavior.
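One possible starting point for the drill is sketched below. The object names, replica bounds, and stress arguments are assumptions chosen for illustration; here the stress container generates the load itself, so no separate load generator is needed.

```yaml
# Sketch for the drill; names and replica bounds are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-stress
spec:
  replicas: 1
  selector:
    matchLabels: { app: cpu-stress }
  template:
    metadata:
      labels: { app: cpu-stress }
    spec:
      containers:
        - name: stress
          image: polinux/stress
          command: ["stress"]
          args: ["--cpu", "1"]          # one worker spinning on the CPU
          resources:
            requests: { cpu: "100m" }
            limits: { cpu: "200m" }     # change to 100m for the second run
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-stress
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-stress
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 50 }
```

Save it to a file, apply it with kubectl apply -f, and give the metrics pipeline a minute or two before reading the commands' output.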
Beyond HPA: VPA, KEDA, and the Cluster Autoscaler
HPA scales pods horizontally against a metric. Three siblings cover the other axes:
- Vertical Pod Autoscaler (VPA). Adjusts requests and limits based on observed usage. Useful for workloads you cannot easily scale horizontally (single-writer DBs, legacy apps). Caution: by default it recreates pods to apply the new values.
- KEDA (Kubernetes Event-Driven Autoscaling). Scales based on queue depth, pub/sub backlog, HTTP concurrency, Prometheus queries, etc. Handles scale-to-zero (useful for bursty workloads), which HPA alone does not.
- Cluster Autoscaler / Karpenter. Scales nodes up when Pods are Pending because no node has room, and scales down when nodes are empty. HPA adds pods; the cluster autoscaler adds the nodes those pods need.
For a full autoscaling story you usually combine HPA (pods) with the Cluster Autoscaler (nodes) and occasionally VPA (rightsizing). Treat it as a single architectural layer with three knobs, not three independent features.
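As a sketch of the "occasionally VPA (rightsizing)" case, the manifest below asks a VPA to produce recommendations for the web Deployment without acting on them. It assumes the VPA components and their CRDs are already installed in the cluster, which is not the case by default; the object name is hypothetical.

```yaml
# Assumes the Vertical Pod Autoscaler CRDs and controllers are installed.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa            # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"      # recommend only; never evict or recreate Pods
```

Running in recommendation-only mode sidesteps the pod-recreation caution above: the suggested requests appear in the object's status, where they can be compared with the values you set by hand.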
Read This Only If Stuck
- Linux Command Line: Viewing processes dynamically with top -- top semantics are the foundation for kubectl top and the metrics pipeline.
- Linux Command Line: How a process works and viewing processes -- CPU shares and OOM behavior bottom out in these process primitives.
- Kubernetes: Resource Management for Pods and Containers -- authoritative reference on requests/limits.
- Kubernetes: Configure Quality of Service for Pods -- how QoS classes are computed and used.
- Kubernetes: Assign Memory Resources to Containers and Pods -- hands-on walkthrough including OOMKilled behavior.
- Kubernetes: Horizontal Pod Autoscaling -- algorithm, stabilization window, scaling policies.
- Kubernetes: HPA Walkthrough -- step-by-step verification against a synthetic load.
- Kubernetes: Vertical Pod Autoscaler (GitHub) -- VPA design doc and install guide.
- KEDA documentation -- event-driven autoscaling and scale-to-zero.
- Cluster Autoscaler / Karpenter (AWS) -- node-level autoscaling that complements HPA.