StatefulSets and Headless Services for Stateful Workloads
What This Concept Is
A Deployment treats Pods as interchangeable. For a database or a quorum-based system, Pods are not interchangeable: pod-0 is the primary, pod-1 has replica slot 1, each has its own disk, and client code cares which is which.
A StatefulSet gives you:
- Stable, ordinal names: web-0, web-1, web-2 (not random hashes).
- Stable DNS via a headless Service: web-0.web.default.svc.cluster.local always points to the same Pod identity.
- Ordered rollout: Pods are created 0, 1, 2, … and updated N-1, N-2, …, 0 by default.
- Per-Pod storage via volumeClaimTemplates: each ordinal gets its own PVC (and therefore its own PV) that survives Pod rescheduling.
A headless Service (clusterIP: None) is the companion piece. It has no virtual IP, so kube-proxy does no load balancing for it: DNS returns the Pod IPs directly. The Service name resolves to one A record per Ready Pod, and each Pod also gets an A record under its own ordinal name. This is how clients can target a specific replica.
Why It Matters Here
Almost every database, queue broker, or consensus system (etcd, ZooKeeper, Kafka, Cassandra, Postgres) you deploy on Kubernetes needs the guarantees a StatefulSet provides, whether you write the StatefulSet yourself or an operator manages it for you. If you try to run them as Deployments, you get symptoms like:
- replicas scrambling their data when a Pod is rescheduled because every replica mounted the same PV (or none did)
- "the primary" is whichever Pod happens to be Ready, because no Pod has a stable identity
- bootstrap fails because replicas come up in parallel and each tries to become leader
Concrete Example
A minimal three-replica StatefulSet with per-Pod disk:
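apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None          # headless: DNS returns Pod IPs directly
  selector: { app: db }
  ports: [{ name: pg, port: 5432 }]
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db          # must match the headless Service above
  replicas: 3
  selector: { matchLabels: { app: db } }
  template:
    metadata: { labels: { app: db } }
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: pg
          image: postgres:16
          env:
            # The official postgres image refuses to start without a password.
            # Demo value only; use a Secret in practice.
            - { name: POSTGRES_PASSWORD, value: change-me }
            # Keep the data directory in a subdirectory so a lost+found at the
            # volume root does not break initdb.
            - { name: PGDATA, value: /var/lib/postgresql/data/pgdata }
          ports: [{ name: pg, containerPort: 5432 }]
          volumeMounts:
            - { name: data, mountPath: /var/lib/postgresql/data }
  volumeClaimTemplates:    # one PVC per ordinal: data-db-0, data-db-1, ...
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources: { requests: { storage: 20Gi } }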
After kubectl apply:
- Pods db-0, db-1, db-2 exist, created in that order.
- PVCs data-db-0, data-db-1, data-db-2 exist, each bound to its own PV.
- DNS names db-0.db, db-1.db, db-2.db (fully qualified: db-0.db.default.svc.cluster.local) resolve directly to each Pod's IP.
If db-1 is rescheduled, it comes back with the same name, the same DNS record, and its PVC re-attached on the new node. From a client's point of view, db-1 still exists; only the node beneath it changed.
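To make the per-Pod DNS names concrete, here is a minimal sketch of a client Pod pinned to one replica. The Pod name db-client is hypothetical, and treating db-0 as the primary is an assumption about how the database is configured, not something the StatefulSet decides. PGHOST and PGPORT are the standard libpq environment variables that psql reads.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: db-client                # hypothetical name, for illustration only
spec:
  containers:
    - name: client
      image: postgres:16         # reused here only because it ships psql
      command: ["sleep", "infinity"]
      env:
        # Target one specific replica through its stable per-Pod DNS name.
        # Assumption: db-0 is the instance you want to talk to.
        - name: PGHOST
          value: db-0.db.default.svc.cluster.local
        - name: PGPORT
          value: "5432"
```

Because the name db-0.db.default.svc.cluster.local follows the Pod identity rather than the node, this client needs no reconfiguration when db-0 is rescheduled.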
Common Confusion / Misconception
"StatefulSets provide high availability."
They provide stable identity, not HA. The database application has to implement replication, leader election, and failover itself. Kubernetes only guarantees that db-0 is a single Pod with a single disk, not that losing it is safe.
A second confusion: "A StatefulSet needs a headless Service only for DNS." It needs it, full stop. The serviceName field must name a headless Service that you create yourself; the StatefulSet controller does not create it. The per-Pod DNS records are published by the cluster DNS on the basis of that Service, so without it db-0.db will not resolve.
A third confusion: "StatefulSets delete their PVCs when I delete the StatefulSet." By default they do not. This is intentional: deleting the workload should not delete the data. To clean up, you delete PVCs separately or set persistentVolumeClaimRetentionPolicy to Delete on whenDeleted / whenScaled.
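As a sketch of the retention policy described above, the fragment below would be merged into the spec of the db StatefulSet. It assumes a cluster version where persistentVolumeClaimRetentionPolicy is available; check your release before relying on it. The choice of values is only an example.

```yaml
# Fragment: merge into the existing StatefulSet named db; not a complete manifest.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # remove PVCs when the StatefulSet itself is deleted
    whenScaled: Retain    # keep PVCs of ordinals removed by scaling down (the default)
```

Leaving both fields at Retain reproduces the default behavior, where PVCs outlive the workload until you delete them by hand.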
How To Use It
Pick StatefulSet when any of these are true:
- the workload keeps durable state on disk
- replicas are not interchangeable (primary/replica, shards, members of a quorum)
- the protocol needs stable peer identities (Raft, gossip, Kafka controller)
Pick Deployment for everything else.
Check Yourself
- Why does a StatefulSet need a headless Service?
- What survives a Pod rescheduling in a StatefulSet that would not survive in a Deployment?
- What does Kubernetes not give you about a stateful workload, even with a StatefulSet?
Rollout and Partition
StatefulSets support two update strategies:
- RollingUpdate (default): update ordinals in reverse order (N-1 down to 0).
- OnDelete: do not update Pods on template change; the operator deletes Pods when ready.
A partition: N field in RollingUpdate pins ordinals below N to the old version and only updates ordinals ≥ N. This is a canary-by-ordinal pattern: set partition: 2 on a 3-replica set, only db-2 updates; verify it; move partition down to 1; db-1 updates; and so on.
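The first step of that canary-by-ordinal sequence could look like the fragment below, using the three-replica db StatefulSet from the earlier example. It is a sketch to merge into that manifest, not a complete one.

```yaml
# Fragment: merge into the existing StatefulSet named db.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2   # only ordinals >= 2 (db-2) pick up template changes
```

After verifying db-2, lowering partition to 1 and then to 0 lets db-1 and db-0 follow in turn.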
Combined with a PodDisruptionBudget and a readiness probe that gates on replication lag, this gives safe, gradual upgrades for stateful workloads without handing the rollout to a Deployment, whose ReplicaSet-based rolling update replaces Pods in no particular order and gives the replacements new names.
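A minimal PodDisruptionBudget for the db example might look like the sketch below. The name db-pdb is hypothetical, and maxUnavailable: 1 assumes the database tolerates losing one of its three members at a time. The readiness probe is not shown because the lag check is specific to the database in use.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb          # hypothetical name
spec:
  maxUnavailable: 1     # assumption: one member down at a time is safe
  selector:
    matchLabels:
      app: db           # same label the StatefulSet's Pods carry
```

The budget limits voluntary disruptions such as node drains; it does not protect against node failure.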
Mini Drill or Application
Deploy the StatefulSet above (swap in whatever StorageClass your cluster has). From a debug Pod:
dig +short db-0.db.default.svc.cluster.local
dig +short db.default.svc.cluster.local
Observe the difference: the first returns one IP; the second returns all Ready Pod IPs. Delete db-1. Watch it return with the same name and the same PVC. Write a paragraph explaining why this is different from deleting a Pod in a Deployment.
When Not to Use a StatefulSet
"Stateful" does not automatically mean "StatefulSet." Two common anti-patterns:
- Stateless caches on a StatefulSet. Redis in replication mode where the cache can be rehydrated does not need ordinal names; a Deployment plus a headless Service works and scales faster.
- Shared-storage workloads. If every replica reads/writes the same ReadWriteMany volume, volumeClaimTemplates (per-ordinal PVCs) is exactly the wrong shape. Use a Deployment with a single shared PVC.
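The shared-storage case can be sketched as one PVC mounted by every replica of a Deployment. All names here (shared-data, worker, the shared-rwx StorageClass, the image) are hypothetical, and the sketch assumes your cluster has a StorageClass that supports ReadWriteMany.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data                  # hypothetical name
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: shared-rwx       # hypothetical; must support ReadWriteMany
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker                       # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      containers:
        - name: worker
          image: example.com/worker:1.0   # hypothetical image
          volumeMounts:
            - name: shared
              mountPath: /data
      volumes:
        - name: shared
          persistentVolumeClaim:
            claimName: shared-data   # every replica mounts the same claim
```

There is one claim for all replicas, so no Pod needs a stable identity to find its disk.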
Conversely, some workloads really need a step beyond StatefulSet. Complex stateful systems (Postgres with HA, Kafka, TiDB) are typically run via an Operator that wraps a StatefulSet with leader election, backup schedules, and upgrade orchestration. StatefulSet gives you identity and storage; the operator gives you operations.
Read This Only If Stuck
- Linux Command Line: Mounting and unmounting storage devices -- per-ordinal disks get mounted via exactly these kernel primitives.
- Kubernetes: StatefulSets -- canonical reference for ordinals, rollouts, and PVC retention policy.
- Kubernetes: Headless Services -- the DNS behavior StatefulSets depend on.
- Kubernetes: DNS for Services and Pods (StatefulSet A records) -- how pod-0.svc.ns.svc.cluster.local is published.
- Kubernetes: Run a Replicated Stateful Application -- MySQL tutorial that puts the pieces together.
- Kubernetes: Operator Pattern -- why production databases wrap StatefulSet with a CRD + controller.
- CloudNativePG (Postgres operator) -- a realistic example of operator-managed failover, backup, and WAL archiving; note that it manages Pods and PVCs directly rather than through a StatefulSet.
- Strimzi (Kafka operator) -- Kafka on Kubernetes via CRDs; recent releases manage broker Pods through Strimzi's own StrimziPodSet resource rather than StatefulSets.
- Kubernetes blog: StatefulSet PVC retention -- background on persistentVolumeClaimRetentionPolicy.