StatefulSets and Headless Services for Stateful Workloads

What This Concept Is

A Deployment treats Pods as interchangeable. For a database or a quorum-based system, Pods are not interchangeable: pod-0 is the primary, pod-1 has replica slot 1, each has its own disk, and client code cares which is which.

A StatefulSet gives you:

  • Stable, ordinal names: web-0, web-1, web-2 (not random hashes).
  • Stable DNS via a headless Service: web-0.web.default.svc.cluster.local always points to the same Pod identity.
  • Ordered rollout: Pods are created 0, 1, 2, … and updated N-1, N-2, …, 0 by default.
  • Per-Pod storage via volumeClaimTemplates: each ordinal gets its own PVC (and therefore its own PV) that survives Pod rescheduling.

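A headless Service (clusterIP: None) is the companion piece. It has no virtual IP and no kube-proxy load balancing: a DNS lookup of the Service name returns the Pod IPs directly, one A record per Ready Pod, and each Pod also gets its own A record under its ordinal name (for example web-0.web). This is how clients can target a specific replica.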

Why It Matters Here

Most databases, queue brokers, and consensus systems (etcd, ZooKeeper, Kafka, Cassandra, Postgres) that you deploy on Kubernetes run on a StatefulSet, either directly or through an operator. If you try to run them as Deployments, you get symptoms like:

  • replicas corrupt or lose data when a Pod is rescheduled, because every replica mounted the same PV (or none mounted one at all)
  • "the primary" is whichever Pod happens to be Ready, because no Pod has a stable identity
  • bootstrap fails because replicas come up in parallel and each tries to become leader

Concrete Example

A minimal three-replica StatefulSet with per-Pod disk:

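# Headless Service: gives each Pod a stable DNS name (db-0.db, db-1.db, ...).
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None
  selector: { app: db }
  ports: [{ name: pg, port: 5432 }]
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db            # must match the headless Service above
  replicas: 3
  selector: { matchLabels: { app: db } }
  template:
    metadata: { labels: { app: db } }
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: pg
          # Note: the postgres image will not start unless POSTGRES_PASSWORD
          # (or another auth setting) is supplied, for example from a Secret.
          image: postgres:16
          ports: [{ name: pg, containerPort: 5432 }]
          volumeMounts:
            - { name: data, mountPath: /var/lib/postgresql/data }
  # One PVC per ordinal: data-db-0, data-db-1, data-db-2.
  volumeClaimTemplates:
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3   # swap in a StorageClass that exists in your cluster
        resources: { requests: { storage: 20Gi } }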

After kubectl apply:

  • Pods db-0, db-1, db-2 are created in that order, each one waiting until its predecessor is Running and Ready.
  • PVCs data-db-0, data-db-1, data-db-2 exist, each bound to its own PV.
  • DNS names db-0.db, db-1.db, db-2.db (fully qualified db-0.db.default.svc.cluster.local) resolve directly to each Pod's IP.

If db-1 is rescheduled, it comes back with the same name, the same DNS record, and its PVC re-attached on the new node. From a client's point of view, db-1 still exists; only the node beneath it changed.

Common Confusion / Misconception

"StatefulSets provide high availability."

They provide stable identity, not HA. The database application has to implement replication, leader election, and failover itself. Kubernetes only guarantees that db-0 is a single Pod with a single disk, not that losing it is safe.

A second confusion: "The headless Service is an optional extra." It is not, if you want stable network identity. The StatefulSet controller sets each Pod's hostname and subdomain from the serviceName field, but it does not create the Service, and cluster DNS publishes per-Pod records only when a matching headless Service exists. Without one, db-0.db will not resolve.

A third confusion: "StatefulSets delete their PVCs when I delete the StatefulSet." By default they do not. This is intentional: deleting the workload should not delete the data. To clean up, you delete the PVCs separately, or set persistentVolumeClaimRetentionPolicy so that whenDeleted (and, for scale-down, whenScaled) is Delete. Both default to Retain.
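As a sketch of the retention setting just described, the fragment below would be merged into the db StatefulSet from the Concrete Example. The particular combination (Delete on whenDeleted, Retain on whenScaled) is an illustrative assumption rather than a recommendation, and the field is only honored on Kubernetes versions that support StatefulSet PVC retention policies.

```yaml
# Fragment to merge into the db StatefulSet; the rest of the spec is unchanged.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # remove the data-db-* PVCs when the StatefulSet is deleted
    whenScaled: Retain    # keep PVCs of ordinals removed by a scale-down (the default)
```

With whenScaled left at Retain, scaling from 3 to 2 replicas keeps data-db-2, so scaling back up re-attaches the old disk to db-2.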

How To Use It

Pick StatefulSet when any of these are true:

  • the workload keeps durable state on disk
  • replicas are not interchangeable (primary/replica, shards, members of a quorum)
  • the protocol needs stable peer identities (Raft, gossip, Kafka controller)

Pick Deployment for everything else.

Check Yourself

  1. Why does a StatefulSet need a headless Service?
  2. What survives a Pod rescheduling in a StatefulSet that would not survive in a Deployment?
  3. What does Kubernetes not give you about a stateful workload, even with a StatefulSet?

Rollout and Partition

StatefulSets support two update strategies:

  • RollingUpdate (default): update ordinals in reverse order (N-1 down to 0).
  • OnDelete: do not update Pods on template change; operator deletes Pods when ready.

A partition: N field in RollingUpdate pins ordinals below N to the old version and only updates ordinals ≥ N. This is a canary-by-ordinal pattern: set partition: 2 on a 3-replica set, only db-2 updates; verify it; move partition down to 1; db-1 updates; and so on.
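The first step of that canary can be sketched as the following fragment, assuming the three-replica db StatefulSet from the Concrete Example. Only the updateStrategy block is shown; it is meant to be merged into the existing spec.

```yaml
# Fragment to merge into the db StatefulSet spec.
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2   # only ordinals >= 2 (here, db-2) move to the new template
```

Lowering partition to 1 and then to 0 lets db-1 and db-0 follow. A partition of 0 is the default and means every ordinal is updated.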

Combined with a PodDisruptionBudget and a readiness probe that gates on replication lag, this gives safe, gradual upgrades for stateful workloads without handing the rollout to a Deployment, which swaps Pods between an old and a new ReplicaSet with no regard for ordinal or identity.
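A minimal PodDisruptionBudget for the db Pods might look like the sketch below. The maxUnavailable value of 1 is an assumption about how many replicas the database can lose at once; the replication-lag readiness probe is application-specific and is not shown.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db
spec:
  maxUnavailable: 1   # assumed tolerance: at most one db Pod evicted at a time
  selector:
    matchLabels:
      app: db         # same label the StatefulSet puts on its Pods
```

The budget limits voluntary evictions such as node drains, so a drain cannot take down a second replica while one is already unavailable.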

Mini Drill or Application

Deploy the StatefulSet above (swap in whatever StorageClass your cluster has). From a debug Pod:

dig +short db-0.db.default.svc.cluster.local
dig +short db.default.svc.cluster.local

Observe the difference: the first returns one IP; the second returns all Ready Pod IPs. Delete db-1. Watch it return with the same name and the same PVC. Write a paragraph explaining why this is different from deleting a Pod in a Deployment.

When Not to Use a StatefulSet

"Stateful" does not automatically mean "StatefulSet." Two common anti-patterns:

  • Stateless caches on a StatefulSet. A Redis cache whose contents can be rehydrated from the source of truth does not need ordinal names; a Deployment plus a headless Service works and scales faster.
  • Shared-storage workloads. If every replica reads/writes the same ReadWriteMany volume, volumeClaimTemplates (per-ordinal PVCs) is exactly the wrong shape. Use a Deployment with a single shared PVC.
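The shared-storage case above can be sketched as one PVC mounted by every replica of a Deployment. All names here (shared-data, shared-rwx, worker, the image) are hypothetical placeholders, and the StorageClass must be backed by storage that actually supports ReadWriteMany.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data              # hypothetical name
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: shared-rwx   # hypothetical; must support ReadWriteMany
  resources: { requests: { storage: 20Gi } }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker                   # hypothetical name
spec:
  replicas: 3
  selector: { matchLabels: { app: worker } }
  template:
    metadata: { labels: { app: worker } }
    spec:
      containers:
        - name: worker
          image: example.com/worker:1.0   # placeholder image
          volumeMounts:
            - { name: shared, mountPath: /data }
      volumes:
        - name: shared
          persistentVolumeClaim:
            claimName: shared-data        # every replica mounts the same claim
```

Because the claim is declared once under volumes rather than stamped out per ordinal, scaling the Deployment adds Pods without creating new disks.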

Conversely, some workloads need a step beyond StatefulSet. Complex stateful systems (Postgres with HA, Kafka, TiDB) are typically run via an operator that wraps a StatefulSet with leader election, backup schedules, and upgrade orchestration. StatefulSet gives you identity and storage; the operator gives you operations.

Read This Only If Stuck