StatefulSets and Headless Services for Stateful Workloads

What This Concept Is

A Deployment treats Pods as interchangeable. For a database or a quorum-based system, Pods are not interchangeable: pod-0 is the primary, pod-1 has replica slot 1, each has its own disk, and client code cares which is which.

A StatefulSet gives you:

Stable, ordinal names: web-0, web-1, web-2 (not random hashes).
Stable DNS via a headless Service: web-0.web.default.svc.cluster.local always points to the same Pod identity.
Ordered rollout: Pods are created 0, 1, 2, … and updated N-1, N-2, …, 0 by default.
Per-Pod storage via volumeClaimTemplates: each ordinal gets its own PVC (and therefore its own PV) that survives Pod rescheduling.

A headless Service (clusterIP: None) is the companion piece. No virtual IP is allocated, so kube-proxy does no load balancing for it: a DNS query for the Service name returns the Pod IPs directly (one A record per Ready Pod), and each Pod also gets an A record under its own ordinal name. This is how clients can target a specific replica.

Why It Matters Here

Almost every database, queue broker, or consensus system (etcd, ZooKeeper, Kafka, Cassandra, Postgres) you deploy on Kubernetes relies on a StatefulSet, or on an operator that provides the same guarantees. If you try to run them as Deployments, you get symptoms like:

replicas scrambling their data when a Pod is rescheduled, because every replica mounted the same PV (or none did)
"the primary" being whichever Pod happens to be Ready, because no Pod has a stable identity
bootstrap failing because replicas come up in parallel and each tries to become leader

Concrete Example

A minimal three-replica StatefulSet with per-Pod disk:

apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None
  selector: { app: db }
  ports: [{ name: pg, port: 5432 }]
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  replicas: 3
  selector: { matchLabels: { app: db } }
  template:
    metadata: { labels: { app: db } }
    spec:
      terminationGracePeriodSeconds: 60
      containers:
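        - name: pg
          image: postgres:16
          env:
            # Demo only: the postgres image will not start without a password; use a Secret in real clusters.
            - { name: POSTGRES_PASSWORD, value: example }
            # Keep the data directory in a subdirectory so initdb does not trip over the volume's lost+found.
            - { name: PGDATA, value: /var/lib/postgresql/data/pgdata }
          ports: [{ name: pg, containerPort: 5432 }]
          volumeMounts:
            - { name: data, mountPath: /var/lib/postgresql/data }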
  volumeClaimTemplates:
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources: { requests: { storage: 20Gi } }

After kubectl apply:

Pods db-0, db-1, db-2 are created in order, each one only after the previous is Running and Ready.
PVCs data-db-0, data-db-1, data-db-2 exist, each bound to its own PV.
DNS names db-0.db, db-1.db, db-2.db (fully qualified db-0.db.default.svc.cluster.local) resolve directly to each Pod's IP.

If db-1 is rescheduled, it comes back with the same name, the same DNS record, and its PVC re-attached on the new node. From a client's point of view, db-1 still exists; only the node beneath it changed.

Common Confusion / Misconception

"StatefulSets provide high availability."

They provide stable identity, not HA. The database application has to implement replication, leader election, and failover itself. Kubernetes only guarantees that db-0 is a single Pod with a single disk, not that losing it is safe.

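A second confusion: "A StatefulSet needs a headless Service only for DNS." It needs it, full stop. The StatefulSet controller does not create the Service for you; it only sets each Pod's hostname and subdomain from the serviceName field. Cluster DNS publishes the per-Pod records only when a headless Service with that name selects the Pods, so without it db-0.db will not resolve.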

A third confusion: "StatefulSets delete their PVCs when I delete the StatefulSet." By default they do not. This is intentional: deleting the workload should not delete the data. To clean up, you delete PVCs separately or set persistentVolumeClaimRetentionPolicy to Delete on whenDeleted / whenScaled.
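As a minimal sketch, the retention policy is a fragment of the StatefulSet spec. The values chosen here are an assumption for illustration (both fields default to Retain), and the field is only honored on cluster versions where StatefulSet PVC auto-deletion is available.

```yaml
# Fragment of the db StatefulSet; merge into its existing spec.
spec:
  persistentVolumeClaimRetentionPolicy:
    # Remove data-db-0..N when the StatefulSet object itself is deleted.
    whenDeleted: Delete
    # Keep the PVCs of ordinals removed by a scale-down, so scaling back up reuses the data.
    whenScaled: Retain
```

Mixing the two values like this keeps scale-down reversible while still cleaning up after a full teardown; pick Retain for both if the data must outlive the workload.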

How To Use It

Pick StatefulSet when any of these are true:

the workload keeps durable state on disk
replicas are not interchangeable (primary/replica, shards, members of a quorum)
the protocol needs stable peer identities (Raft, gossip, Kafka controller)

Pick Deployment for everything else.

Check Yourself

Why does a StatefulSet need a headless Service?
What survives a Pod rescheduling in a StatefulSet that would not survive in a Deployment?
What does Kubernetes not give you about a stateful workload, even with a StatefulSet?

Rollout and Partition

StatefulSets support two update strategies:

RollingUpdate (default): update ordinals in reverse order (N-1 down to 0).
OnDelete: the controller does not touch Pods when the template changes; each Pod picks up the new template only after you (or your tooling) delete it.

A partition: N field under rollingUpdate pins ordinals below N to the old version and updates only ordinals ≥ N. This is a canary-by-ordinal pattern: with partition: 2 on a 3-replica set, only db-2 updates. Verify it, lower the partition to 1 so db-1 updates, and so on down to 0.
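A sketch of the first canary step, written as a fragment of the db StatefulSet from the example (replicas: 3 is repeated only for context):

```yaml
# Fragment of the db StatefulSet; merge into its existing spec.
spec:
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Ordinals >= 2 (here only db-2) move to the new template;
      # db-0 and db-1 stay on the old one, even if they are deleted and recreated.
      partition: 2
```

Once db-2 looks healthy, change partition to 1 and re-apply, then to 0 to finish the rollout.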

Combined with a PodDisruptionBudget and a readiness probe that gates on replication lag, this gives safe, gradual upgrades for stateful workloads: you decide which ordinal moves next, instead of a Deployment rollout shifting Pods between an old and a new ReplicaSet in whatever order it likes.
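The two supporting pieces might look like the following sketch. The PodDisruptionBudget is a complete manifest for the db example; it limits voluntary evictions such as node drains. The readiness probe is only a fragment of the container spec, and the script path in it is hypothetical: you would have to supply a check that matches your database.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db
spec:
  # Allow voluntary disruptions (drains, evictions) to take down at most one db Pod at a time.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: db
```

```yaml
# Fragment of the pg container in the db StatefulSet; not a standalone manifest.
readinessProbe:
  exec:
    # Hypothetical script: exits 0 only when replication lag is within your threshold.
    command: ["/scripts/check-replication-lag.sh"]
  periodSeconds: 10
  failureThreshold: 3
```

Because a rolling update waits for each updated Pod to become Ready before moving to the next ordinal, a lag-aware probe makes the rollout pause while a replica is still catching up.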

Mini Drill or Application

Deploy the StatefulSet above (swap in whatever StorageClass your cluster has). From a debug Pod:

dig +short db-0.db.default.svc.cluster.local
dig +short db.default.svc.cluster.local

Observe the difference: the first returns one IP; the second returns all Ready Pod IPs. Delete db-1. Watch it return with the same name and the same PVC. Write a paragraph explaining why this is different from deleting a Pod in a Deployment.
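If you have no debug Pod at hand, one possible manifest is sketched here. The Pod name and the image are assumptions; any image that ships dig will do. Create it in the default namespace so the fully qualified names above match.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-debug          # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: debug
      # Assumption: netshoot is one commonly used troubleshooting image that includes dig.
      image: nicolaka/netshoot
      # Keep the container alive long enough to exec into it.
      command: ["sleep", "3600"]
```

With the Pod running, kubectl exec -it dns-debug -- dig +short db-0.db.default.svc.cluster.local runs the first query from inside the cluster.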

When Not to Use a StatefulSet

"Stateful" does not automatically mean "StatefulSet." Two common anti-patterns:

Stateless caches on a StatefulSet. A Redis used purely as a cache, where any lost data can be rehydrated from the source of truth, does not need ordinal names or per-Pod disks; a Deployment plus a headless Service works and scales faster.
Shared-storage workloads. If every replica reads/writes the same ReadWriteMany volume, volumeClaimTemplates (per-ordinal PVCs) is exactly the wrong shape. Use a Deployment with a single shared PVC.
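The shared-storage shape can be sketched as one PVC referenced by every replica of a Deployment. All names here (shared-data, worker, the image, the shared-rwx StorageClass) are hypothetical, and the sketch assumes your cluster has a provisioner that supports ReadWriteMany.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes: ["ReadWriteMany"]
  # Assumption: a StorageClass backed by RWX-capable storage (for example NFS).
  storageClassName: shared-rwx
  resources: { requests: { storage: 20Gi } }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
spec:
  replicas: 3
  selector: { matchLabels: { app: worker } }
  template:
    metadata: { labels: { app: worker } }
    spec:
      containers:
        - name: worker
          image: example.com/worker:1.0   # hypothetical image
          volumeMounts:
            - { name: shared, mountPath: /data }
      volumes:
        # Every replica mounts the same claim; there is no per-ordinal PVC.
        - name: shared
          persistentVolumeClaim:
            claimName: shared-data
```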

Conversely, some workloads really need a step beyond StatefulSet. Complex stateful systems (Postgres with HA, Kafka, TiDB) are typically run via an Operator that layers leader election, backup schedules, and upgrade orchestration on top of a StatefulSet, or on Pods and PVCs it manages directly. StatefulSet gives you identity and storage; the operator gives you operations.

Read This Only If Stuck

Linux Command Line: Mounting and unmounting storage devices -- per-ordinal disks get mounted via exactly these kernel primitives.
Kubernetes: StatefulSets -- canonical reference for ordinals, rollouts, and PVC retention policy.
Kubernetes: Headless Services -- the DNS behavior StatefulSets depend on.
Kubernetes: DNS for Services and Pods (StatefulSet A records) -- how pod-0.svc.ns.svc.cluster.local is published.
Kubernetes: Run a Replicated Stateful Application -- MySQL tutorial that puts the pieces together.
Kubernetes: Operator Pattern -- why production databases wrap StatefulSet with a CRD + controller.
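CloudNativePG (Postgres operator) -- a realistic example of operator-managed failover, backup, and WAL archiving; note that it manages Pods and PVCs directly rather than through a StatefulSet.
Strimzi (Kafka operator) -- Kafka on Kubernetes via CRDs; recent releases manage broker Pods through its own StrimziPodSet resource instead of StatefulSets.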
Kubernetes blog: StatefulSet PVC retention -- background on persistentVolumeClaimRetentionPolicy.
