
The Control Plane: api-server, etcd, scheduler, controllers

What This Concept Is

A Kubernetes cluster has two kinds of machines:

  • Control plane nodes run the components that decide what should happen.
  • Worker nodes run the containers.

The control plane is a small set of cooperating processes:

| Component | Role |
| --- | --- |
| kube-apiserver | The only component that writes to etcd. Serves the REST API. All other components read and write through it. |
| etcd | Distributed key-value store. The single source of truth for cluster state. |
| kube-scheduler | Watches new Pods with no node assignment and picks a node based on resources, affinities, and taints. |
| kube-controller-manager | Runs the built-in controllers: Deployment, ReplicaSet, Node, Endpoint, Job, etc. Each is a reconciliation loop. |
| cloud-controller-manager | Runs controllers that integrate with the cloud provider (routes, load balancers, node metadata). |

On every worker node (and usually control plane nodes too):

| Component | Role |
| --- | --- |
| kubelet | Talks to the api-server, syncs desired pod state onto the node via CRI. |
| kube-proxy | Programs the kernel (iptables/ipvs/nftables) to implement Services on the node. |
| Container runtime | containerd or CRI-O, speaks CRI to the kubelet. |

Why It Matters Here

Every operational question in later clusters reduces to "who owns this, and how did the state get there?" If you cannot draw the control plane and label the arrows, you cannot answer:

  • "Why can I kubectl get a pod but the node has nothing running?"
  • "Why does deleting a Deployment not delete its Pods instantly?"
  • "Why did a rolling update resume on its own after the api-server was briefly unreachable?"

Concrete Example

A kubectl apply -f deployment.yaml moves through the control plane in six steps:

  1. kubectl POSTs the Deployment to the api-server.
  2. The api-server validates, admits, and persists it to etcd.
  3. The Deployment controller sees a new Deployment and creates a matching ReplicaSet.
  4. The ReplicaSet controller sees a new ReplicaSet with 0 Pods and creates N Pods with no .spec.nodeName.
  5. The scheduler watches unscheduled Pods, ranks nodes, writes .spec.nodeName back to the api-server.
  6. The kubelet on the chosen node watches pods bound to itself, tells containerd to pull images and start containers, writes status back to the api-server.

No component coordinates with any other component directly. Every read and write of cluster state goes through the api-server. This is why the api-server is the single gateway to cluster state, and why etcd behind it is the durability boundary of the cluster.
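The six steps can be sketched as a toy simulation. This is a minimal illustration, not how Kubernetes is implemented: the shared `store` dict stands in for the api-server plus etcd, every function name is hypothetical, and real components react to watch events rather than being called in sequence. The point it shows is that each component only reads and writes the shared store and never calls another component.

```python
# Toy model (all names hypothetical): one dict stands in for api-server + etcd.
store = {"deployments": {}, "replicasets": {}, "pods": {}}

def apply_deployment(name, replicas):
    """Steps 1-2: the client's desired state is persisted."""
    store["deployments"][name] = {"replicas": replicas}

def deployment_controller():
    """Step 3: make sure every Deployment has a ReplicaSet."""
    for name, dep in store["deployments"].items():
        store["replicasets"].setdefault(name + "-rs", {"replicas": dep["replicas"]})

def replicaset_controller():
    """Step 4: create missing Pods, with no node assigned yet."""
    for rs_name, rs in store["replicasets"].items():
        owned = [p for p in store["pods"].values() if p["owner"] == rs_name]
        for i in range(len(owned), rs["replicas"]):
            store["pods"][f"{rs_name}-{i}"] = {
                "owner": rs_name, "nodeName": None, "phase": "Pending",
            }

def scheduler(nodes):
    """Step 5: give each unscheduled Pod a node (round-robin, for brevity)."""
    unscheduled = [p for p in store["pods"].values() if p["nodeName"] is None]
    for i, pod in enumerate(unscheduled):
        pod["nodeName"] = nodes[i % len(nodes)]

def kubelet(node):
    """Step 6: start Pods bound to this node and report status back."""
    for pod in store["pods"].values():
        if pod["nodeName"] == node and pod["phase"] == "Pending":
            pod["phase"] = "Running"

def run_pipeline(nodes):
    """One pass over every component; safe to repeat because each step is idempotent."""
    deployment_controller()
    replicaset_controller()
    scheduler(nodes)
    for node in nodes:
        kubelet(node)

apply_deployment("web", 3)
run_pipeline(["node-a", "node-b"])
for name, pod in sorted(store["pods"].items()):
    print(name, pod["nodeName"], pod["phase"])
```

Running `run_pipeline` a second time changes nothing, which is the property that lets real controllers recover after an interruption: they compare desired and observed state and act only on the difference.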

Common Confusion / Misconception

"The scheduler places the pod."

The scheduler picks a node. The kubelet on that node actually runs the pod. If the scheduler says "node X" and node X's kubelet cannot pull the image or fails admission locally, the pod stays in Pending or fails even though scheduling "succeeded."

A second confusion: "If etcd is down, running workloads go down." They do not. Already-running containers keep running because the runtime does not need the api-server tick-by-tick. What breaks is change: no new pods can be scheduled, no rollouts, no service-endpoint updates. The data plane survives short control plane outages; that is by design.

A third confusion: "The controller-manager is one controller." It is one binary hosting dozens of separate reconcile loops -- Node, Deployment, ReplicaSet, Endpoint, ServiceAccount, Token, Namespace, Job, CronJob, and more. Each loop is independent.

How To Use It

When a cluster misbehaves, localize the failure along the pipeline:

  1. Does kubectl get return the resource? If not, api-server or auth.
  2. Does kubectl describe show a scheduling event? If the pod is Pending with "0/N nodes available," scheduler is rejecting.
  3. Does the node have the pod? If yes, kubelet-level events and container logs.
  4. Did the desired state change in etcd but not take effect? The responsible controller is not running, is failing, or is rate-limited.

Scheduler: A Closer Look

The scheduler is not a scheduler in the OS sense. It does not time-slice CPUs. It is a placement engine that runs once per unscheduled Pod. Its algorithm for each Pod:

  1. Filter nodes: remove nodes that don't satisfy nodeSelector, nodeAffinity, taints without matching tolerations, or resource requests.
  2. Score remaining nodes: plugins add points for spreading across zones, keeping workloads near their caches, respecting inter-pod affinity, etc.
  3. Bind the highest-scoring node by writing spec.nodeName through the api-server.

From that point the scheduler forgets the Pod. It will not rebalance a cluster because a Pod can now run better elsewhere. Rebalancing requires an external descheduler.
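The filter, score, bind sequence can be sketched in a few lines. This is a simplified model under stated assumptions: the `Node` and `Pod` classes, the single CPU resource, and the "most CPU left over" scoring rule are all hypothetical stand-ins, and taints are reduced to plain strings. The real scheduler runs many filter and score plugins.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    name: str
    free_cpu_m: int                              # free CPU in millicores
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)     # simplified: taint keys only

@dataclass
class Pod:
    name: str
    cpu_request_m: int
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)
    node_name: Optional[str] = None

def feasible(pod, node):
    """Filter step: every hard constraint must hold."""
    selector_ok = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    taints_ok = node.taints <= pod.tolerations   # every taint must be tolerated
    fits = node.free_cpu_m >= pod.cpu_request_m
    return selector_ok and taints_ok and fits

def score(pod, node):
    """Score step (assumed rule): prefer the node with the most CPU left over."""
    return node.free_cpu_m - pod.cpu_request_m

def schedule(pod, nodes):
    """Filter, score, bind. Returns the node name, or None if the Pod stays Pending."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None
    best = max(candidates, key=lambda n: score(pod, n))
    pod.node_name = best.name                    # stands in for the bind write
    best.free_cpu_m -= pod.cpu_request_m
    return best.name

nodes = [Node("a", 500), Node("b", 2000, taints={"gpu"}), Node("c", 1500)]
print(schedule(Pod("web", 1000), nodes))         # only "c" passes the filter
print(schedule(Pod("big", 4000), nodes))         # nothing fits, so None
```

Note that `schedule` never revisits a Pod that already has `node_name` set, which mirrors the "place, then forget" behaviour described above.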

This has practical consequences:

  • A Pod with requests that no node can satisfy stays Pending until a node that fits appears. The scheduler retries it as the cluster changes, but nothing rewrites its requests to make it fit.
  • If you add a large node later, existing Pods do not migrate to it.
  • Pod scheduling is strictly forward-looking; understand it as "place, then forget."

Check Yourself

  1. Which control plane component is the only one that writes to etcd?
  2. Why can the data plane survive a short api-server outage?
  3. What does the scheduler actually produce as output, and where does it write it?

etcd: The Durability Boundary

etcd is the only stateful component. Every other control plane component is effectively stateless -- it reads from etcd (via the api-server) and acts. The practical implications:

  • Backups are about etcd. A cluster backup is an etcd snapshot plus the PKI material. Workload YAML can be regenerated from Git; etcd data cannot.
  • Quorum matters. Running etcd with 3 members tolerates 1 failure; 5 tolerates 2. Running with 2 members is strictly worse than 1 (you can lose quorum on either failure).
  • Large objects hurt. etcd is tuned for small values. ConfigMaps or Secrets that approach the 1 MiB object size limit pressure write latency for every controller in the cluster.
  • Compaction and defrag matter. Over time etcd accumulates historical revisions; periodic compaction and defrag keep it healthy. Most managed Kubernetes platforms handle this for you.
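The quorum numbers in the list above follow from simple majority arithmetic: a cluster of n members needs floor(n/2) + 1 of them to agree, and tolerates whatever is left over. A small sketch (the function names are illustrative, not part of any etcd tooling):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with this many members."""
    return members // 2 + 1

def failures_tolerated(members: int) -> int:
    """How many members can fail while a majority remains."""
    return members - quorum(members)

for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {failures_tolerated(n)}")
```

Two members tolerate zero failures, the same as one member, but with twice as many machines that can fail. Four members tolerate one failure, the same as three, which is why odd cluster sizes are the usual choice.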

Mini Drill or Application

On a running cluster, run:

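kubectl get componentstatuses   # deprecated since v1.19; output may be incomplete on newer clusters
kubectl get --raw='/readyz?verbose'   # per-check api-server readiness, including its etcd check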
kubectl -n kube-system get pods
kubectl get events --sort-by=.lastTimestamp

For each control plane component you can see, write one sentence describing what breaks in the cluster if that component is down for 60 seconds.

The api-server Request Path

Understanding what happens between kubectl apply -f pod.yaml and a durable etcd write helps diagnose many cluster surprises:

  1. Transport: client sends HTTPS with a client cert or bearer token.
  2. Authentication plugins turn that into a user identity (system:serviceaccount:ns:sa, or a named user).
  3. Authorization plugins (RBAC, Node, ABAC, Webhook) decide whether that identity can perform this verb on this resource.
  4. Mutating admission: built-in plugins (e.g. ServiceAccount defaulting, LimitRanger defaults) and mutating webhooks alter the object.
  5. Validation checks schema and immutable fields.
  6. Validating admission: built-in plugins (e.g. Pod Security admission, which enforces the Pod Security Standards) and validating webhooks (OPA/Gatekeeper, Kyverno) get the final say.
  7. Persistence: the object is written to etcd (transactionally, with an increasing resourceVersion).
  8. Watch fan-out: every watcher (controllers, kubectl -w, informers) receives the change.

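When an apply "does nothing," the step that ate it is usually 4 or 6. kubectl apply --v=8 prints the HTTP requests and responses: a rejection message names the webhook or admission plugin that denied the request, and comparing the object you sent with the object the server returned shows which fields were changed or dropped.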
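The ordering of the request path is easier to reason about as a chain of functions. The sketch below is a hypothetical toy, not the api-server's code: the token table, the RBAC set, the defaulting rule, and the "no privileged pods" policy are all invented for illustration. What it does reflect from the list above is the order of stages, and that nothing is persisted unless every earlier stage passes.

```python
# Hypothetical toy pipeline; none of these names are real Kubernetes APIs.
TOKENS = {"t-alice": "alice"}
RBAC = {("alice", "create", "pods")}

class Rejected(Exception):
    def __init__(self, stage, reason):
        super().__init__(f"{stage}: {reason}")
        self.stage = stage

def authenticate(req):
    user = TOKENS.get(req["token"])
    if user is None:
        raise Rejected("authentication", "unknown token")
    return user

def authorize(user, req):
    if (user, req["verb"], req["resource"]) not in RBAC:
        raise Rejected("authorization", f"{user} cannot {req['verb']} {req['resource']}")

def mutate(obj):
    # Mutating admission: fill in a default the client did not send.
    obj.setdefault("serviceAccountName", "default")
    return obj

def validate_schema(obj):
    if not isinstance(obj.get("name"), str):
        raise Rejected("validation", "name must be a string")

def validating_admission(obj):
    # Policy runs after all mutation is finished, so it sees the final object.
    if obj.get("privileged"):
        raise Rejected("validating admission", "privileged pods are not allowed")

def handle(req, store):
    """Run every stage in order; persist only if none of them rejects."""
    user = authenticate(req)
    authorize(user, req)
    obj = mutate(dict(req["object"]))
    validate_schema(obj)
    validating_admission(obj)
    store["resourceVersion"] += 1               # monotonically increasing version
    obj["resourceVersion"] = store["resourceVersion"]
    store["objects"][obj["name"]] = obj
    return obj

store = {"resourceVersion": 0, "objects": {}}
req = {"token": "t-alice", "verb": "create", "resource": "pods",
       "object": {"name": "web"}}
print(handle(req, store))
```

The object that comes back differs from the one that was sent (it gained `serviceAccountName`), which is the same effect that makes a field appear or vanish after a real apply. A request rejected at any stage raises `Rejected` with the stage name and leaves `store` untouched.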

Learning by (Re)Building

The best way to internalize the control plane is to watch somebody assemble it from first principles. Kelsey Hightower's Kubernetes The Hard Way boots a cluster without kubeadm: you issue the PKI certs, start etcd, start the kube-apiserver binary with its flags, start kube-scheduler and kube-controller-manager, and join workers. Doing that once makes the architecture tangible in a way that no diagram replaces.

For a quicker version of the same insight, run a local kind cluster and kubectl -n kube-system get pods -- you will see kube-apiserver, kube-scheduler, kube-controller-manager, and etcd running as static pods on the control plane node, alongside coredns (a Deployment) and kube-proxy (a DaemonSet). The binaries are the same ones Hightower boots by hand; kubeadm, which kind uses under the hood, just wraps the flags.

Read This Only If Stuck