Module 3: Container Orchestration: Case Studies
These case studies make Kubernetes concrete: reconciliation, rollouts, probes, resources, scheduling, networking, and state.
Case Study 1: Deployment Rollout Without Readiness
Scenario: A new version starts slowly. Kubernetes sends traffic before the app has warmed caches and opened DB connections. Error rate spikes during every rollout.
Source anchor: Kubernetes Deployments, which covers rolling updates, rollout status, and rollback behavior.
Module concepts: Deployment, ReplicaSet, rolling update, readiness probe, rollback.
Wrong Approach
"The container is running, so it is ready."
Better Approach
Separate liveness from readiness:
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
    livenessProbe:
      httpGet:
        path: /live
        port: 8080
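To show where the probes sit in a full rollout spec, here is a minimal Deployment sketch. The workload name `web`, the image reference, the replica count, and every probe timing are assumptions chosen for illustration; only the `/ready` and `/live` paths on port 8080 come from the snippet above.

```yaml
# Sketch only: names, image, replica count, and timings are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # hypothetical workload name
spec:
  replicas: 4
  revisionHistoryLimit: 5        # keep old ReplicaSets so a rollback has a target
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                # at most one extra Pod above replicas during the rollout
      maxUnavailable: 0          # do not drop below the desired number of available Pods
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:2.0.0   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:        # gates Service traffic; failing removes the Pod from endpoints
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:         # failing restarts the container; keep it cheap and dependency-free
            httpGet:
              path: /live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
```

With `maxUnavailable: 0`, an old Pod is scaled down only after a new Pod passes its readiness probe, so a slow start delays the rollout instead of serving errors. `kubectl rollout status deployment/web` reports progress, and `kubectl rollout undo deployment/web` returns to the previous revision.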
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| no readiness | simple | traffic before ready |
| readiness probe | safer rollout | must implement truthful endpoint |
| low maxUnavailable | preserves serving capacity | slower release |
| rollback | fast recovery | needs stable revision and migration safety |
Required Artifact
Write a rollout spec with readiness/liveness, maxSurge/maxUnavailable, rollback trigger, and migration note.
Case Study 2: OOMKilled From Missing Memory Limits
Scenario: A batch pod consumes all node memory. Other workloads are evicted. The app team says "Kubernetes killed us randomly."
Source anchor: Kubernetes Resource Management for Pods and Containers, which describes requests, limits, and OOMKilled behavior.
Module concepts: requests, limits, QoS, OOMKilled, scheduling.
Wrong Approach
Deploy pods without requests/limits and hope the scheduler knows intent.
Better Approach
Set resource contracts:
    resources:
      requests:
        cpu: "250m"
        memory: "512Mi"
      limits:
        memory: "1Gi"
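For context, the same resource contract is shown here inside a batch Job. This is a sketch: the Job name, image, and `backoffLimit` are hypothetical, and the sizes are the ones from the snippet above, not measured values.

```yaml
# Sketch only: Job name, image, and backoffLimit are illustrative assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report           # hypothetical batch workload
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: report
          image: registry.example.com/report:1.4.0   # placeholder image
          resources:
            requests:
              cpu: "250m"        # the scheduler reserves this much CPU on the chosen node
              memory: "512Mi"    # the scheduler reserves this much memory on the chosen node
            limits:
              memory: "1Gi"      # exceeding this makes the container a target for an OOM kill
              # no CPU limit: the container may burst, which avoids throttling
```

Because the requests are lower than the limits and no CPU limit is set, this Pod falls in the Burstable QoS class. The memory limit confines a runaway batch process to its own container rather than letting it push the whole node into memory pressure.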
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| no requests | easy scheduling | noisy neighbor risk |
| requests | scheduler capacity signal | needs measurement |
| memory limit | bounds damage | OOM if too low |
| CPU limit | caps usage | throttling risk |
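The QoS class follows from how requests and limits relate to each other. As a sketch, a Pod in which every container sets CPU and memory requests equal to its limits is classed Guaranteed, which generally places it among the last candidates for eviction under node memory pressure. The Pod name, image, and sizes below are illustrative assumptions.

```yaml
# Sketch only: name, image, and sizes are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: api-guaranteed           # hypothetical Pod name
spec:
  containers:
    - name: api
      image: registry.example.com/api:3.1.0   # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "500m"            # equal to the request, required for Guaranteed
          memory: "1Gi"          # equal to the request, required for Guaranteed
```

`kubectl get pod api-guaranteed -o jsonpath='{.status.qosClass}'` shows the class Kubernetes assigned. The cost is the CPU limit row in the table: a hard CPU cap can throttle bursts, so this shape suits workloads with steady, well-measured usage.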
Required Artifact
Create a resource sizing note from observed p50/p95/p99 CPU/memory and expected burst.
Case Study 3: Service Hides Pod IP Churn
Scenario: Clients call pod IPs directly. After a rollout, pod IPs change and clients fail.
Source anchor: Kubernetes Services, which explains the stable network abstraction over changing Pods.
Module concepts: Pod IP, Service, selector, ClusterIP, DNS.
Wrong Approach
Treat pod IPs as durable endpoints.
Better Approach
Use Services:
Client -> Service DNS name -> ready Pods matched by the selector
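A minimal ClusterIP Service makes that path concrete. In this sketch the Service name `orders`, the namespace `shop`, the `app: orders` label, and the port numbers are all hypothetical; the DNS name in the comment also assumes the default cluster domain `cluster.local`.

```yaml
# Sketch only: names, namespace, label, and ports are illustrative assumptions.
apiVersion: v1
kind: Service
metadata:
  name: orders                   # clients in the same namespace can call http://orders
  namespace: shop                # full name: orders.shop.svc.cluster.local
spec:
  type: ClusterIP
  selector:
    app: orders                  # must match the labels on the Pod template exactly
  ports:
    - name: http
      port: 80                   # port clients connect to on the Service
      targetPort: 8080           # port the container listens on
```

The selector is the fragile part. If the Pod template labels drift from `app: orders`, the Service still resolves in DNS but has no endpoints behind it, so clients see connection failures rather than a missing name.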
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| pod IP direct | simple debug | breaks when the pod is replaced |
| ClusterIP service | stable internal endpoint | selector correctness |
| headless service | direct pod discovery | client handles endpoints |
| LoadBalancer | external entry | cloud cost/exposure |
Required Artifact
Draw service discovery for one workload including pod labels, selector, DNS name, and failure behavior.
Case Study 4: StatefulSet For Identity, Not Just Replicas
Scenario: A database is deployed as a Deployment with three replicas. Pod names and storage identities change, confusing replication membership.
Source anchor: Kubernetes StatefulSets, which describes stable network identities and stable persistent storage for stateful applications.
Module concepts: StatefulSet, stable identity, PVC, headless service.
Wrong Approach
Run every replicated app as a Deployment.
Better Approach
Use StatefulSet when identity matters:
- stable pod names: db-0, db-1, db-2
- a stable PVC per ordinal
- a headless Service for peer discovery
- ordered rollout when needed
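These pieces can be sketched as a headless Service plus a StatefulSet. The name `db` matches the pod names above; the image, port, mount path, and storage size are hypothetical placeholders, and a real database would need its own configuration and probes.

```yaml
# Sketch only: image, port, mount path, and storage size are illustrative assumptions.
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None                # headless: DNS returns individual Pod addresses
  selector:
    app: db
  ports:
    - name: sql
      port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db                # ties Pod DNS names to the headless Service
  replicas: 3                    # creates db-0, db-1, db-2
  podManagementPolicy: OrderedReady   # default: start and stop Pods in ordinal order
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: registry.example.com/db:15   # placeholder image
          ports:
            - name: sql
              containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/db          # hypothetical data directory
  volumeClaimTemplates:
    - metadata:
        name: data               # yields PVCs data-db-0, data-db-1, data-db-2
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
```

Each Pod keeps its name and its PVC across rescheduling, and peers can address one another as `db-0.db`, `db-1.db`, and `db-2.db` within the namespace, which is what replication membership needs.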
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| Deployment | simple stateless scaling | no stable identity |
| StatefulSet | stable identity/storage | more operational care |
| managed database | less ops | provider coupling/cost |
| operator | domain automation | operator complexity |
Required Artifact
Write a workload decision: Deployment vs StatefulSet vs managed service.
Case Study 5: RBAC Overgrant In The Cluster
Scenario: A CI service account has cluster-admin because early deploys failed. A compromised pipeline can now read secrets and mutate every namespace.
Source anchor: Kubernetes RBAC Authorization, which documents Roles, ClusterRoles, RoleBindings, and least privilege.
Module concepts: service account, RBAC, namespace, least privilege, secret exposure.
Wrong Approach
Grant cluster-admin to make deploys pass.
Better Approach
Scope permissions:
    namespace: production-app-a
    verbs: get, list, watch, create, patch
    resources: deployments, services, configmaps
    denied:
      - reading secrets, unless required
      - cluster-wide mutation
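The scoped permissions above can be sketched as a namespaced Role bound to a dedicated service account. The names `ci-deployer` and `app-deployer` are hypothetical; the namespace, verbs, and resources are the ones listed above.

```yaml
# Sketch only: the service account and Role names are illustrative assumptions.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-deployer
  namespace: production-app-a
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role                       # namespaced, unlike ClusterRole
metadata:
  name: app-deployer
  namespace: production-app-a
rules:
  - apiGroups: ["apps"]          # Deployments live in the apps API group
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "patch"]
  - apiGroups: [""]              # core API group: Services and ConfigMaps
    resources: ["services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "patch"]
  # secrets are intentionally absent, so reads are refused
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployer-app-deployer
  namespace: production-app-a
subjects:
  - kind: ServiceAccount
    name: ci-deployer
    namespace: production-app-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: app-deployer
```

RBAC is additive and has no deny rules, so the "denied" items hold only because nothing grants them. Any other binding for the same service account, especially a ClusterRoleBinding, would widen access again and should be part of the review.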
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| cluster-admin | easy | huge blast radius |
| namespace role | scoped | more policy work |
| separate deploy accounts | isolation | more identities |
| read secrets in CI | flexible | leakage risk |
Required Artifact
Write an RBAC review: subject, namespace, verbs, resources, forbidden actions, and audit test.
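One way to make the audit test concrete is a SubjectAccessReview, which asks the API server whether a given subject may perform an action. This sketch assumes a service account named `ci-deployer` in the namespace `production-app-a` (both hypothetical here) and checks that it cannot read Secrets.

```yaml
# Sketch only: the service account name and namespace are illustrative assumptions.
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  user: system:serviceaccount:production-app-a:ci-deployer
  resourceAttributes:
    namespace: production-app-a
    verb: get
    resource: secrets            # expected result: status.allowed is false
```

Submitting the file with `kubectl create -f <file> -o yaml` returns the object with `status.allowed` filled in; the test passes when that value is `false`. The caller needs permission to create SubjectAccessReviews, so run it as a cluster reviewer rather than as the CI account itself.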
Source Map
| Source | Use it for |
|---|---|
| Kubernetes Deployments | rolling updates and rollback |
| Kubernetes resource management | requests, limits, OOMKilled |
| Kubernetes Services | stable networking for pods |
| Kubernetes StatefulSets | stable identity and storage |
| Kubernetes RBAC | cluster authorization |
Completion Standard
- At least three artifacts are completed.
- At least one artifact includes rollout safety.
- At least one artifact includes resources and QoS reasoning.
- At least one artifact includes RBAC least privilege.