Observability and Troubleshooting: The kubectl Workflow
What This Concept Is
There is no magic debugging command in Kubernetes. There is, however, a reliable workflow of five kubectl steps, run in order, that resolves the vast majority of day-2 issues. The goal is to build it into muscle memory so you can triage a broken workload in under a minute without searching documentation.
The canonical loop:
1. What does the user see? kubectl get <resource> -A / kubectl get pods -A -o wide
2. Why is this resource in this state? kubectl describe <resource> <name>
3. What did the controllers observe? kubectl get events --sort-by=.lastTimestamp -A
4. What did the process say? kubectl logs <pod> -c <container> [--previous]
5. What does reality look like from inside? kubectl exec -it <pod> -- sh / kubectl debug
Each step narrows the search space. If step 2 shows a scheduling error, you don't need step 4.
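The loop can be wrapped in a small shell function so the steps always run in the same order. This is a minimal sketch, not a standard tool: the name triage_pod and its argument order are hypothetical, and it stops at step 4 because step 5 is interactive.

```shell
# Hypothetical helper: run steps 1-4 of the loop for a single pod.
# Usage: triage_pod <namespace> <pod>
triage_pod() {
  ns="$1"
  pod="$2"

  echo "== 1. get: what does the user see? =="
  kubectl get pod "$pod" -n "$ns" -o wide

  echo "== 2. describe: why is it in this state? =="
  kubectl describe pod "$pod" -n "$ns"

  echo "== 3. events: what did the controllers observe? =="
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp

  echo "== 4. logs: what did the process say? =="
  # Prefer the previous container's logs (useful after a crash);
  # fall back to the current container if there was no restart.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null ||
    kubectl logs "$pod" -n "$ns" --tail=50
}
```

In practice you would stop reading as soon as one step explains the symptom; the function only guarantees that nothing is skipped.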
Why It Matters Here
Engineers who have never learned this loop default to kubectl logs first, which is usually the wrong first move. Logs tell you what the process said while it was alive; they cannot tell you why the process is not running at all. Most production tickets are resolved at step 2 (describe) or step 3 (events).
Concrete Example
Symptom: kubectl get pods shows web-6f4-xyz 0/1 CrashLoopBackOff 5.
# Step 2
kubectl describe pod web-6f4-xyz
# Look at Events and Last State.
# -> Last State: Terminated, Reason: OOMKilled, Exit Code: 137
# Step 4 (previous container because this one is restarting)
kubectl logs web-6f4-xyz --previous --tail=100
# -> process prints "allocated 600MB"
# Step 1 on the parent to see limits
kubectl get deploy web -o yaml | grep -A3 resources
# -> memory limit: 256Mi
Diagnosis in under a minute: the memory limit (256Mi) is smaller than what the process allocates in steady state (about 600MB). The fix is either a higher limits.memory or a leak investigation upstream.
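Exit Code 137 in the example is not arbitrary. By convention an exit status above 128 means the process was terminated by a signal (status minus 128), and 137 - 128 = 9 is SIGKILL, which is what the kernel OOM killer sends. The sketch below decodes that and pulls the same two facts (last termination reason, memory limit) with jsonpath instead of reading describe output. The function names are hypothetical, and the index [0] assumes a single-container pod.

```shell
# Decode a container exit code: values above 128 mean "killed by signal N".
exit_code_signal() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "exited with status $code"
  fi
}

# Reason the previous container instance terminated (e.g. OOMKilled).
# Assumes the container of interest is the first one in the pod.
last_termination_reason() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
}

# Memory limits configured on the Deployment's pod template.
deploy_memory_limits() {
  kubectl get deploy "$1" \
    -o jsonpath='{.spec.template.spec.containers[*].resources.limits.memory}'
}
```

For the example above, exit_code_signal 137 prints "killed by signal 9".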
Common mapping from symptom to next step:
| STATUS | Most likely cause | Next kubectl |
|---|---|---|
| Pending | no node matches requests/affinities/taints | describe pod, check Events |
| ContainerCreating | image pull stuck, volume mount stuck, CNI failed | describe pod, Events |
| ImagePullBackOff | registry auth, wrong tag | describe pod, check imagePullSecrets |
| CrashLoopBackOff | process dies fast | logs --previous |
| OOMKilled | memory limit too low / leak | describe pod Last State, then limits |
| Running, 0/1 Ready | readiness probe failing | describe pod, probe config and endpoint |
| Service 503 | no Ready endpoints | get endpointslices -l kubernetes.io/service-name=<svc> |
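The mapping above is a plain lookup, so it can be encoded as a case statement. This is only a sketch to make the table executable; next_step is a hypothetical name, and the suggestions are copied from the table rather than derived from the cluster.

```shell
# Hypothetical lookup: print the suggested next command for a pod STATUS.
next_step() {
  case "$1" in
    Pending|ContainerCreating)
      echo "kubectl describe pod <pod>   # read the Events section" ;;
    ImagePullBackOff)
      echo "kubectl describe pod <pod>   # check image tag and imagePullSecrets" ;;
    CrashLoopBackOff)
      echo "kubectl logs <pod> --previous" ;;
    OOMKilled)
      echo "kubectl describe pod <pod>   # check Last State, then memory limits" ;;
    Running)
      echo "kubectl describe pod <pod>   # if 0/1 Ready, check the readiness probe" ;;
    *)
      echo "kubectl describe pod <pod>   # unknown status, start with describe" ;;
  esac
}
```

Note that every branch except CrashLoopBackOff points at describe, which is the point of the table.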
Common Confusion / Misconception
"kubectl logs is the debugger."
It is one of five tools, and usually not the first. Before reading logs, check that the Pod is running the code you think it is. A pod in ContainerCreating has no logs; a pod that was OOMKilled has logs only with --previous; a pod with a failing init container has logs only for that init container (-c <init>).
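For the init-container case, a short sketch: list the init container names with jsonpath, then fetch logs for each by name. The helper name init_logs is hypothetical; the jsonpath field and the -c flag are standard kubectl.

```shell
# Hypothetical helper: print logs for every init container of a pod.
# Usage: init_logs <pod> [namespace]
init_logs() {
  pod="$1"
  ns="${2:-default}"
  # Space-separated init container names; empty if the pod has none.
  names="$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.initContainers[*].name}')"
  for c in $names; do
    echo "== init container: $c =="
    kubectl logs "$pod" -n "$ns" -c "$c" --tail=50
  done
}
```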
A second confusion: "Events are stored forever." They are not. By default events are retained about an hour. A pod that failed yesterday will have no Events today. This is why incidents need to be correlated with cluster-wide logs (audit log, controller manager logs) if they are not caught quickly.
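The retention window is set by the kube-apiserver flag --event-ttl, which defaults to one hour. If you are mid-incident and cannot investigate immediately, one cheap habit is to dump the events to a file before they expire. A minimal sketch, with a hypothetical function name and file naming scheme:

```shell
# Hypothetical helper: save all current events before they expire.
# Usage: snapshot_events [directory]
snapshot_events() {
  dir="${1:-.}"
  file="$dir/events-$(date -u +%Y%m%dT%H%M%SZ).txt"
  # Print the file name only if the dump succeeded.
  kubectl get events -A --sort-by=.lastTimestamp -o wide > "$file" &&
    echo "$file"
}
```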
A third confusion: "If kubectl exec works, the pod is healthy." exec only proves the container has a running PID 1 and a shell. Readiness, application-level health, and Service endpoint inclusion are separate questions.
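Readiness can be checked directly instead of inferred from exec. The sketch below reads the Pod's Ready condition with a jsonpath filter; pod_ready and is_ready are hypothetical names, and Service endpoint inclusion still needs its own check.

```shell
# Prints True, False, or Unknown: the pod's Ready condition.
# Usage: pod_ready <pod> [namespace]
pod_ready() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Hypothetical wrapper: turn the condition into an exit status for scripts.
is_ready() {
  [ "$(pod_ready "$@")" = "True" ]
}
```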
How To Use It
Print the five-step loop on a card and keep it near the terminal.
For anything that involves the network: also check kubectl get endpointslices and kubectl get svc before reading logs. A working pod behind a Service with zero endpoints produces the same symptom as a broken pod, and logs will not tell you.
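A sketch of that network pre-check: show the Service, then the EndpointSlices selected by the standard kubernetes.io/service-name label. The helper name svc_backends is hypothetical.

```shell
# Hypothetical helper: show a Service and the EndpointSlices behind it.
# Usage: svc_backends <service> [namespace]
svc_backends() {
  svc="$1"
  ns="${2:-default}"
  kubectl get svc "$svc" -n "$ns" -o wide
  kubectl get endpointslices -n "$ns" \
    -l "kubernetes.io/service-name=$svc"
}
```

If no slices are listed, or the listed slices carry no addresses, the problem is selector or readiness related and pod logs are unlikely to explain it.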
kubectl debug is the modern replacement for SSHing into a node: it can launch an ephemeral debug container inside a target Pod that shares the Pod's network namespace (and, with --target, the target container's process namespace), which lets you run curl, tcpdump, or netstat against the target without modifying its image or restarting it.
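Two common shapes of kubectl debug, sketched as wrappers. The function names and the busybox default image are assumptions for illustration; the pod form attaches an ephemeral container, while the node form starts a debugging pod on the node with the node's root filesystem mounted at /host.

```shell
# Attach an ephemeral debug container to a running pod.
# --target shares the process namespace of the named container.
# Usage: debug_pod <pod> <container> [image]
debug_pod() {
  kubectl debug -it "pod/$1" --image="${3:-busybox}" --target="$2" -- sh
}

# Start a debugging pod on a node instead of using SSH.
# Usage: debug_node <node> [image]
debug_node() {
  kubectl debug -it "node/$1" --image="${2:-busybox}"
}
```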
Useful Output Formats and Aliases
A few flags dramatically speed up inspection:
kubectl get pods -o wide # add NODE + IPs
kubectl get pods -o yaml # full YAML
kubectl get pods -o jsonpath='{.items[*].spec.nodeName}'
kubectl get pods --field-selector=status.phase=Failed
kubectl get events --sort-by=.lastTimestamp
kubectl logs -f deploy/web -c app --tail=50
kubectl rollout status deployment/web
kubectl rollout history deployment/web
kubectl rollout undo deployment/web --to-revision=3
kubectl debug pod/web-0 -it --image=nicolaka/netshoot --target=app
A shell alias like alias k=kubectl and enabling completion (source <(kubectl completion bash)) pays for itself within one incident.
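One way to make the alias and completion permanent is to append them to your bash startup file. The helper below is a hypothetical convenience (it is not idempotent, so run it once); the complete line is what lets completion work for the k alias as well.

```shell
# Hypothetical helper: append alias and completion setup to a bash rc file.
# Usage: setup_kubectl_shell [rcfile]
setup_kubectl_shell() {
  rc="${1:-$HOME/.bashrc}"
  cat >> "$rc" <<'EOF'
alias k=kubectl
source <(kubectl completion bash)
# Make completion work for the alias as well.
complete -o default -F __start_kubectl k
EOF
}
```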
Metrics and Logs Beyond kubectl
kubectl is enough for the first 80% of troubleshooting. Beyond that you need:
- metrics-server and kubectl top pod/node for real-time CPU/memory utilization
- a metrics stack (Prometheus + Grafana, or cloud-native equivalent) for historical data and alerting
- a logs stack (Loki, Elasticsearch, CloudWatch) for cross-pod log search
- cluster-level audit logs for "who did what to this resource"
These belong to Module 5 (Cloud Security & Observability). For this module, learn the kubectl layer to mastery first; the tools above only help if you already know what you are looking for.
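As a bridge between the two layers, kubectl top is the one metrics command that still belongs to the kubectl workflow. A minimal sketch, assuming metrics-server is installed; top_memory is a hypothetical name.

```shell
# Hypothetical helper: heaviest memory consumers, then node utilization.
# Requires metrics-server; without it kubectl top returns an error.
# Usage: top_memory [namespace]
top_memory() {
  ns="${1:-default}"
  kubectl top pod -n "$ns" --containers --sort-by=memory
  kubectl top node
}
```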
Check Yourself
- When do you use kubectl logs --previous instead of kubectl logs?
- Why is kubectl describe usually a better first step than kubectl logs?
- What does a Service with zero EndpointSlices look like from the client, and how do you distinguish it from a dead Pod?
A Named Failure Catalog
Keep this in your head (or on a card). Each row is a symptom, one command that confirms it, and the fix direction:
| Symptom | Confirm with | Fix direction |
|---|---|---|
| ImagePullBackOff | describe pod events | check tag exists, imagePullSecrets, network to registry |
| ErrImageNeverPull | describe pod | imagePullPolicy: Never with missing local image |
| CrashLoopBackOff | logs --previous | read app log, check config, probes, and limits |
| OOMKilled | describe pod (Exit Code 137) | raise limits.memory or fix leak |
| CreateContainerConfigError | describe pod | ConfigMap/Secret referenced but missing, or wrong key |
| Pending: 0/N nodes available | describe pod | insufficient cpu/memory, taints, affinity mismatches |
| Running, 0/1 Ready | describe pod | readiness probe failing |
| Service 503 / refused | get endpointslices | no Ready Pods behind the Service |
| Pod stuck Terminating | describe pod | stuck finalizer, ungraceful node, or long preStop |
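Before you can match a symptom against the catalog you need to find the affected pods. The sketch below filters the cluster-wide pod list down to pods that are in a non-Running, non-Completed status, or that are Running but not fully Ready. The name unhealthy_pods is hypothetical, and the column positions assume the default kubectl get pods -A output.

```shell
# Hypothetical helper: list pods that are not healthy, cluster-wide.
# Columns with -A --no-headers: NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  kubectl get pods -A --no-headers | awk '
    {
      split($3, ready, "/")            # READY is "ready/total"
      bad_status = ($4 != "Running" && $4 != "Completed")
      not_ready  = ($4 == "Running" && ready[1] != ready[2])
      if (bad_status || not_ready) print $1 "/" $2, $4, $3
    }'
}
```

Each output line is namespace/name, STATUS, and READY, which is enough to pick the matching row in the catalog.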
Mini Drill or Application
Cause each of these intentionally and walk through the full triage loop, recording the commands and their output per step:
- Break the image tag to cause ImagePullBackOff.
- Lower memory limit to 10Mi to cause OOMKilled.
- Point the readiness probe at a wrong path to cause "no endpoints."
- Apply a Pod with nodeSelector: nonexistent-label: "yes" to cause Pending.
For each, identify which kubectl command from the loop first surfaced the actual cause.
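Three of the four drills can be triggered from the command line; a sketch follows. Everything specific here is an assumption for illustration: a Deployment named web with a container named app, the nginx image, and the pod name drill-pending. The readiness-probe drill is left out because it depends on how your manifest defines the probe. Run these only in a scratch cluster or namespace.

```shell
# Drill: break the image tag (Deployment "web", container "app" assumed).
drill_bad_image() {
  kubectl set image deployment/web app=nginx:no-such-tag
}

# Drill: shrink the memory limit so the container is OOM-killed.
# The request is lowered too, since a request above the limit is rejected.
drill_low_memory() {
  kubectl set resources deployment web -c app \
    --limits=memory=10Mi --requests=memory=10Mi
}

# Drill: a pod whose nodeSelector matches no node stays Pending.
drill_unschedulable() {
  kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: drill-pending
spec:
  nodeSelector:
    nonexistent-label: "yes"
  containers:
  - name: app
    image: nginx
EOF
}
```

Undo the first two with kubectl rollout undo deployment/web and the third with kubectl delete pod drill-pending.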
The Metrics and Logs Layer You Will Graduate Into
Once kubectl runs in your sleep, most real production clusters add a stack you should recognize by name:
- Prometheus scrapes metrics from pods, nodes, kube-state-metrics, and the metrics-server. The de facto standard on Kubernetes; HPA's custom-metrics and KEDA's triggers both ride on top.
- Grafana renders Prometheus (and other) data; most cluster dashboards you will inherit are Grafana JSON.
- OpenTelemetry Collector is the direction logs/metrics/traces are all converging on -- one pipeline for all three signals.
- Loki (logs), Tempo (traces), or a cloud equivalent (CloudWatch, Stackdriver, Azure Monitor) round out the picture.
This module's scope is kubectl; Module 5 (Cloud Security & Observability) goes into the metrics/logs/traces pipeline proper. For day-to-day Kubernetes debugging, the five-step loop above still solves most tickets even in clusters with the full stack installed -- the extra layers are for historical analysis, SLO tracking, and cross-pod correlation you cannot do in a one-shot kubectl session.
Read This Only If Stuck
- Linux Command Line: Viewing processes dynamically with top -- the semantic basis for kubectl top and any metric dashboard.
- Linux Command Line: Standard input, output, and error -- container logs are stdout/stderr captured by the runtime; the abstractions start here.
- Kubernetes: Troubleshooting Applications -- the official triage guide, symptom by symptom.
- Kubernetes: kubectl Reference -- every verb, flag, and output format.
- Kubernetes: Debugging Running Pods with kubectl debug -- ephemeral containers and process-namespace sharing.
- Kubernetes: Determine the Reason for Pod Failure -- mapping termination reasons to fixes.
- Kubernetes: Resource Metrics Pipeline -- metrics-server, the metrics API, and what kubectl top queries.
- Prometheus documentation: Monitoring Kubernetes -- the canonical metrics stack for Kubernetes clusters.
- kube-prometheus stack -- the batteries-included "Prometheus + Grafana + alerts" deployment most teams start with.
- OpenTelemetry: Kubernetes -- the converging standard for logs, metrics, and traces on Kubernetes.