
Observability and Troubleshooting: The kubectl Workflow

What This Concept Is

There is no magic debugging command in Kubernetes. There is instead a reliable workflow of five kubectl steps, run in order, that resolves the vast majority of day-2 issues. The goal is to build it into muscle memory so you can triage a broken workload in under a minute without searching documentation.

The canonical loop:

  1. What does the user see? kubectl get <resource> -A / kubectl get pods -A -o wide
  2. Why is this resource in this state? kubectl describe <resource> <name>
  3. What did the controllers observe? kubectl get events --sort-by=.lastTimestamp -A
  4. What did the process say? kubectl logs <pod> -c <container> [--previous]
  5. What does reality look like from inside? kubectl exec -it <pod> -- sh / kubectl debug

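Each step narrows the search space, and you stop as soon as one of them explains the failure. If step 2 shows a scheduling error, you don't need step 4: a Pod that was never scheduled has no container and therefore no logs.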
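The loop above can be wrapped in a small shell function so the first four steps always run in the same order. This is a minimal sketch, not a standard tool: the function name triage and its NAMESPACE POD argument order are assumptions made for illustration, and it assumes kubectl is on PATH and already points at the right cluster.

```shell
# triage NAMESPACE POD: run steps 1-4 of the loop against one pod.
# The name "triage" and its argument order are hypothetical.
triage() {
  ns="$1"; pod="$2"
  if [ -z "$ns" ] || [ -z "$pod" ]; then
    echo "usage: triage NAMESPACE POD" >&2
    return 2
  fi
  echo "== 1. get =="
  kubectl get pod "$pod" -n "$ns" -o wide
  echo "== 2. describe =="
  kubectl describe pod "$pod" -n "$ns"
  echo "== 3. events for this pod, newest last =="
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" --sort-by=.lastTimestamp
  echo "== 4. logs (current, then previous) =="
  kubectl logs "$pod" -n "$ns" --all-containers --tail=50
  # --previous fails when no container has restarted; that is expected.
  kubectl logs "$pod" -n "$ns" --all-containers --tail=50 --previous 2>/dev/null ||
    echo "(no previous container logs)"
  # Step 5 (exec / debug) is interactive, so it is left to the operator.
}
```

Usage would look like: triage default web-6f4-xyz. In practice you would stop reading at the first section that explains the failure.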

Why It Matters Here

Engineers who have never learned this loop default to kubectl logs first, which is usually the wrong first move. Logs tell you what the process said while it was alive; they cannot tell you why the process is not running at all. Most production tickets are resolved at step 2 (describe) or step 3 (events).

Concrete Example

Symptom: kubectl get pods shows web-6f4-xyz 0/1 CrashLoopBackOff 5.

# Step 2
kubectl describe pod web-6f4-xyz
# Look at Events and Last State.
# -> Last State: Terminated, Reason: OOMKilled, Exit Code: 137

# Step 4 (previous container because this one is restarting)
kubectl logs web-6f4-xyz --previous --tail=100
# -> process prints "allocated 600MB"

# Step 1 on the parent to see limits
kubectl get deploy web -o yaml | grep -A3 resources
# -> memory limit: 256Mi

Diagnosis in under a minute: the memory limit (256Mi) is well below what the process allocates in steady state (about 600MB). The fix is either a higher limits.memory or a leak investigation in the application.
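The two facts this diagnosis rests on (the last termination reason and the configured memory limit) can also be pulled out directly with -o jsonpath instead of scanning describe output and grepping YAML. A sketch under assumptions: the helper names last_termination and memory_limits are hypothetical, and any extra arguments (such as -n <namespace>) are simply passed through to kubectl.

```shell
# last_termination POD [kubectl args...]
# Prints "<container> <reason> <exitCode>" for each container in the pod.
# Fields are empty for containers that have never been terminated.
last_termination() {
  kubectl get pod "$@" -o jsonpath='{range .status.containerStatuses[*]}{.name}{" "}{.lastState.terminated.reason}{" "}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# memory_limits DEPLOYMENT [kubectl args...]
# Prints "<container> <memory limit>" for each container in the pod template.
memory_limits() {
  kubectl get deployment "$@" -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{" "}{.resources.limits.memory}{"\n"}{end}'
}
```

For the example above, last_termination web-6f4-xyz followed by memory_limits web should surface the OOMKilled reason, exit code 137, and the 256Mi limit without paging through full output.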

Common mapping from symptom to next step:

STATUS              Most likely cause                                 Next kubectl
Pending             no node matches requests/affinities/taints        describe pod, check Events
ContainerCreating   image pull stuck, volume mount stuck, CNI failed  describe pod, Events
ImagePullBackOff    registry auth, wrong tag                          describe pod, check imagePullSecrets
CrashLoopBackOff    process dies fast                                 logs --previous
OOMKilled           memory limit too low / leak                       describe pod Last State, then limits
Running, 0/1 Ready  readiness probe failing                           describe pod, probe config and endpoint
Service 503         no Ready endpoints                                get endpointslices -l kubernetes.io/service-name=<svc>
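The mapping above is mechanical enough to encode as a lookup. The following sketch turns it into a shell case statement and applies it to the output of kubectl get pods --no-headers; the function names next_step and annotate_pods are hypothetical, and the hint strings are copied from the table rather than taken from any kubectl feature.

```shell
# next_step STATUS: print the suggested next command for a pod STATUS.
next_step() {
  case "$1" in
    Pending|ContainerCreating) echo "describe pod, check Events" ;;
    ImagePullBackOff)          echo "describe pod, check imagePullSecrets" ;;
    CrashLoopBackOff)          echo "logs --previous" ;;
    OOMKilled)                 echo "describe pod Last State, then limits" ;;
    *)                         echo "describe pod" ;;
  esac
}

# annotate_pods: read "NAME READY STATUS ..." lines on stdin and append a hint.
# Expects the column order of `kubectl get pods --no-headers` (without -A).
annotate_pods() {
  while read -r name ready status rest; do
    if [ "$status" = "Running" ] && [ "${ready%/*}" = "${ready#*/}" ]; then
      hint="(healthy)"          # all containers Ready, e.g. 1/1
    elif [ "$status" = "Running" ]; then
      hint="describe pod, probe config and endpoint"
    else
      hint=$(next_step "$status")
    fi
    printf '%s\t%s\t%s\n' "$name" "$status" "$hint"
  done
}
```

Usage would be: kubectl get pods --no-headers | annotate_pods. The hints are a starting point only; the table remains the reference.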

Common Confusion / Misconception

"kubectl logs is the debugger."

It is one of five tools, and usually not the first. Before reading logs, check which container, if any, is actually running, so that you know whose output you are about to read. A Pod in ContainerCreating has no logs; for a container that was OOMKilled, the output of the killed instance is available only with --previous; a Pod with a failing init container has logs only for that init container (-c <init>).

A second confusion: "Events are stored forever." They are not. By default the API server keeps Events for only about an hour (the kube-apiserver --event-ttl setting, which a managed cluster may set differently). A Pod that failed yesterday will have no Events today. This is why an incident that is not caught quickly has to be reconstructed from longer-lived sources such as the audit log and the controller manager logs.
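Because Events expire, one pragmatic habit during an incident is to save them to a file before they disappear. A minimal sketch, assuming a hypothetical helper name snapshot_events and a timestamped default filename of our own choosing:

```shell
# snapshot_events [FILE]: dump all Events in the cluster to a JSON file
# and print the filename. The default filename pattern is an assumption.
snapshot_events() {
  file="${1:-events-$(date -u +%Y%m%dT%H%M%SZ).json}"
  kubectl get events -A --sort-by=.lastTimestamp -o json > "$file" &&
    echo "$file"
}
```

Attaching the resulting file to the incident ticket preserves the controller-side view even after the API server has dropped the Events.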

A third confusion: "If kubectl exec works, the pod is healthy." exec only proves the container has a running PID 1 and a shell. Readiness, application-level health, and Service endpoint inclusion are separate questions.
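Readiness is recorded on the Pod itself, so it can be checked separately from whether exec works. The sketch below reads the Ready condition from status.conditions; the helper name pod_ready is hypothetical, and extra arguments such as -n <namespace> are passed through to kubectl.

```shell
# pod_ready POD [kubectl args...]
# Succeeds (exit 0) only if the pod's Ready condition is "True".
pod_ready() {
  status=$(kubectl get pod "$@" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  [ "$status" = "True" ]
}
```

For example: pod_ready web-0 || echo "web-0 is not Ready". A Pod can pass an exec test and still fail this one, which is exactly the case where it is excluded from Service endpoints.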

How To Use It

Print the five-step loop from the top of this page on a card and keep it near the terminal.

For anything that involves the network: also check kubectl get endpointslices and kubectl get svc before reading logs. A working pod behind a Service with zero endpoints produces the same symptom as a broken pod, and logs will not tell you.
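The "zero endpoints" check can be made explicit by listing the addresses in a Service's EndpointSlices together with their ready flag. This is a sketch under assumptions: the helper name ready_endpoints is hypothetical, it prints only the first address of each endpoint, and it treats a missing ready condition as not ready.

```shell
# ready_endpoints SERVICE [kubectl args...]
# Prints the address of every Ready endpoint behind SERVICE.
# Exits non-zero (with a message on stderr) when there are none.
ready_endpoints() {
  if [ $# -lt 1 ]; then
    echo "usage: ready_endpoints SERVICE [kubectl args...]" >&2
    return 2
  fi
  svc="$1"; shift
  kubectl get endpointslices "$@" -l "kubernetes.io/service-name=$svc" \
    -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{" "}{.conditions.ready}{"\n"}{end}' |
    awk '$2 == "true" { n++; print $1 }
         END { if (n == 0) { print "no Ready endpoints" > "/dev/stderr"; exit 1 } }'
}
```

If ready_endpoints web prints nothing and fails while the Pods themselves are Running, the problem is readiness or the Service selector, not the application logs.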

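kubectl debug is the modern replacement for sshing into a node or baking debug tools into an image. Against a Pod, it can launch an ephemeral debug container that joins the target Pod and shares its network namespace (and, with --target, the target container's process namespace), which lets you run curl, tcpdump, or netstat against the target without modifying it. Against a node (kubectl debug node/<name>), it starts a debugging Pod on that node with the node's filesystem mounted at /host.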

Useful Output Formats and Aliases

A few flags dramatically speed up inspection:

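kubectl get pods -o wide                                   # add NODE and IP columns
kubectl get pods -o yaml                                   # full YAML
kubectl get pods -o jsonpath='{.items[*].spec.nodeName}'   # node name of every pod
kubectl get pods --field-selector=status.phase=Failed      # only pods in phase Failed
kubectl get events --sort-by=.lastTimestamp                # events, newest at the bottom
kubectl logs -f deploy/web -c app --tail=50                # follow one pod of the Deployment
kubectl rollout status deployment/web                      # wait for the rollout to finish
kubectl rollout history deployment/web                     # list recorded revisions
kubectl rollout undo deployment/web --to-revision=3        # roll back to revision 3
kubectl debug pod/web-0 -it --image=nicolaka/netshoot --target=app   # ephemeral debug container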

A shell alias like alias k=kubectl and enabling completion (source <(kubectl completion bash)) pays for itself within one incident.

Metrics and Logs Beyond kubectl

kubectl is enough for the first 80% of troubleshooting. Beyond that you need:

  • metrics-server and kubectl top pod/node for real-time CPU/memory utilization
  • a metrics stack (Prometheus + Grafana, or cloud-native equivalent) for historical data and alerting
  • a logs stack (Loki, Elasticsearch, CloudWatch) for cross-pod log search
  • cluster-level audit logs for "who did what to this resource"

These belong to Module 5 (Cloud Security & Observability). For this module, learn the kubectl layer to mastery first; the tools above only help if you already know what you are looking for.

Check Yourself

  1. When do you use kubectl logs --previous instead of kubectl logs?
  2. Why is kubectl describe usually a better first step than kubectl logs?
  3. What does a Service with zero ready endpoints look like from the client, and how do you distinguish it from a dead Pod?

A Named Failure Catalog

Keep this in your head (or on a card). Each row is a symptom, one command that confirms it, and the fix direction:

Symptom                       Confirm with                  Fix direction
ImagePullBackOff              describe pod events           check tag exists, imagePullSecrets, network to registry
ErrImageNeverPull             describe pod                  imagePullPolicy: Never with missing local image
CrashLoopBackOff              logs --previous               read app log, check config, probes, and limits
OOMKilled                     describe pod (Exit Code 137)  raise limits.memory or fix leak
CreateContainerConfigError    describe pod                  ConfigMap/Secret referenced but missing, or wrong key
Pending: 0/N nodes available  describe pod                  insufficient cpu/memory, taints, affinity mismatches
Running, 0/1 Ready            describe pod                  readiness probe failing
Service 503 / refused         get endpointslices            no Ready Pods behind the Service
Pod stuck Terminating         describe pod                  stuck finalizer, ungraceful node, or long preStop
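For the last row, the three usual suspects can be read from the Pod object in one call. A minimal sketch, assuming a hypothetical helper name terminating_info; it prints the deletion timestamp, any finalizers, and the configured grace period so you can tell a stuck finalizer from a long shutdown.

```shell
# terminating_info POD [kubectl args...]
# Prints the fields that explain why a pod is still Terminating.
terminating_info() {
  kubectl get pod "$@" -o jsonpath='{"deletionTimestamp: "}{.metadata.deletionTimestamp}{"\n"}{"finalizers: "}{.metadata.finalizers}{"\n"}{"gracePeriodSeconds: "}{.spec.terminationGracePeriodSeconds}{"\n"}'
}
```

An old deletionTimestamp with a non-empty finalizers list points at the finalizer; an empty list with a recent timestamp and a large grace period usually just means the preStop hook or shutdown is still running.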

Mini Drill or Application

Cause each of these intentionally and walk through the full triage loop, recording the commands and their output per step:

  1. Break the image tag to cause ImagePullBackOff.
  2. Lower memory limit to 10Mi to cause OOMKilled.
  3. Point the readiness probe at a wrong path to cause "no endpoints."
  4. Apply a Pod with nodeSelector: nonexistent-label: "yes" to cause Pending.

For each, identify which kubectl command from the loop first surfaced the actual cause.
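As a starting point for drill 4, the sketch below emits a Pod manifest with the unsatisfiable nodeSelector from the list. The function name pending_pod_manifest, the Pod name drill-pending, the container name app, and the nginx image are all assumptions chosen for illustration; substitute whatever your practice cluster can pull.

```shell
# pending_pod_manifest: print a Pod manifest that no node can satisfy.
# Pod name, container name, and image are placeholders for the drill.
pending_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: drill-pending
spec:
  nodeSelector:
    nonexistent-label: "yes"
  containers:
    - name: app
      image: nginx
EOF
}
```

Apply it with pending_pod_manifest | kubectl apply -f - and then run kubectl describe pod drill-pending; the Events section should show the scheduler reporting that no node matched. Clean up with kubectl delete pod drill-pending.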

The Metrics and Logs Layer You Will Graduate Into

Once you can run the kubectl loop in your sleep, you will find that most real production clusters add a stack you should recognize by name:

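  • Prometheus scrapes metrics from pods, nodes (kubelet and cAdvisor), and kube-state-metrics. It is the de facto standard on Kubernetes and is separate from metrics-server, which only backs kubectl top and resource-based autoscaling. HPA custom metrics (through an adapter) and KEDA's Prometheus trigger both ride on top of it.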
  • Grafana renders Prometheus (and other) data; most cluster dashboards you will inherit are Grafana JSON.
  • OpenTelemetry Collector is the direction logs/metrics/traces are all converging on -- one pipeline for all three signals.
  • Loki (logs), Tempo (traces), or a cloud equivalent such as CloudWatch, Azure Monitor, or Google Cloud Observability (formerly Stackdriver) round out the picture.

This module's scope is kubectl; Module 5 (Cloud Security & Observability) goes into the metrics/logs/traces pipeline proper. For day-to-day Kubernetes debugging, the five-step loop above still solves most tickets even in clusters with the full stack installed -- the extra layers are for historical analysis, SLO tracking, and cross-pod correlation you cannot do in a one-shot kubectl session.

Read This Only If Stuck