Observability and Troubleshooting: The kubectl Workflow

What This Concept Is

There is no magic debugging command in Kubernetes. There is a reliable workflow made of four or five kubectl invocations, run in order, that resolves the vast majority of day-2 issues. The goal is to build it into muscle memory so you can triage a broken workload in under a minute without searching documentation.

The canonical loop:

1. What does the user see? kubectl get <resource> -A / kubectl get pods -A -o wide
2. Why is this resource in this state? kubectl describe <resource> <name>
3. What did the controllers observe? kubectl get events --sort-by=.lastTimestamp -A
4. What did the process say? kubectl logs <pod> -c <container> [--previous]
5. What does reality look like from inside? kubectl exec -it <pod> -- sh / kubectl debug

Each step narrows the search space. If step 2 shows a scheduling error, you don't need step 4.
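The read-only part of the loop (steps 1 to 4) can be chained into a single helper. The following is a minimal sketch, not a standard tool: the function name `triage_pod` and the `default` namespace fallback are assumptions made for illustration, and step 5 is left out because `exec` and `debug` are interactive.

```shell
# Hypothetical helper: run the read-only steps of the loop for one pod.
# Usage: triage_pod <pod> [namespace]
triage_pod() (
  pod="$1"
  ns="${2:-default}"

  echo "== 1. get =="
  kubectl get pod "$pod" -n "$ns" -o wide

  echo "== 2. describe =="
  kubectl describe pod "$pod" -n "$ns"

  echo "== 3. events for this pod, most recent last =="
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp

  echo "== 4. logs (current instance, then previous if one exists) =="
  kubectl logs "$pod" -n "$ns" --all-containers --tail=50
  kubectl logs "$pod" -n "$ns" --all-containers --tail=50 --previous 2>/dev/null ||
    echo "(no previous container instance)"
)
```

Calling `triage_pod web-6f4-xyz` prints all four views in order, so you can stop reading at the first step that explains the state.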

Why It Matters Here

Engineers who have never learned this loop default to kubectl logs first, which is usually the wrong first move. Logs tell you what the process said while it was alive; they cannot tell you why the process is not running at all. Most production tickets are resolved at step 2 (describe) or step 3 (events).

Concrete Example

Symptom: kubectl get pods shows web-6f4-xyz 0/1 CrashLoopBackOff 5.

# Step 2
kubectl describe pod web-6f4-xyz
# Look at Events and Last State.
#  -> Last State: Terminated, Reason: OOMKilled, Exit Code: 137

# Step 4 (previous container because this one is restarting)
kubectl logs web-6f4-xyz --previous --tail=100
#  -> process prints "allocated 600MB"

# Step 1 on the parent to see limits
kubectl get deploy web -o yaml | grep -A3 resources
#  -> memory limit: 256Mi

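Diagnosis in under a minute: the memory limit (256Mi) is smaller than the steady-state usage (about 600MB), so the container is OOM-killed every time it starts. The fix is a higher limits.memory or, if that usage is unexpected, a leak investigation upstream.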
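The same facts can be pulled out without scanning the full `describe` output. This is a sketch under two assumptions: the pod has a single container (index 0), and the function name `last_termination` is made up for this example.

```shell
# Hypothetical helper: show why the previous container instance died,
# next to the memory limit it was running under.
# Usage: last_termination <pod> [namespace]
last_termination() (
  pod="$1"
  ns="${2:-default}"

  reason=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
  code=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
  limit=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.containers[0].resources.limits.memory}')

  # Empty values mean "never terminated" or "no limit set".
  printf 'reason=%s exitCode=%s memoryLimit=%s\n' \
    "${reason:-none}" "${code:-none}" "${limit:-unset}"
)
```

For the pod in the example above, the three fields would line up as OOMKilled, 137, and 256Mi.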

Common mapping from symptom to next step:

`STATUS`	Most likely cause	Next `kubectl`
`Pending`	no node matches requests/affinities/taints	`describe pod`, check Events
`ContainerCreating`	image pull stuck, volume mount stuck, CNI failed	`describe pod`, Events
`ImagePullBackOff`	registry auth, wrong tag	`describe pod`, check `imagePullSecrets`
`CrashLoopBackOff`	process dies fast	`logs --previous`
`OOMKilled`	memory limit too low / leak	`describe pod` Last State, then limits
`Running, 0/1 Ready`	readiness probe failing	`describe pod`, probe config and endpoint
Service 503	no Ready endpoints	`get endpointslices -l kubernetes.io/service-name=<svc>`
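The pod rows of the table can be written down as a lookup. The sketch below only prints a suggested command, it does not run anything; the function name `next_step` and the fallback for unlisted statuses are assumptions for illustration.

```shell
# Hypothetical lookup: map a pod STATUS to the next command from the table.
# Usage: next_step <STATUS> [pod]
next_step() (
  status="$1"
  pod="${2:-<pod>}"
  case "$status" in
    Pending|ContainerCreating)
      echo "kubectl describe pod $pod   # read the Events section" ;;
    ImagePullBackOff)
      echo "kubectl describe pod $pod   # check image tag and imagePullSecrets" ;;
    CrashLoopBackOff)
      echo "kubectl logs $pod --previous" ;;
    OOMKilled)
      echo "kubectl describe pod $pod   # check Last State, then memory limits" ;;
    *)
      echo "kubectl describe pod $pod   # status not in the table: start at step 2" ;;
  esac
)
```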

Common Confusion / Misconception

"kubectl logs is the debugger."

It is one of five tools, and usually not the first. Before reading logs, check that the Pod is running the code you think it is. A pod in ContainerCreating has no logs; a pod that was OOMKilled has logs only with --previous; a pod with a failing init container has logs only for that init container (-c <init>).
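For the init-container case, the container names have to be looked up before `-c` can be used. A minimal sketch, assuming a hypothetical helper name `init_logs`:

```shell
# Hypothetical helper: print the logs of every init container in a pod.
# Usage: init_logs <pod> [namespace]
init_logs() (
  pod="$1"
  ns="${2:-default}"

  # Space-separated list of init container names; empty if the pod has none.
  names=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.initContainers[*].name}')

  for c in $names; do
    echo "== init container: $c =="
    kubectl logs "$pod" -n "$ns" -c "$c" --tail=50
  done
)
```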

A second confusion: "Events are stored forever." They are not. By default events are retained about an hour. A pod that failed yesterday will have no Events today. This is why incidents need to be correlated with cluster-wide logs (audit log, controller manager logs) if they are not caught quickly.
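Because events expire, it can help to save them to a file at the start of an incident. The helper below is a sketch; the name `snapshot_events` and the file naming scheme are assumptions, not a convention from any tool.

```shell
# Hypothetical helper: save all current events to a timestamped file
# before the API server garbage-collects them.
# Usage: snapshot_events [directory]
snapshot_events() (
  dir="${1:-.}"
  file="$dir/events-$(date -u +%Y%m%dT%H%M%SZ).txt"

  # Print the file path only if the snapshot was written successfully.
  kubectl get events -A --sort-by=.lastTimestamp -o wide > "$file" &&
    echo "$file"
)
```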

A third confusion: "If kubectl exec works, the pod is healthy." exec only proves the container has a running PID 1 and a shell. Readiness, application-level health, and Service endpoint inclusion are separate questions.
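Readiness can be checked directly from the Pod's status conditions rather than inferred from a working `exec`. A minimal sketch, with the function name `pod_ready` as an assumption:

```shell
# Hypothetical helper: succeed only if the pod's Ready condition is "True".
# Usage: pod_ready <pod> [namespace] && echo "ready"
pod_ready() (
  pod="$1"
  ns="${2:-default}"

  status=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')

  [ "$status" = "True" ]
)
```

A pod can pass `kubectl exec` and still fail this check, which is exactly the case where a Service stops routing traffic to it.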

How To Use It

Print the five-step loop (get, describe, events, logs, exec/debug) on a card and keep it near the terminal.

For anything that involves the network: also check kubectl get endpointslices and kubectl get svc before reading logs. A working pod behind a Service with zero endpoints produces the same symptom as a broken pod, and logs will not tell you.
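The endpoint check can be reduced to a single number. The sketch below counts endpoints whose `conditions.ready` is explicitly `true`; the function name `ready_endpoints` is an assumption, and endpoints with an unset ready condition are not counted here even though consumers often treat them as ready.

```shell
# Hypothetical helper: count Ready endpoints behind a Service.
# Usage: ready_endpoints <service> [namespace]
ready_endpoints() (
  svc="$1"
  ns="${2:-default}"

  # One line per endpoint: first address, a tab, then the ready condition.
  kubectl get endpointslices -n "$ns" -l "kubernetes.io/service-name=$svc" \
    -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{"\t"}{.conditions.ready}{"\n"}{end}' |
    awk -F'\t' '$2 == "true" { n++ } END { print n + 0 }'
)
```

A result of 0 means the Service has nowhere to send traffic, regardless of what the pod logs say.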

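kubectl debug covers the cases exec cannot: it can launch an ephemeral debug container that shares the namespaces of a target Pod, which lets you run curl, tcpdump, or netstat against the target without rebuilding its image or restarting it, even when that image has no shell. Pointed at a node (kubectl debug node/<name>), it is also the modern replacement for sshing into a node.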

Useful Output Formats and Aliases

A few flags dramatically speed up inspection:

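kubectl get pods -o wide                        # add NODE + IPs
kubectl get pods -o yaml                        # full YAML
kubectl get pods -o jsonpath='{.items[*].spec.nodeName}'   # node name of every pod
kubectl get pods --field-selector=status.phase=Failed      # only pods in phase Failed
kubectl get events --sort-by=.lastTimestamp     # events, most recent last
kubectl logs -f deploy/web -c app --tail=50     # follow container app in one pod of the Deployment
kubectl rollout status deployment/web           # wait until the rollout finishes
kubectl rollout history deployment/web          # list recorded revisions
kubectl rollout undo deployment/web --to-revision=3        # roll back to revision 3
kubectl debug pod/web-0 -it --image=nicolaka/netshoot --target=app   # ephemeral container targeting container app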

A shell alias like alias k=kubectl and shell completion (source <(kubectl completion bash)) pay for themselves within one incident. To make completion work for the alias as well, add complete -o default -F __start_kubectl k.

Metrics and Logs Beyond kubectl

kubectl is enough for the first 80% of troubleshooting. Beyond that you need:

metrics-server and kubectl top pod/node for real-time CPU/memory utilization
a metrics stack (Prometheus + Grafana, or cloud-native equivalent) for historical data and alerting
a logs stack (Loki, Elasticsearch, CloudWatch) for cross-pod log search
cluster-level audit logs for "who did what to this resource"

These belong to Module 5 (Cloud Security & Observability). For this module, learn the kubectl layer to mastery first; the tools above only help if you already know what you are looking for.
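As a bridge between the two layers, `kubectl top` gives a quick ranking of current memory consumers. The sketch below assumes metrics-server is installed in the cluster (without it `kubectl top` returns an error), and the function name `top_memory` is made up for this example.

```shell
# Hypothetical helper: show the N pods using the most memory right now.
# Requires metrics-server. Usage: top_memory [N]
top_memory() (
  n="${1:-10}"
  kubectl top pod -A --sort-by=memory --no-headers | head -n "$n"
)
```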

Check Yourself

When do you use kubectl logs --previous instead of kubectl logs?
Why is kubectl describe usually a better first step than kubectl logs?
What does a Service with zero Ready endpoints look like from the client, and how do you distinguish it from a dead Pod?

A Named Failure Catalog

Keep this in your head (or on a card). Each row is a symptom, one command that confirms it, and the fix direction:

Symptom	Confirm with	Fix direction
`ImagePullBackOff`	`describe pod` events	check tag exists, `imagePullSecrets`, network to registry
`ErrImageNeverPull`	`describe pod`	`imagePullPolicy: Never` with missing local image; preload the image or change the policy
`CrashLoopBackOff`	`logs --previous`	read app log, check config, probes, and limits
`OOMKilled`	`describe pod` (Exit Code 137)	raise `limits.memory` or fix leak
`CreateContainerConfigError`	`describe pod`	ConfigMap/Secret referenced but missing, or wrong key
`Pending: 0/N nodes available`	`describe pod`	insufficient cpu/memory, taints, affinity mismatches
`Running, 0/1 Ready`	`describe pod`	readiness probe failing
`Service 503 / refused`	`get endpointslices`	no Ready Pods behind the Service
`Pod stuck Terminating`	`describe pod`	stuck finalizer, ungraceful node, or long `preStop`
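To find which rows of the catalog apply right now, the pod listing can be filtered down to pods that are not healthy. This is a sketch that assumes the column layout of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, ...); the function name `unhealthy_pods` is hypothetical.

```shell
# Hypothetical filter: keep pods that are not Running, or Running but not fully Ready.
# Usage: kubectl get pods -A --no-headers | unhealthy_pods
unhealthy_pods() {
  awk '{
    split($3, ready, "/")                  # READY column, e.g. "0/1"
    if ($4 == "Completed") next            # finished Jobs are not failures
    if ($4 != "Running" || ready[1] != ready[2])
      print $1 "/" $2 "\t" $4 "\t" $3      # namespace/name, STATUS, READY
  }'
}
```

Each line of output is a candidate for the triage loop, with the STATUS column pointing at the matching catalog row.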

Mini Drill or Application

Cause each of these intentionally and walk through the full triage loop, recording the commands and their output per step:

Break the image tag to cause ImagePullBackOff.
Lower memory limit to 10Mi to cause OOMKilled.
Point the readiness probe at a wrong path to cause "no endpoints."
Apply a Pod with nodeSelector: nonexistent-label: "yes" to cause Pending.

For each, identify which kubectl command from the loop first surfaced the actual cause.
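As a starting point for the last drill item, the sketch below prints a Pod manifest that cannot be scheduled. The pod name `drill-pending`, the container name, and the `registry.k8s.io/pause:3.9` image are assumptions chosen for illustration; any small image would do.

```shell
# Hypothetical drill helper: print a Pod manifest whose nodeSelector matches no node.
pending_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: drill-pending
spec:
  nodeSelector:
    nonexistent-label: "yes"
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
EOF
}
```

Apply it with `pending_pod_manifest | kubectl apply -f -`, walk the loop until the scheduling message appears, then clean up with `kubectl delete pod drill-pending`.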

The Metrics and Logs Layer You Will Graduate Into

Once kubectl runs in your sleep, most real production clusters add a stack you should recognize by name:

Prometheus scrapes metrics from pods, nodes, kube-state-metrics, and the metrics-server. It is the de facto standard on Kubernetes; HPA custom metrics (through a metrics adapter) and KEDA's Prometheus trigger can both be fed from it.
Grafana renders Prometheus (and other) data; most cluster dashboards you will inherit are Grafana JSON.
OpenTelemetry Collector is the direction logs/metrics/traces are all converging on -- one pipeline for all three signals.
Loki (logs), Tempo (traces), or a cloud equivalent (CloudWatch, Stackdriver, Azure Monitor) round out the picture.

This module's scope is kubectl; Module 5 (Cloud Security & Observability) goes into the metrics/logs/traces pipeline proper. For day-to-day Kubernetes debugging, the five-step loop above still solves most tickets even in clusters with the full stack installed -- the extra layers are for historical analysis, SLO tracking, and cross-pod correlation you cannot do in a one-shot kubectl session.

Read This Only If Stuck

Linux Command Line: Viewing processes dynamically with top -- the semantic basis for kubectl top and any metric dashboard.
Linux Command Line: Standard input, output, and error -- container logs are stdout/stderr captured by the runtime; the abstractions start here.
Kubernetes: Troubleshooting Applications -- the official triage guide, symptom by symptom.
Kubernetes: kubectl Reference -- every verb, flag, and output format.
Kubernetes: Debugging Running Pods with kubectl debug -- ephemeral containers and process-namespace sharing.
Kubernetes: Determine the Reason for Pod Failure -- mapping termination reasons to fixes.
Kubernetes: Resource Metrics Pipeline -- metrics-server, the metrics API, and what kubectl top queries.
Prometheus documentation: Monitoring Kubernetes -- the canonical metrics stack for Kubernetes clusters.
kube-prometheus stack -- the batteries-included "Prometheus + Grafana + alerts" deployment most teams start with.
OpenTelemetry: Kubernetes -- the converging standard for logs, metrics, and traces on Kubernetes.
