Kubernetes Pod Troubleshooting Guide | Debug CrashLoopBackOff, OOMKilled, Pending
Key Takeaways
Pod crashes, OOM kills, and scheduling failures are the most common Kubernetes pain points. This guide gives you a systematic troubleshooting workflow and kubectl one-liners for every scenario.
Troubleshooting Workflow
When a pod misbehaves, follow this sequence:
# 1. Check pod status
kubectl get pods
# 2. Describe the pod (events, conditions, resource usage)
kubectl describe pod <pod-name>
# 3. Check logs
kubectl logs <pod-name>
kubectl logs <pod-name> --previous # last crash
# 4. Shell into container (if it's running)
kubectl exec -it <pod-name> -- /bin/sh
# 5. Check events for the namespace
kubectl get events --sort-by='.lastTimestamp'
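The five steps above can be wrapped into one shell function for repeated use (a sketch; the name `k8s-triage` and the `--tail` limits are my own choices, not part of kubectl):

```shell
# Hypothetical helper: run the full triage sequence for one pod.
# Usage: k8s-triage <pod-name> [namespace]
k8s-triage() {
  pod="$1"
  ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns"
  kubectl describe pod "$pod" -n "$ns"
  kubectl logs "$pod" -n "$ns" --tail=50
  kubectl logs "$pod" -n "$ns" --previous --tail=50
  kubectl get events -n "$ns" --sort-by='.lastTimestamp'
}
```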
1. CrashLoopBackOff
The container starts and exits repeatedly. Kubernetes backs off exponentially between restarts.
Diagnose:
kubectl describe pod <pod-name>
# Look at: Last State, Exit Code, Reason
kubectl logs <pod-name> --previous
# Application output from the last crashed container
Common causes and fixes:
| Exit Code | Meaning | Fix |
|---|---|---|
| 1 | Application error | Check --previous logs |
| 127 | Command not found | Fix command/args in spec |
| 137 | OOMKilled | Increase memory limit |
| 143 | Terminated by SIGTERM | Handle SIGTERM for graceful shutdown |
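Codes above 128 follow the Unix convention "killed by signal (code minus 128)": 137 = 128 + 9 (SIGKILL, what the OOM killer sends) and 143 = 128 + 15 (SIGTERM). A quick sketch of the decoding:

```shell
# Decode a container exit code into its terminating signal.
code=137
if [ "$code" -gt 128 ]; then
  echo "killed by signal $((code - 128))"   # 137 -> signal 9 (SIGKILL)
else
  echo "application exit code $code"
fi
```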
# Check what command the container is running
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].command}'
# Check environment variables (missing config?)
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].env}'
Typical fix — missing environment variable:
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: app-secrets
key: database-url
2. OOMKilled (Exit Code 137)
Your container exceeded its memory limit.
Diagnose:
kubectl describe pod <pod-name>
# Look for: OOMKilled, Last State reason
# Check current memory usage
kubectl top pod <pod-name>
kubectl top pod <pod-name> --containers
Fix — increase memory limit:
resources:
requests:
memory: "256Mi"
limits:
memory: "512Mi" # increase this
Find memory-hungry pods cluster-wide:
kubectl top pods --all-namespaces --sort-by=memory | head -20
Tips:
- Set limits at roughly 2× your typical peak usage
- Use requests for scheduling, limits for enforcement
- Profile your app before setting limits; guessing leads to OOM kills or resource waste
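The 2× rule of thumb is easy to apply to `kubectl top pod` output (a sketch; the 220Mi peak is a made-up example value):

```shell
# Example: observed peak was 220Mi, so set the limit near 2x that.
peak_mib=220
limit_mib=$((peak_mib * 2))
echo "requests.memory: ${peak_mib}Mi"   # request near typical usage
echo "limits.memory: ${limit_mib}Mi"    # limit at ~2x peak
```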
3. ImagePullBackOff / ErrImagePull
Kubernetes can’t pull the container image.
Diagnose:
kubectl describe pod <pod-name>
# Look at Events: Failed to pull image "..."
Common causes:
# 1. Wrong image name or tag
# Fix: correct the image in your Deployment spec
image: myapp:v1.2.3 # verify this tag exists in your registry
# 2. Private registry — missing imagePullSecret
kubectl create secret docker-registry regcred \
--docker-server=registry.example.com \
--docker-username=user \
--docker-password=password
# Reference in pod spec:
imagePullSecrets:
- name: regcred
# 3. Rate limiting (Docker Hub)
# Fix: authenticate or use a mirror
4. Pending
Pod is stuck waiting to be scheduled.
Diagnose:
kubectl describe pod <pod-name>
# Look at Events: "0/3 nodes are available: ..."
Cause: Insufficient resources
kubectl describe nodes | grep -A5 "Allocated resources"
kubectl top nodes
# Fix: reduce requests or scale the cluster
resources:
requests:
cpu: "100m" # not "1000m" unless you need it
memory: "128Mi"
Cause: Node selector / affinity mismatch
kubectl get nodes --show-labels
# Verify your nodeSelector labels exist on nodes
# Fix nodeSelector mismatch
nodeSelector:
kubernetes.io/arch: amd64 # make sure nodes have this label
Cause: Taints with no toleration
kubectl describe node <node-name> | grep Taint
# Add toleration to your pod spec
tolerations:
- key: "dedicated"
operator: "Equal"
value: "gpu"
effect: "NoSchedule"
Cause: PVC not bound
kubectl get pvc
# If STATUS is Pending, the PV doesn't exist or StorageClass is wrong
kubectl describe pvc <pvc-name>
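If the claim is Pending because no matching PV or StorageClass exists, compare its storageClassName against `kubectl get storageclass`. A minimal PVC sketch (the names `app-data` and `standard` are assumptions; use your cluster's actual StorageClass):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data                   # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard       # must match an existing StorageClass
  resources:
    requests:
      storage: 1Gi
```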
5. Running But Not Ready
Pod is running but failing readiness probe — traffic isn’t routed to it.
kubectl describe pod <pod-name>
# Look for: Readiness probe failed
Fix — check your probe config:
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10 # wait before first check
periodSeconds: 5
failureThreshold: 3
timeoutSeconds: 2
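With these numbers, the first check runs 10s after the container starts, and a pod that begins failing is marked NotReady after roughly failureThreshold × periodSeconds of consecutive failures; the arithmetic:

```shell
# Probe timing derived from the config above.
initial_delay=10; period=5; failures=3
echo "first check at ${initial_delay}s"
echo "marked NotReady after ~$((failures * period))s of consecutive failures"
```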
Test the probe endpoint manually:
kubectl exec -it <pod-name> -- curl http://localhost:8080/health
6. Network Issues
Pod can’t reach another service:
# Check service exists and has endpoints
kubectl get service <service-name>
kubectl get endpoints <service-name>
# If ENDPOINTS is <none>, no pods match the selector
# Test DNS resolution from inside a pod
kubectl exec -it <pod-name> -- nslookup my-service.default.svc.cluster.local
# Test connectivity
kubectl exec -it <pod-name> -- curl http://my-service:8080/health
# Check NetworkPolicy
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <policy-name>
Service selector mismatch (most common cause of empty endpoints):
# Service selector
kubectl get service my-app -o jsonpath='{.spec.selector}'
# Output: {"app":"my-app"}
# Pod labels
kubectl get pods --show-labels
# Make sure pods have app=my-app label
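For the endpoints to populate, the Service selector must equal the labels on the pod template; a minimal matching pair (a sketch with assumed names):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app          # must match the pod template labels below
  ports:
    - port: 8080
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app      # this is what the Service selector matches
    spec:
      containers:
        - name: app
          image: myapp:v1.2.3
```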
7. Init Container Failures
kubectl describe pod <pod-name>
# Look at Init Containers section
kubectl logs <pod-name> -c <init-container-name>
Common: init container waiting for a database that isn’t ready.
initContainers:
- name: wait-for-db
image: busybox
command: ['sh', '-c', 'until nc -z postgres 5432; do echo waiting; sleep 2; done']
8. Useful kubectl One-Liners
# All pods not Running across all namespaces
kubectl get pods -A --field-selector=status.phase!=Running
# Watch pod restarts
kubectl get pods -w
# Pod resource usage sorted by CPU
kubectl top pods --sort-by=cpu
# Describe all pods matching a label
kubectl describe pods -l app=myapp
# Copy file from pod
kubectl cp <pod-name>:/var/log/app.log ./app.log
# Port forward to local machine
kubectl port-forward pod/<pod-name> 8080:8080
# Run a debug pod with full tools
kubectl run debug --image=nicolaka/netshoot -it --rm -- bash
# Force delete a stuck Terminating pod
kubectl delete pod <pod-name> --grace-period=0 --force
# View resource quotas and limits
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
9. Pod Status Quick Reference
| Status | Meaning | First action |
|---|---|---|
| Pending | Not scheduled | describe pod → check Events |
| Init:0/1 | Init container running | logs -c <init-name> |
| PodInitializing | Init done, main container starting | Wait or check image pull |
| Running | Running but may not be Ready | Check readiness probe |
| CrashLoopBackOff | App crashing repeatedly | logs --previous |
| OOMKilled | Memory limit exceeded | Increase memory limit |
| Terminating | Being deleted | Wait; force delete if stuck |
| ImagePullBackOff | Can't pull image | Check image name + registry creds |
| Error | Container exited with error | logs --previous + exit code |
Systematic Checklist
□ kubectl get pods → identify status
□ kubectl describe pod → read Events section
□ kubectl logs --previous → app-level error
□ kubectl top pod → memory/CPU usage
□ kubectl get events → cluster-level context
□ kubectl exec -- curl → test endpoints from inside
□ kubectl get endpoints → verify service routing
Most Kubernetes issues fall into five categories: application errors (logs), resource constraints (top/describe), scheduling conflicts (describe node), image problems (events), and network misconfigurations (endpoints/NetworkPolicy). Work through the checklist in order and you'll resolve the vast majority of issues in minutes rather than hours.