The Situation
It's 2 AM. PagerDuty fires. You check the cluster and see pods restarting every 60 to 90 seconds. kubectl get pods shows a restart count climbing fast. You pull the events and there it is:
Warning Unhealthy pod/api-server-7d9f4b8c6-xk2pq Liveness probe failed: Get "http://10.0.0.1:8080/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Kubernetes thinks your app is dead. It kills the container, restarts it, waits for startup, then the probe fails again before the app is ready. You're in a loop, and it won't break itself.
Why This Happens
The liveness probe sends an HTTP GET to your container. No response within timeoutSeconds (default: 1 second)? Failed. After failureThreshold consecutive failures (default: 3), Kubernetes restarts the container.
Four things cause this. Most are fixable in under 10 minutes.
- Probe timeout too tight: Your app sometimes takes longer than 1 second to respond to /healthz, especially under load or during JVM/GC pauses.
- App is slow to start: The probe fires before the app is ready to accept connections. initialDelaySeconds wasn't set, or was set to something optimistic like 5 seconds.
- Resource starvation: The container is CPU-throttled. The health endpoint can't respond in time because the process is waiting for CPU cycles.
- Health endpoint doing too much: Your /healthz route checks DB connections, runs queries, or calls external services. Fine on a quiet server; fatal under any real load.
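The restart rule described above can be sketched in a few lines. This is a simplified model, not the kubelet's actual code: the function name is made up for illustration, and it assumes each probe result has already been reduced to pass or fail.

```python
def first_restart_index(probe_results, failure_threshold=3):
    """Return the index of the probe that triggers a restart, or None.

    probe_results is an iterable of booleans: True means the probe
    succeeded, False means it failed or timed out.
    """
    consecutive_failures = 0
    for index, ok in enumerate(probe_results):
        if ok:
            consecutive_failures = 0  # any success resets the streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return index
    return None
```

Two failures followed by a success never trigger a restart. Only an unbroken run of failureThreshold failures does, which is why intermittent slowness can go unnoticed until load pushes every probe over the timeout.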
Diagnose First
Don't touch the probe config yet. Confirm what's actually failing.
Check events and restart count
# See restart count
kubectl get pods -n your-namespace
# See probe failure events
kubectl describe pod api-server-7d9f4b8c6-xk2pq -n your-namespace | grep -A 20 Events
Check current probe config
kubectl get deployment api-server -n your-namespace -o yaml | grep -A 20 livenessProbe
Test the health endpoint from inside the pod
# Exec into the pod before it restarts
kubectl exec -it api-server-7d9f4b8c6-xk2pq -n your-namespace -- sh
# Hit the endpoint and time it
time wget -qO- http://localhost:8080/healthz
Anything over 1 second is your problem. A complete failure (connection refused, or an error status) means the app isn't listening on that port or the health handler itself has a bug.
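A single timed request can get lucky. If Python is available somewhere that can reach the endpoint (inside the container, or through a port-forward, which adds some latency of its own), a small loop gives a better picture. This is a sketch: the function name and defaults are assumptions, not part of any Kubernetes tooling.

```python
import time
import urllib.error
import urllib.request


def time_endpoint(url, attempts=10, timeout_s=1.0):
    """Request url repeatedly; return a list of (ok, elapsed_seconds)."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as response:
                # The kubelet counts any status from 200 to 399 as success.
                ok = 200 <= response.status < 400
        except (urllib.error.URLError, OSError):
            # Covers timeouts, refused connections, and 4xx/5xx responses.
            ok = False
        samples.append((ok, time.perf_counter() - start))
    return samples
```

Call it as time_endpoint("http://localhost:8080/healthz") and look at the slowest sample, not the average. The probe fails on the slow ones.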
Check resource pressure
kubectl top pod api-server-7d9f4b8c6-xk2pq -n your-namespace
kubectl top node
CPU near the limit? Throttling is likely slowing every request, including health checks.
Fix 1: Tune the Probe Timing (Most Common Fix)
Kubernetes ships with probe defaults designed for fast, simple apps. A 1-second timeout and 10-second interval are too aggressive for most production services. Adjust your deployment:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # Wait 30s after container starts before first probe
  periodSeconds: 15         # Check every 15s (not every 10s)
  timeoutSeconds: 5         # Allow 5s for response (not default 1s)
  failureThreshold: 3       # Still restart after 3 consecutive failures
  successThreshold: 1
Then apply:
kubectl apply -f deployment.yaml
JVM apps often need initialDelaySeconds: 60 or higher. Spring Boot with a full context load can take 45 seconds on a cold start. For apps under heavy load, timeoutSeconds: 5 is a safe starting point; go to 10 if you're still seeing failures.
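Looser timing has a cost: a genuinely hung container takes longer to get restarted. A rough estimate of that window, assuming the app hangs right after a successful probe and ignoring kubelet scheduling jitter (the function name is made up for illustration):

```python
def worst_case_detection_seconds(period_s, timeout_s, failure_threshold):
    """Rough upper bound on the time from a hang to the restart decision.

    The app hangs just after a passing probe, then failure_threshold
    probes fire one period apart, and the last one has to time out.
    """
    return period_s * failure_threshold + timeout_s


defaults = worst_case_detection_seconds(period_s=10, timeout_s=1, failure_threshold=3)
tuned = worst_case_detection_seconds(period_s=15, timeout_s=5, failure_threshold=3)
print(defaults, tuned)
```

With the defaults that comes to about 31 seconds; with the tuned values above, about 50. That is usually a fair trade for not restarting healthy pods, but it is worth knowing the number.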
Fix 2: Add a Startup Probe (Kubernetes 1.18+)
Slow startup is its own problem. Fighting it with a large initialDelaySeconds is a guess: you're hardcoding a number that will be wrong in CI and wrong on a degraded node. Use a startup probe instead (alpha in Kubernetes 1.16, enabled by default since 1.18). It holds off the liveness probe until the startup probe has succeeded:
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # Try for up to 5 minutes (30 * 10s)
  periodSeconds: 10
  timeoutSeconds: 5
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
The startup probe runs first. The moment it succeeds once, it stops and liveness takes over. No guessing required.
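The startup budget is just failureThreshold times periodSeconds, so you can work backwards from the slowest startup you have actually observed. A small sketch, with made-up helper names:

```python
import math


def startup_budget_seconds(failure_threshold, period_s):
    """Total time the startup probe gives the app before a restart."""
    return failure_threshold * period_s


def failure_threshold_for(max_startup_s, period_s=10):
    """Smallest failureThreshold that covers max_startup_s of startup."""
    return math.ceil(max_startup_s / period_s)
```

For the config above, startup_budget_seconds(30, 10) is 300 seconds. If your worst observed cold start is 95 seconds, failure_threshold_for(95) gives 10; add headroom on top of that for degraded nodes.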
Fix 3: Fix the Health Endpoint Itself
A slow /healthz is often the sneakiest culprit. Strip out any logic that doesn't belong there. The liveness probe has one job: confirm the process is alive. Checking database connectivity, running cache pings, or validating external APIs belongs in /readyz behind a readiness probe, not here.
A correct liveness endpoint looks like this:
// Go
http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("ok"))
})
// Node.js
app.get('/healthz', (req, res) => res.status(200).send('ok'));
# Python/FastAPI
@app.get("/healthz")
def healthz():
    return {"status": "ok"}
It should return 200 in under 5 milliseconds. If yours doesn't, strip it down until it does.
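To make the liveness/readiness split concrete, here is a minimal sketch using only the Python standard library. The dependencies_ready function is a hypothetical placeholder for your own checks, and the port is an assumption carried over from the examples above.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def dependencies_ready():
    """Hypothetical placeholder: put DB, cache, and API checks here."""
    return True


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: no I/O, no dependencies, just prove the process answers.
            self._reply(200, b"ok")
        elif self.path == "/readyz":
            # Readiness: dependency checks belong here, not in /healthz.
            if dependencies_ready():
                self._reply(200, b"ready")
            else:
                self._reply(503, b"not ready")
        else:
            self._reply(404, b"not found")

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs in this sketch


def make_server(port=8080):
    """Build the server; call serve_forever() on the result to run it."""
    return ThreadingHTTPServer(("0.0.0.0", port), ProbeHandler)
```

When a dependency goes down, /readyz starts returning 503 and the pod is pulled out of rotation, while /healthz keeps returning 200 so nothing gets restarted.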
Fix 4: Increase Resource Limits
CPU throttling is easy to miss because the app looks healthy in every other way. When a container hits its CPU limit, the kernel throttles it, and suddenly a 2ms health check takes 2 seconds. Check whether this is happening:
# cgroup v1 nodes
kubectl exec api-server-7d9f4b8c6-xk2pq -n your-namespace -- cat /sys/fs/cgroup/cpu/cpu.stat | grep throttled
# cgroup v2 nodes
kubectl exec api-server-7d9f4b8c6-xk2pq -n your-namespace -- cat /sys/fs/cgroup/cpu.stat | grep throttled
If throttled_time (cgroup v1) or throttled_usec (cgroup v2) is non-zero and growing, your app is CPU-starved. Raise the limit:
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1000m"   # Raise this if throttled
    memory: "512Mi"
Start by doubling the CPU limit, redeploy, and recheck the throttling counter.
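A raw counter is hard to judge on its own. The share of scheduler periods in which the container was throttled is easier to compare before and after a change. A sketch that parses the cpu.stat text, assuming the nr_periods and nr_throttled fields are present (the sample numbers below are invented for illustration):

```python
def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup cpu.stat file into ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats):
    """Fraction of enforcement periods in which the container was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Invented sample; paste the real output of cat cpu.stat here.
sample = "nr_periods 1200\nnr_throttled 300\nthrottled_time 45000000000\n"
print(throttled_ratio(parse_cpu_stat(sample)))
```

In this invented sample the container was throttled in a quarter of all periods. If doubling the limit doesn't bring the ratio down, the limit wasn't the bottleneck.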
Fix 5: Switch to a TCP or Exec Probe
Sometimes the simplest probe is the right one. TCP just checks if the port is open: no HTTP overhead, no handler code to worry about. Exec runs a command inside the container:
# TCP probe: confirms port is listening
livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  timeoutSeconds: 5
# Exec probe: runs a command inside the container
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 10
The exec pattern works well for apps that write /tmp/healthy on startup and delete it when they want to signal a problem. Coarse, but reliable.
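On the application side, the marker-file pattern needs only two operations. A minimal sketch, assuming Python 3.8+ and a class name invented for illustration:

```python
from pathlib import Path


class HealthMarker:
    """Creates and removes the file that the exec probe cats."""

    def __init__(self, path="/tmp/healthy"):
        self.path = Path(path)

    def mark_healthy(self):
        # Call once startup has finished; cat then exits 0.
        self.path.touch()

    def mark_unhealthy(self):
        # Call when the app decides it needs a restart; cat then exits
        # non-zero. missing_ok (Python 3.8+) makes repeat calls harmless.
        self.path.unlink(missing_ok=True)
```

Be conservative about when you call mark_unhealthy. Removing the file is a request to be killed, so reserve it for states the process can't recover from on its own.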
Verify the Fix
# Watch pods stabilize
kubectl get pods -n your-namespace -w
# Confirm restart count stops climbing
kubectl get pods -n your-namespace
# RESTARTS column should stay flat
# Check events are clean (use a pod name from the output above; it changes after a redeploy)
kubectl describe pod your-pod-name -n your-namespace | tail -20
# Should show no Unhealthy warnings
Wait at least 3 to 5 probe cycles before calling it stable. With periodSeconds: 15, that's 45 to 75 seconds of clean output before you close the laptop.
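If you'd rather not eyeball the RESTARTS column, you can compare two snapshots of kubectl get pods -n your-namespace -o json taken a minute or two apart. A sketch with made-up function names; it reads only the metadata.name and status.containerStatuses[].restartCount fields:

```python
import json


def restart_counts(pod_list_json):
    """Map pod name to total container restarts from a pod list in JSON."""
    counts = {}
    for pod in json.loads(pod_list_json).get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        counts[pod["metadata"]["name"]] = sum(
            status.get("restartCount", 0) for status in statuses
        )
    return counts


def still_restarting(before, after):
    """Names of pods whose restart count grew between two snapshots."""
    return sorted(
        name for name, count in after.items() if count > before.get(name, 0)
    )
```

An empty list from still_restarting means the counts stayed flat across the window.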
Prevention
- Never ship default probe values to production. The defaults (1s timeout, 10s period) are designed for demos, not real workloads. Tune them per service.
- Keep liveness and readiness separate. Liveness = is the process alive. Readiness = is it safe to receive traffic. Mixing them causes cascading restarts when a dependency goes down.
- Load test your health endpoint before deploying. Run ab -n 1000 -c 50 http://localhost:8080/healthz and check p99 latency. If it's over 200ms, fix it before Kubernetes finds out.
- Set terminationGracePeriodSeconds high enough for in-flight requests to complete. 30 seconds is a reasonable starting point for most APIs.
- Review probe config in code review like any other production setting. Skipping it is how 2 AM pages happen.
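Part of that review can be automated. The sketch below flags liveness probe fields that were left to their defaults. The function and its rules are illustrative assumptions, not an existing linter; it expects the livenessProbe mapping already parsed from your manifest into a dict.

```python
DEFAULTED_FIELDS = ("timeoutSeconds", "periodSeconds", "failureThreshold")


def lint_liveness_probe(probe, has_startup_probe=False):
    """Return a list of warnings for a livenessProbe mapping."""
    warnings = [
        f"{field} not set; the Kubernetes default applies"
        for field in DEFAULTED_FIELDS
        if field not in probe
    ]
    if not has_startup_probe and "initialDelaySeconds" not in probe:
        warnings.append(
            "no startupProbe and no initialDelaySeconds; probing starts immediately"
        )
    return warnings
```

Run it in CI against every deployment manifest and fail the build on any warning, or just print the warnings in the pull request.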

