Fix Liveness Probe Failed: context deadline exceeded Causing Constant Pod Restarts in Kubernetes

The Situation

It's 2 AM. PagerDuty fires. You check the cluster and see pods restarting every 60–90 seconds. kubectl get pods shows a restart count climbing fast. You pull the events and there it is:

Warning  Unhealthy  pod/api-server-7d9f4b8c6-xk2pq  Liveness probe failed: Get "http://10.0.0.1:8080/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Kubernetes thinks your app is dead. It kills the container, restarts it, waits for startup — then the probe fails again before the app is ready. You're in a loop, and it won't break itself.

Why This Happens

The liveness probe sends an HTTP GET to your container. No response within timeoutSeconds (default: 1 second)? Failed. After failureThreshold consecutive failures (default: 3), Kubernetes restarts the container.
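To make the counting rule concrete, here is a minimal Python sketch of how consecutive failures add up to a restart. It is a simplified model, not the kubelet's actual code; the function name first_restart_index and the sample latencies are made up for illustration.

```python
from typing import Optional, Sequence


def first_restart_index(
    latencies_s: Sequence[float],
    timeout_s: float = 1.0,       # Kubernetes default timeoutSeconds
    failure_threshold: int = 3,   # Kubernetes default failureThreshold
) -> Optional[int]:
    """Return the index of the probe that would trigger a restart, or None.

    A probe counts as failed when its response time exceeds the timeout.
    Any success resets the consecutive-failure counter (simplified model).
    """
    consecutive = 0
    for i, latency in enumerate(latencies_s):
        if latency > timeout_s:
            consecutive += 1
            if consecutive >= failure_threshold:
                return i
        else:
            consecutive = 0
    return None
```

One slow response is harmless on its own. It takes three in a row, with no success in between, to trigger the restart.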

Four things cause this. Most are fixable in under 10 minutes.

Probe timeout too tight — Your app sometimes takes longer than 1 second to respond to /healthz, especially under load or during JVM/GC pauses.
App is slow to start — The probe fires before the app is ready to accept connections. initialDelaySeconds wasn't set, or was set to something optimistic like 5 seconds.
Resource starvation — The container is CPU-throttled. The health endpoint can't respond in time because the process is waiting for CPU cycles.
Health endpoint doing too much — Your /healthz route checks DB connections, runs queries, or calls external services. Fine on a quiet server; fatal under any real load.

Diagnose First

Don't touch the probe config yet. Confirm what's actually failing.

Check events and restart count

# See restart count
kubectl get pods -n your-namespace

# See probe failure events
kubectl describe pod api-server-7d9f4b8c6-xk2pq -n your-namespace | grep -A 20 Events

Check current probe config

kubectl get deployment api-server -n your-namespace -o yaml | grep -A 20 livenessProbe

Test the health endpoint from inside the pod

# Exec into the pod before it restarts
kubectl exec -it api-server-7d9f4b8c6-xk2pq -n your-namespace -- sh

# Hit the endpoint and time it
time wget -qO- http://localhost:8080/healthz

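If the request takes longer than your timeoutSeconds (1 second by default), the timeout is what's failing the probe. If it fails outright (connection refused or a 5xx), either the app isn't listening yet or the health handler itself is broken. If the image has no wget, try curl instead.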
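A single timed request can miss intermittent slowness. If you'd rather sample the endpoint several times, a rough Python sketch like the one below works, assuming a Python interpreter is available wherever you run it. The helper names (fetch_healthz, time_calls, count_over) are invented for this example; the URL matches the one used above.

```python
import time
import urllib.request
from typing import Callable, List


def fetch_healthz(url: str = "http://localhost:8080/healthz") -> None:
    """Issue one GET; raises on connection errors or HTTP error statuses."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()


def time_calls(fetch: Callable[[], None], attempts: int = 20) -> List[float]:
    """Call `fetch` repeatedly and return each call's duration in seconds."""
    durations = []
    for _ in range(attempts):
        start = time.perf_counter()
        fetch()
        durations.append(time.perf_counter() - start)
    return durations


def count_over(durations: List[float], timeout_s: float = 1.0) -> int:
    """Count how many calls took longer than the probe timeout."""
    return sum(1 for d in durations if d > timeout_s)
```

Calling count_over(time_calls(fetch_healthz)) tells you how many of 20 requests would have failed a 1-second probe.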

Check resource pressure

kubectl top pod api-server-7d9f4b8c6-xk2pq -n your-namespace
kubectl top node

CPU near the limit? Throttling is likely slowing every request — including health checks.

Fix 1: Tune the Probe Timing (Most Common Fix)

Kubernetes ships with probe defaults designed for fast, simple apps. A 1-second timeout and 10-second interval are too aggressive for most production services. Adjust your deployment:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30    # Wait 30s after container starts before first probe
  periodSeconds: 15          # Check every 15s (not every 10s)
  timeoutSeconds: 5          # Allow 5s for response (not default 1s)
  failureThreshold: 3        # Still restart after 3 consecutive failures
  successThreshold: 1

Then apply:

kubectl apply -f deployment.yaml

JVM apps often need initialDelaySeconds: 60 or higher — Spring Boot with a full context load can take 45 seconds on a cold start. For apps under heavy load, timeoutSeconds: 5 is a safe starting point; go to 10 if you're still seeing failures.
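Looser timings have a cost: a container that has really hung stays in service longer before it is restarted. The sketch below estimates that window. It is an approximation that ignores kubelet jitter and assumes every failing probe runs for the full timeout; restart_delay_bounds is a name made up for this example.

```python
from typing import Tuple


def restart_delay_bounds(
    period_s: int, timeout_s: int, failure_threshold: int
) -> Tuple[int, int]:
    """Approximate (best, worst) seconds between an app hanging and its restart.

    Best case: the app hangs just before a probe fires.
    Worst case: the app hangs just after a probe succeeded.
    """
    best = (failure_threshold - 1) * period_s + timeout_s
    worst = failure_threshold * period_s + timeout_s
    return best, worst
```

With the defaults (10s period, 1s timeout, threshold 3) this gives roughly 21 to 31 seconds. With the tuned values above (15s, 5s, 3) it gives roughly 35 to 50 seconds.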

Fix 2: Add a Startup Probe (Kubernetes 1.16+)

Slow startup is its own problem. Fighting it with a large initialDelaySeconds is a guess: you're hardcoding a number that will be wrong in CI and wrong on a degraded node. Use a startup probe instead (alpha in 1.16, enabled by default since 1.18). It holds off the liveness and readiness probes until it has succeeded once:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30       # Try for up to 5 minutes (30 * 10s)
  periodSeconds: 10
  timeoutSeconds: 5

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3

The startup probe runs first. The moment it succeeds once, it stops and liveness takes over. No guessing required.
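The startup budget is simple arithmetic: failureThreshold times periodSeconds. Here is a small Python sketch of that calculation, plus one way to pick failureThreshold from a measured startup time. The function names and the 1.5x headroom factor are assumptions for this example, not Kubernetes recommendations.

```python
import math


def startup_budget_s(failure_threshold: int, period_s: int) -> int:
    """Approximate seconds the startup probe allows before the container is restarted."""
    return failure_threshold * period_s


def failure_threshold_for(
    worst_startup_s: float, period_s: int, headroom: float = 1.5
) -> int:
    """Smallest failureThreshold covering the worst observed startup plus headroom.

    `headroom` is an arbitrary safety multiplier chosen for this sketch.
    """
    return math.ceil(worst_startup_s * headroom / period_s)
```

startup_budget_s(30, 10) returns 300, the 5 minutes from the config above. For an app whose worst observed cold start is 45 seconds, failure_threshold_for(45, 10) returns 7.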

Fix 3: Fix the Health Endpoint Itself

A slow /healthz is often the sneakiest culprit. Strip out any logic that doesn't belong there. The liveness probe has one job: confirm the process is alive. Checking database connectivity, running cache pings, or validating external APIs belongs in /readyz behind a readiness probe — not here.

A correct liveness endpoint looks like this:

// Go
http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("ok"))
})

// Node.js
app.get('/healthz', (req, res) => res.status(200).send('ok'));

# Python/FastAPI
@app.get("/healthz")
def healthz():
    return {"status": "ok"}

It should return 200 in under 5 milliseconds. If yours doesn't, strip it down until it does.
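To show the liveness/readiness split side by side, here is a sketch using only Python's standard library. The /readyz path comes from the text above; STATE, dependencies_ready, and make_server are placeholders invented for this example, and a real readiness check would replace the stub with actual database or cache checks.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical shared state; a real app would update this from its own checks.
STATE = {"deps_ready": True}


def dependencies_ready() -> bool:
    """Placeholder for real DB/cache/external-service checks."""
    return STATE["deps_ready"]


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: no I/O, no dependencies; only proves the process can answer.
            self._reply(200, b"ok")
        elif self.path == "/readyz":
            # Readiness: dependency checks belong here, not in /healthz.
            ok = dependencies_ready()
            self._reply(200 if ok else 503, b"ready" if ok else b"not ready")
        else:
            self._reply(404, b"not found")

    def _reply(self, status: int, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep probe traffic out of the logs.


def make_server(port: int = 8080, host: str = "0.0.0.0") -> ThreadingHTTPServer:
    """Build the server; call serve_forever() on the result to run it."""
    return ThreadingHTTPServer((host, port), HealthHandler)
```

When a dependency goes down, /readyz returns 503 and the pod is pulled from traffic, but /healthz keeps returning 200, so Kubernetes does not restart a process that is still alive.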

Fix 4: Increase Resource Limits

CPU throttling is easy to miss because the app looks healthy in every other way. When a container hits its CPU limit, the kernel throttles it — and suddenly a 2ms health check takes 2 seconds. Check whether this is happening:

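kubectl exec api-server-7d9f4b8c6-xk2pq -n your-namespace -- cat /sys/fs/cgroup/cpu/cpu.stat | grep throttled

If throttled_time is non-zero and growing, your app is CPU-starved. (On nodes running cgroup v2, the file is /sys/fs/cgroup/cpu.stat and the counter is throttled_usec.) Raise the limit: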

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1000m"    # Raise this if throttled
    memory: "512Mi"

Start by doubling the CPU limit, redeploy, and recheck throttled_time.
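The raw counter is hard to judge by eye. The ratio of nr_throttled to nr_periods, both of which appear in the same cpu.stat file, is often easier to read. Below is a small Python sketch that parses the file's contents and handles both the cgroup v1 field (throttled_time, in nanoseconds) and the cgroup v2 field (throttled_usec, in microseconds). The function names are invented for this example.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the `key value` lines of a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats: dict) -> float:
    """Fraction of CPU enforcement periods in which the container was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


def throttled_seconds(stats: dict) -> float:
    """Total throttled time in seconds, from whichever field is present."""
    if "throttled_time" in stats:                  # cgroup v1, nanoseconds
        return stats["throttled_time"] / 1e9
    return stats.get("throttled_usec", 0) / 1e6    # cgroup v2, microseconds
```

Feed it the full output of the cat command (without the grep). A fraction that stays high after you raise the limit means the limit is still too low.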

Fix 5: Switch to a TCP or Exec Probe

Sometimes the simplest probe is the right one. TCP just checks if the port is open — no HTTP overhead, no handler code to worry about. Exec runs a command inside the container:

# TCP probe — confirms port is listening
livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  timeoutSeconds: 5

# Exec probe — runs a command inside the container
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 10

The exec pattern works well for apps that write /tmp/healthy on startup and delete it when they want to signal a problem. Coarse, but reliable.
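On the application side, the marker-file pattern might look like this minimal Python sketch. The /tmp/healthy path matches the probe above; mark_healthy and mark_unhealthy are hypothetical helper names, and when to call them is up to your app.

```python
from pathlib import Path

HEALTH_FILE = Path("/tmp/healthy")  # same path the exec probe reads with `cat`


def mark_healthy(path: Path = HEALTH_FILE) -> None:
    """Create the marker file; `cat` then exits 0 and the probe passes."""
    path.write_text("ok\n")


def mark_unhealthy(path: Path = HEALTH_FILE) -> None:
    """Remove the marker file; `cat` then exits non-zero and the probe fails."""
    path.unlink(missing_ok=True)  # missing_ok requires Python 3.8+
```

Call mark_healthy() once startup finishes, and mark_unhealthy() from whatever code detects an unrecoverable state.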

Verify the Fix

# Watch pods stabilize
kubectl get pods -n your-namespace -w

# Confirm restart count stops climbing
kubectl get pods -n your-namespace
# RESTARTS column should stay flat

# Check events are clean
kubectl describe pod <pod-name> -n your-namespace | tail -20
# Should show no Unhealthy warnings

Wait at least 3–5 probe cycles before calling it stable. With periodSeconds: 15, that's 45 to 75 seconds of clean output before you close the laptop.
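If you'd rather not eyeball the RESTARTS column, you can compare two snapshots of kubectl get pods -n your-namespace -o json taken a few probe cycles apart. The Python sketch below does that comparison. The function names are invented for this example, and it only reads the restartCount field under each pod's status.containerStatuses.

```python
import json
from typing import Dict


def restart_counts(pod_list_json: str) -> Dict[str, int]:
    """Map pod name to total container restarts, from `kubectl get pods -o json` output."""
    counts = {}
    for pod in json.loads(pod_list_json).get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        counts[pod["metadata"]["name"]] = sum(
            s.get("restartCount", 0) for s in statuses
        )
    return counts


def pods_still_restarting(
    before: Dict[str, int], after: Dict[str, int]
) -> Dict[str, int]:
    """Return pods whose restart count grew between two snapshots, with the increase."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }
```

An empty result from pods_still_restarting means no pod present in both snapshots restarted in between.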

Prevention

Never ship default probe values to production. The defaults (1s timeout, 10s period) are designed for demos, not real workloads. Tune them per service.
Keep liveness and readiness separate. Liveness = is the process alive. Readiness = is it safe to receive traffic. Mixing them causes cascading restarts when a dependency goes down.
Load test your health endpoint before deploying. Run ab -n 1000 -c 50 http://localhost:8080/healthz and check p99 latency. If it's over 200ms, fix it before Kubernetes finds out.
Set terminationGracePeriodSeconds high enough for in-flight requests to complete. 30 seconds, the Kubernetes default, is a reasonable starting point for most APIs.
Review probe config in code review like any other production setting. Skipping it is how 2 AM pages happen.
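To back up the rule about default probe values during review, a tiny check can flag probe fields that are left unset. The Python sketch below shows one possible policy, not a standard tool: probe_warnings and PROBE_DEFAULTS are names made up for this example, and the input is assumed to be a probe spec already loaded into a dict.

```python
from typing import Any, Dict, List

# Kubernetes defaults that apply when a field is omitted from the probe spec.
PROBE_DEFAULTS = {"timeoutSeconds": 1, "periodSeconds": 10, "failureThreshold": 3}


def probe_warnings(probe: Dict[str, Any]) -> List[str]:
    """Flag probe fields that are unset and would silently fall back to defaults."""
    warnings = []
    for field, default in PROBE_DEFAULTS.items():
        if field not in probe:
            warnings.append(f"{field} not set; defaults to {default}")
    return warnings
```

A probe with only httpGet defined produces three warnings. The tuned probe from Fix 1 produces none.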