Fix Kubernetes Node NotReady: Node node-1 status is now: NodeNotReady

What happened

You checked your cluster and noticed pods stuck in Pending state. Running kubectl get nodes shows one or more nodes with NotReady status. The control plane event log says:

Node node-1 status is now: NodeNotReady

The scheduler stops placing new pods on that node as soon as it is marked NotReady. Pods already running there are evicted once their tolerations for the not-ready and unreachable taints run out, which is 5 minutes (tolerationSeconds: 300) by default. Time to find the actual cause.

Quick triage — what to check first

Take a quick look with kubectl, then SSH into the affected node. The vast majority of root causes live on the node itself, not in the API server.

1. Check node status and conditions

kubectl get nodes
kubectl describe node node-1

Scroll to the Conditions section in the describe output. Five conditions tell you almost everything:

Ready: should be True. False means kubelet is reporting the node as unhealthy. Unknown means the control plane has stopped hearing from kubelet.
MemoryPressure / DiskPressure / PIDPressure: any of these True means resource exhaustion on the node.
NetworkUnavailable: True points to a CNI plugin problem.
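
If you'd rather not scroll through the describe output, the conditions can be printed one per line and filtered. This is a minimal sketch: flag_bad_conditions is a hypothetical helper name, and it assumes input lines of the form Type=Status.

```shell
# Hypothetical helper: reads "Type=Status" lines on stdin and prints only
# the conditions that are in an unhealthy state.
flag_bad_conditions() {
  awk -F= '
    NF == 0                        { next }          # skip blank lines
    $1 == "Ready" && $2 != "True"  { print; next }   # Ready must be True
    $1 != "Ready" && $2 != "False" { print }         # every other condition must be False
  '
}

# Usage against a live cluster (node-1 is the example node from this article):
# kubectl get node node-1 \
#   -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#   | flag_bad_conditions
```

No output means every condition looks healthy.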

2. Check kubelet status on the node

systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager

Kubelet is the agent that reports node health to the control plane. When it crashes or stops, the heartbeats stop too, and the node controller flips the node to NotReady once node-monitor-grace-period expires. The default has long been 40 seconds, and recent releases raise it to 50, so check the kube-controller-manager flags for your version. The journal usually names the failure in its most recent lines.
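
A busy journal buries the interesting lines under info-level noise. The sketch below keeps only error and fatal entries. It assumes the usual klog prefix (E or F followed by a four-digit month and day) and journald's default short output format. kubelet_errors is a hypothetical name.

```shell
# Hypothetical helper: keeps only klog error (E) and fatal (F) lines from
# journalctl output read on stdin. Prints nothing if there are none.
kubelet_errors() {
  grep -E 'kubelet\[[0-9]+\]: [EF][0-9]{4} ' || true
}

# Usage on the node:
# journalctl -u kubelet --since "15 minutes ago" --no-pager | kubelet_errors | tail -n 20
```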

Common causes and fixes

Cause 1: kubelet is stopped or crashed

Nine times out of ten, this is it. The kubelet process died quietly and nothing restarted it.

# Restart kubelet
systemctl restart kubelet
systemctl enable kubelet

# Watch the node recover
kubectl get nodes -w

Still crashing? Follow the logs in real time:

journalctl -u kubelet -f

Three sub-causes show up repeatedly: a misconfigured /var/lib/kubelet/config.yaml, an expired TLS certificate, or a cgroup driver mismatch. That last one trips people up — the node runs cgroupfs but the cluster was configured for systemd, or vice versa.

Check for the mismatch like this:

# Docker runtime
docker info | grep -i cgroup

# containerd runtime
containerd config dump | grep -i cgroup

# What kubelet expects
grep cgroupDriver /var/lib/kubelet/config.yaml

If they differ, update /var/lib/kubelet/config.yaml to match the runtime, then restart kubelet.
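
The comparison can be scripted for containerd nodes. This is a sketch under a few assumptions: containerd reads its config from /etc/containerd/config.toml (the usual default, not stated above), the driver is controlled by the SystemdCgroup setting, and all three function names are hypothetical. An empty kubelet value means cgroupDriver is not set in the file.

```shell
# Hypothetical helpers; file paths are passed in so the check can be dry-run.

# Print the cgroupDriver value from a kubelet config file (quotes stripped).
kubelet_cgroup_driver() {
  awk '$1 == "cgroupDriver:" { gsub(/"/, "", $2); print $2 }' "$1"
}

# Print "systemd" if the containerd config enables SystemdCgroup, else "cgroupfs".
containerd_cgroup_driver() {
  if grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$1"; then
    echo systemd
  else
    echo cgroupfs
  fi
}

# Compare the two and report; returns non-zero on a mismatch.
check_cgroup_match() {
  k=$(kubelet_cgroup_driver "$1")
  c=$(containerd_cgroup_driver "$2")
  if [ "$k" = "$c" ]; then
    echo "match: $k"
  else
    echo "MISMATCH: kubelet=$k containerd=$c"
    return 1
  fi
}

# Usage on the node (containerd path is an assumption):
# check_cgroup_match /var/lib/kubelet/config.yaml /etc/containerd/config.toml
```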

Cause 2: Disk pressure

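Disk fills up faster than most people expect on busy nodes. Once free space drops below a hard eviction threshold, kubelet sets DiskPressure=True, taints the node so no new pods are scheduled on it, and starts evicting pods. The defaults are nodefs.available<10% and imagefs.available<15%, which works out to roughly 90% and 85% usage.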

# On the node
df -h
du -sh /var/lib/docker/*      # Docker runtime
du -sh /var/lib/containerd/   # containerd runtime

# Look for eviction messages
journalctl -u kubelet | grep -i evict
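
To see at a glance which filesystems are close to the limit, the df output can be filtered by usage. A small sketch: mounts_over is a hypothetical helper, the default of 85 mirrors the threshold discussed above, and it expects POSIX-format output from df -P.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints every mount
# whose usage is at or above the given percentage (default 85).
mounts_over() {
  awk -v limit="${1:-85}" 'NR > 1 {
    use = $5; sub(/%/, "", use)               # "90%" -> "90"
    if (use + 0 >= limit + 0) print $6, use "%"
  }'
}

# Usage on the node:
# df -P | mounts_over 80
```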

Container images and old logs are usually the culprits. Clean them up fast:

# Docker
docker system prune -af

# containerd
crictl rmi --prune

# Trim old journal logs
journalctl --vacuum-time=3d

Cause 3: Memory pressure or OOM kill

# Check for recent OOM kills
dmesg | grep -i oom
grep -i available /proc/meminfo

The kernel OOM killer acts fast. It usually goes after containers first, but under severe pressure it can terminate kubelet or the container runtime as well. Restarting kubelet gets the node back, but without memory limits on your pods the same thing is likely to happen again. Find what's consuming memory and set resources.limits.memory on those pods.
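
To find out which processes the OOM killer actually took out, the kernel log can be reduced to names and PIDs. This sketch assumes the common "Killed process <pid> (<name>)" wording, which varies slightly between kernel versions. oom_victims is a hypothetical name.

```shell
# Hypothetical helper: reads kernel log lines on stdin and prints
# "<process-name> <pid>" for every OOM kill it finds.
oom_victims() {
  sed -nE 's/.*Killed process ([0-9]+) \(([^)]+)\).*/\2 \1/p'
}

# Usage on the node, counting kills per process name:
# dmesg | oom_victims | awk '{ print $1 }' | sort | uniq -c | sort -rn
```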

Cause 4: CNI plugin failure (NetworkUnavailable)

A NetworkUnavailable=True condition points straight to the CNI layer — Flannel, Calico, Cilium, Weave, or whatever you're running. The plugin crashed or lost its state on that specific node.

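# Find the CNI pod running on the broken node
# (some installs use their own namespace, e.g. kube-flannel or calico-system)
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=node-1 | grep -E 'flannel|calico|cilium|weave'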

# Read its logs
kubectl logs -n kube-system <cni-pod-on-node-1>

# Verify the CNI binary and config exist on the node
ls /opt/cni/bin/
ls /etc/cni/net.d/

Delete the CNI pod so its DaemonSet recreates it on the same node. Re-initialization usually clears the problem:

kubectl delete pod -n kube-system <cni-pod-on-node-1>
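
If the pod comes back and the node still reports NetworkUnavailable, check whether the plugin ever wrote its network config. The sketch below assumes configs use one of the extensions CNI loads (.conf, .conflist, .json). cni_config_present is a hypothetical helper.

```shell
# Hypothetical helper: succeeds if the directory holds at least one CNI
# network config file. Defaults to the standard /etc/cni/net.d location.
cni_config_present() {
  dir="${1:-/etc/cni/net.d}"
  for f in "$dir"/*.conf "$dir"/*.conflist "$dir"/*.json; do
    [ -f "$f" ] && return 0    # an unmatched glob stays literal and fails -f
  done
  return 1
}

# Usage on the node:
# cni_config_present || echo "no CNI config found in /etc/cni/net.d"
```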

Cause 5: Clock skew or certificate expiry

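TLS authentication breaks when clocks drift. If the node's clock falls outside a certificate's validity window, or its tokens look expired or not yet valid to the API server, kubelet's requests are rejected. With short-lived credentials, a few minutes of skew can be enough.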

# Compare time on node vs. what you expect
date
timedatectl status

# Re-sync immediately
chronyc makestep        # chrony users
ntpdate -u pool.ntp.org # ntp users
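
For chrony users, the current offset is easier to judge as a single number. This sketch pulls it from the "System time" line that chronyc tracking prints. chrony_offset is a hypothetical helper name.

```shell
# Hypothetical helper: reads `chronyc tracking` output on stdin and prints
# the clock offset in seconds plus its direction ("fast" or "slow").
chrony_offset() {
  awk -F': ' '/^System time/ { split($2, a, " "); print a[1], a[3] }'
}

# Usage on the node:
# chronyc tracking | chrony_offset
```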

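Certificate expiry is a different animal: it's calendar-driven. kubeadm-issued client and serving certificates expire after one year, while the CA certificates last ten. Kubelet's own client certificate normally rotates automatically and is not renewed by kubeadm certs renew, so check it separately on the node under /var/lib/kubelet/pki/. Check and renew the control plane certificates like this:

# See expiry dates for all kubeadm-managed certs
kubeadm certs check-expiration

# Renew everything at once
kubeadm certs renew all

# Restart kubelet, then restart the control plane static pods so they load the new certs
systemctl restart kubelet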
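
To check a single certificate file on a node, the expiry date from openssl can be turned into a day count. A sketch with assumptions: it relies on GNU date's -d option, days_until is a hypothetical helper, and the path in the usage note is the usual kubeadm location for kubelet's client certificate.

```shell
# Hypothetical helper: whole days between a reference epoch (default: now)
# and an openssl-style notAfter timestamp. Requires GNU date.
days_until() {
  end=$(date -u -d "$1" +%s) || return 1
  now="${2:-$(date -u +%s)}"
  echo $(( (end - now) / 86400 ))
}

# Usage on a worker node (path is an assumption):
# not_after=$(openssl x509 -noout -enddate \
#   -in /var/lib/kubelet/pki/kubelet-client-current.pem | cut -d= -f2)
# days_until "$not_after"
```

A negative result means the certificate has already expired.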

Cause 6: Node unreachable from control plane

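Node readiness depends on kubelet reaching the API server (port 6443 by default) to post its heartbeats. The API server in turn talks to kubelet on port 10250 for logs, exec, and metrics. A firewall rule change, a misconfigured security group (common after cloud infrastructure updates), or the node simply going offline can break either path and leave the node NotReady or only half-working.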

# From the control plane node, test directly (an HTTP 401 still proves the port is reachable)
curl -k https://<node-ip>:10250/healthz

# Check iptables on the node
iptables -L -n | grep 10250

# firewalld alternative
firewall-cmd --list-all
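
The curl probe is easier to read if you capture only the HTTP status code. This sketch maps the code to a verdict. classify_kubelet_probe is a hypothetical helper, and it relies on curl reporting 000 when it gets no HTTP response at all.

```shell
# Hypothetical helper: interprets the HTTP status code returned by a probe
# of the kubelet port.
classify_kubelet_probe() {
  case "$1" in
    200)     echo "reachable: kubelet answered healthz" ;;
    401|403) echo "reachable: kubelet is up but rejected the unauthenticated request" ;;
    000)     echo "unreachable: no HTTP response (firewall, routing, or kubelet down)" ;;
    *)       echo "reachable: unexpected HTTP status $1" ;;
  esac
}

# Usage from a control plane node:
# code=$(curl -sk -o /dev/null --max-time 5 -w '%{http_code}' "https://<node-ip>:10250/healthz")
# classify_kubelet_probe "$code"
```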

Verify the node is back

# Node should show Ready
kubectl get nodes

# Expected:
# NAME     STATUS   ROLES    AGE   VERSION
# node-1   Ready    <none>   5d    v1.28.0

# Confirm conditions are clean
kubectl describe node node-1 | grep -A 10 Conditions

# Run a quick scheduling test (check the NODE column; the pod may land on another node)
kubectl run test-pod --image=nginx --restart=Never
kubectl get pod test-pod -o wide

# Clean up
kubectl delete pod test-pod

Was the node cordoned during the incident? Don't forget this step:

kubectl uncordon node-1
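
On larger clusters it's easy to miss a node that is still NotReady or still cordoned. The two filters below are a sketch with hypothetical names. Both read the STATUS column of kubectl get nodes --no-headers.

```shell
# Hypothetical helpers: read `kubectl get nodes --no-headers` output on stdin.

# Nodes whose STATUS does not start with "Ready" (NotReady, Unknown, ...).
not_ready_nodes() { awk '$2 !~ /^Ready/ { print $1 }'; }

# Nodes that are still cordoned (STATUS contains SchedulingDisabled).
cordoned_nodes() { awk '$2 ~ /SchedulingDisabled/ { print $1 }'; }

# Usage:
# kubectl get nodes --no-headers | not_ready_nodes
# kubectl get nodes --no-headers | cordoned_nodes
```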

Cordon and drain before maintenance

Doing kernel upgrades or disk cleanup? Always drain the node first — evicting pods properly beats having them killed mid-request.

# Stop new pods from landing here
kubectl cordon node-1

# Evict existing pods gracefully
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# Do your maintenance...

# Bring the node back into rotation
kubectl uncordon node-1
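
The same sequence can be wrapped in a function so the uncordon step is never forgotten. This is a minimal sketch, not a hardened tool: with_node_drained and the KUBECTL override are hypothetical, and the 300-second drain timeout is an arbitrary choice. If the drain itself fails, the node is left cordoned so you can investigate.

```shell
# Hypothetical wrapper: cordon and drain a node, run a maintenance command,
# then uncordon. Set KUBECTL=echo to preview the kubectl calls.
with_node_drained() {
  kc="${KUBECTL:-kubectl}"
  node="$1"; shift

  "$kc" cordon "$node" || return 1
  "$kc" drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s || return 1

  "$@"                       # the maintenance command
  status=$?

  "$kc" uncordon "$node"     # runs whether maintenance succeeded or not
  return "$status"
}

# Usage:
# with_node_drained node-1 ssh node-1 'sudo journalctl --vacuum-time=3d'
```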

What to do differently next time

Alert on disk at 70%, not 85%. By the time kubelet starts evicting, at roughly 85% to 90% usage with default thresholds, you're already in a bad spot. A 70% alert gives you at least a 15-point runway to clean up without a production incident. DiskPressure catches more people off guard than any other cause.
Start with kubelet logs, not the API server. The journal almost always names the exact failure within the first 10 lines. Go there first.
Clock sync is not optional. VMs drift after live migration in cloud environments — sometimes by minutes. Run chrony or ntpd and treat it as infrastructure, not a nice-to-have.
Deploy node-problem-detector. The node-problem-detector DaemonSet surfaces kernel and runtime failures as node conditions and events before they escalate to NotReady. It's a quick install with Helm: add the deliveryhero chart repository, then run helm install node-problem-detector deliveryhero/node-problem-detector.
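Track certificate expiry on your ops calendar. kubeadm client and serving certificates expire one year after they were issued or last renewed. Set a renewal reminder at 11 months, or rely on regular kubeadm upgrades, which renew them automatically. Treating expiry as a surprise is avoidable.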