The 2 AM Pager Call
Your phone screams at 2 AM. A production node is dead. You can't SSH in, and the web console is a wall of frozen text. Look closely at the bottom of the log, and you'll likely spot this culprit:
kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
Essentially, a CPU core spent too long in kernel mode without giving anything else a turn. By default, Linux loses patience after 20 seconds (twice the default kernel.watchdog_thresh of 10). If the core hasn't yielded by then, the watchdog barks. On a single-core VPS, this is a total system freeze. On larger machines, it shows up as massive latency spikes, and the whole box can eventually choke.
TL;DR: The Emergency Patch
If you can still type commands, give the kernel some breathing room. Raising the watchdog threshold won't fix the root bug, but it quiets the warnings, and on systems where kernel.softlockup_panic=1 it stops each lockup from turning into a panic and reboot. That buys you time to investigate.
# See the current watchdog threshold (default 10; the soft-lockup limit is twice this, so 20 seconds)
cat /proc/sys/kernel/watchdog_thresh
# Raise the soft-lockup limit to 60 seconds immediately
sudo sysctl -w kernel.watchdog_thresh=30
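To regain control of a severely stuttering system, you might need to silence the watchdog temporarily. The first command stops every CPU from dumping a backtrace on each report. The second disables the lockup detector altogether, so set it back to a non-zero value once the system is stable: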
sudo sysctl -w kernel.softlockup_all_cpu_backtrace=0
sudo sysctl -w kernel.watchdog_thresh=0
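For reference, on current kernels the soft-lockup limit is derived rather than set directly: it is twice kernel.watchdog_thresh, and a value of 0 turns the detector off. A minimal sketch of that arithmetic follows. The helper name soft_lockup_window is made up for illustration.

```shell
# Hypothetical helper: print the effective soft-lockup window
# for a given kernel.watchdog_thresh value.
soft_lockup_window() {
    thresh="$1"
    if [ "$thresh" -eq 0 ]; then
        # watchdog_thresh=0 disables the lockup detector
        echo "disabled"
    else
        # the soft-lockup limit is 2 * watchdog_thresh
        echo "$((thresh * 2))s"
    fi
}

# Usage on a live system:
# soft_lockup_window "$(cat /proc/sys/kernel/watchdog_thresh)"
```

Checking the value this way before and after a change makes it harder to leave the detector switched off by accident.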
What's actually trapping your CPU?
A soft lockup happens when the kernel spends too long in a 'critical section' without letting the scheduler work. The usual suspects include:
- NFS/Network Latency: A remote mount hangs, and the kernel blocks while waiting for a response.
- Cloud Oversubscription: On AWS or Azure, 'Noisy Neighbors' can steal your physical CPU cycles, making your VM feel like it's frozen.
- Heavy Interrupt Loads: A malfunctioning NIC flooding the system with 50,000+ interrupts per second.
- Micro-Burst Traffic: A 10x surge in database queries that overwhelms the I/O stack instantly.
Step 1: Digging through the wreckage
The kernel usually dumps a stack trace to the kernel log (readable with dmesg) the moment the lockup is detected. You need to identify the specific task that was running on the stuck CPU when the timer expired.
dmesg | grep -C 5 "soft lockup"
Find the line that looks like CPU: 0 PID: 1234 Comm: .... If the Comm value is java or python, your application code is likely triggering heavy syscalls. If it's a kworker thread, suspect a disk I/O or driver bottleneck first.
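When the log holds many traces, tallying the Comm values can be quicker than reading each one. The sketch below assumes the usual "PID: <n> Comm: <name>" header format and a Comm without spaces. The function name lockup_tasks is illustrative, and it matches every stack-dump header in the input, not only soft-lockup ones.

```shell
# Hypothetical helper: read kernel log text on stdin and print "pid comm"
# for every stack-dump header line of the form "... PID: 1234 Comm: java ...".
lockup_tasks() {
    sed -n 's/.*PID: \([0-9][0-9]*\) Comm: \([^ ]*\).*/\1 \2/p'
}

# Usage: count how often each task shows up, most frequent first.
# dmesg | lockup_tasks | sort | uniq -c | sort -rn
```

A task that dominates the tally is a reasonable first suspect, though the trace itself still needs reading to see which kernel path it was stuck in.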
Step 2: Making the Fix Permanent
If your workload naturally has high-intensity bursts, you can raise the limit permanently. Edit /etc/sysctl.conf to keep the setting after a reboot:
# Raise the soft-lockup limit to 60 seconds (2 x watchdog_thresh) to ride out heavy load spikes
kernel.watchdog_thresh = 30
Apply the config change:
sudo sysctl -p
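After reloading, it is worth confirming that the running value matches the file, since another config file or a tuning daemon may set the same key. A rough sketch, assuming a plain "key = value" file. The function name conf_value is hypothetical.

```shell
# Hypothetical helper: print the last value assigned to a key
# in a sysctl-style "key = value" file.
conf_value() {
    key="$1"; file="$2"
    awk -F= -v k="$key" '
        /^[[:space:]]*[#;]/ { next }   # skip comment lines
        {
            name = $1; gsub(/[[:space:]]/, "", name)
            if (name == k) { val = $2; gsub(/[[:space:]]/, "", val); found = val }
        }
        END { if (found != "") print found }
    ' "$file"
}

# Usage: compare the file against the running kernel.
# want=$(conf_value kernel.watchdog_thresh /etc/sysctl.conf)
# have=$(sysctl -n kernel.watchdog_thresh)
# [ "$want" = "$have" ] && echo "in sync" || echo "file says $want, kernel has $have"
```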
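Step 3: Checking for Virtualization Lag
On virtual machines, the clocksource matters. An unstable guest clock, or a vCPU that the hypervisor leaves unscheduled for a long stretch, can make the watchdog report lockups that the guest itself did not cause. Check your current source: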
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
If you're on a KVM guest and see tsc, try switching to kvm-clock if it is listed in available_clocksource in the same directory. Also, run top and watch the st (steal time) value in the %Cpu(s) summary line. If steal time consistently sits at 5% or higher, the host is probably oversubscribed. You'll need to migrate the VM to a different host or upgrade the instance type.
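To log steal time without watching top, the same figure can be derived from two samples of the aggregate cpu line in /proc/stat, where steal is the eighth counter after the cpu label. This is a sketch under that assumption. The name steal_pct is made up here.

```shell
# Hypothetical helper: given two "cpu ..." lines from /proc/stat
# (earlier sample, later sample), print steal time as a percentage
# of all CPU time elapsed between them.
steal_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user through steal; guest time is already inside user
            t[NR] = total; s[NR] = $9              # field 9 is the steal counter
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (s[2] - s[1]) / dt
        }'
}

# Usage: sample five seconds apart.
# a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat); steal_pct "$a" "$b"
```

Running this from cron every minute gives a history to show the provider if the 5% mark keeps being crossed.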
Step 4: Spotting Interrupt Storms
Sometimes buggy hardware floods the CPU with requests. Use cat /proc/interrupts to see if one specific IRQ is skyrocketing:
watch -n 1 "cat /proc/interrupts"
Watch the numbers. If one IRQ jumps by tens of thousands in a few seconds while the system stutters, and that rate is out of line with the device's normal traffic, you've likely found a hardware or driver fault.
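Eyeballing a wide /proc/interrupts table is error-prone on many-core machines, so here is one way to diff two snapshots instead. It assumes the standard layout (a header row, then an IRQ label followed by per-CPU counts). The function names irq_totals and irq_deltas are illustrative.

```shell
# Hypothetical helper: sum the per-CPU counts for each IRQ in a saved
# /proc/interrupts snapshot and print "irq total".
irq_totals() {
    awk 'NR > 1 {
        irq = $1; sub(/:$/, "", irq)
        sum = 0
        # per-CPU counters are the leading all-digit fields after the label
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) sum += $i
        print irq, sum
    }' "$1"
}

# Hypothetical helper: print "delta irq" for two snapshot files,
# largest increase first.
irq_deltas() {
    { irq_totals "$1" | sed 's/^/a /'; irq_totals "$2" | sed 's/^/b /'; } |
    awk '$1 == "a" { before[$2] = $3 }
         $1 == "b" { print $3 - before[$2], $2 }' |
    sort -rn
}

# Usage: snapshot, wait, snapshot, compare.
# cat /proc/interrupts > /tmp/irq.a; sleep 5; cat /proc/interrupts > /tmp/irq.b
# irq_deltas /tmp/irq.a /tmp/irq.b | head
```

The IRQ at the top of the list can then be matched to its device name in the last column of /proc/interrupts.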
Verification: Stress Testing the Fix
After tuning, monitor the logs for 24 hours. You can simulate a heavy workload using the stress tool to see if the new thresholds hold up under pressure.
# Install the stress utility (Debian/Ubuntu; use your distro's package manager elsewhere)
sudo apt install stress
# Warning: Don't run this on live production!
# Run 4 CPU-bound workers for 60 seconds
stress --cpu 4 --timeout 60
If dmesg remains silent during this test, your new threshold is adequate. If the error returns, stop tuning sysctl and start auditing your I/O subsystem with iostat -xz 1 to find out why the disks are choking.
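For the 24-hour watch, a one-line summary of the kernel log is easier to compare between runs than raw output. The sketch below assumes the report format shown at the top of this article ("soft lockup - CPU#N stuck for NNs"). The name lockup_summary is hypothetical.

```shell
# Hypothetical helper: read kernel log text on stdin and print how many
# soft-lockup reports it contains and the longest stall among them.
lockup_summary() {
    sed -n 's/.*soft lockup - CPU#[0-9]* stuck for \([0-9][0-9]*\)s.*/\1/p' |
    awk '{ n++; if ($1 > max) max = $1 }
         END { printf "%d reports, longest %ds\n", n, max }'
}

# Usage:
# dmesg | lockup_summary
```

A count that stays at zero after the test suggests the threshold is holding. A longest stall that creeps upward points back at the workload or hardware.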

