The Error
You open the EC2 console and see this:
Instance reachability check failed (1/2 status checks failed)
The instance shows a green "running" dot. But SSH times out. Your app stops responding. AWS is telling you the instance is alive but unreachable, like a computer that's powered on but won't respond to keyboard input.
The system status check passes, so the AWS hardware is fine. The instance-level check is what failed. That's on you to fix.
What Each Status Check Means
- System status check (2/2): AWS hardware, networking, power. AWS handles this if it fails.
- Instance status check (1/2 failed): the OS and software running inside your instance. Your problem.
The culprit is almost always inside the OS: a bad kernel, corrupted filesystem, misconfigured network stack, or a boot failure that left the instance half-started.
Root Causes
- Kernel panic or failed boot: often triggered by a yum/apt upgrade that swapped the kernel
- Corrupted root filesystem: bad /etc/fstab entry, disk full at shutdown, or a forced stop mid-write
- Misconfigured network interface: DHCP disabled, wrong MTU, missing default route
- OOM killer fired: memory exhausted and critical system processes were terminated
- SSH daemon crashed: instance is otherwise alive but you can't get in
- Security group or NACL blocking traffic (rare; these rules do not fail the status check itself, but they cause the same SSH timeouts, so rule them out)
Step 1: Read the System Log First
Start here, before touching anything else. The EC2 system log captures serial console output from the last boot, and it usually tells you exactly what went wrong.
Via console:
EC2 Console → select instance → Actions → Monitor and troubleshoot → Get system log
Via CLI:
aws ec2 get-console-output \
--instance-id i-0abc1234567890def \
--region us-east-1 \
--output text
Look for these patterns:
- "Kernel panic - not syncing" (kernel or initrd issue)
- "UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY" (filesystem corruption)
- "ERROR: device not found" in fstab context (bad mount entry)
- "Out of memory: Kill process" (OOM killer fired)
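The pattern check above can be scripted so you don't have to scan the log by eye. The following is a minimal sketch, not an official tool: the function name `classify_console_log` and the labels it prints are made up for illustration, and the pattern list is just the four strings listed above. It matches `Out of memory: Kill` as a prefix on the assumption that some kernels print "Killed process" instead of "Kill process".

```shell
# classify_console_log: read console output on stdin and print one label per
# known failure pattern found. Labels are hypothetical names for this sketch.
classify_console_log() {
  log=$(cat)
  found=0
  case $log in
    *'Kernel panic - not syncing'*) echo kernel-panic; found=1 ;;
  esac
  case $log in
    *'UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY'*) echo filesystem-corruption; found=1 ;;
  esac
  case $log in
    *'ERROR: device not found'*) echo bad-mount-entry; found=1 ;;
  esac
  case $log in
    *'Out of memory: Kill'*) echo oom-killer; found=1 ;;
  esac
  # Nothing matched: say so explicitly rather than printing nothing.
  [ "$found" -eq 1 ] || echo no-known-pattern
}
```

To use it, pipe the CLI output from the command above into the function, for example by appending `| classify_console_log` to the `aws ec2 get-console-output` call. A `no-known-pattern` result only means none of these four strings appeared, so still read the log itself.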
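Fix 1: Stop and Start the Instance (Not Reboot)
A stop followed by a start typically moves the instance to different underlying hardware, while a reboot stays on the same host. Try this first: it resolves transient hardware or hypervisor issues without any OS-level work. Two caveats: data on instance store volumes is lost when the instance stops, and an auto-assigned public IPv4 address changes unless you use an Elastic IP.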
aws ec2 stop-instances --instance-ids i-0abc1234567890def
aws ec2 wait instance-stopped --instance-ids i-0abc1234567890def
aws ec2 start-instances --instance-ids i-0abc1234567890def
Wait 2-3 minutes, then check status:
aws ec2 describe-instance-status \
--instance-ids i-0abc1234567890def \
--query 'InstanceStatuses[0].InstanceStatus.Status'
Returns ok? Done. Still failing? Move to the next fix.
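Instead of waiting a fixed few minutes and checking once, you can poll until the check reports ok. The sketch below is one way to do that; `wait_until_ok` is a hypothetical helper name, and it assumes the command you pass prints a bare status word such as `ok` or `impaired` (which is what the query above prints when you add `--output text`).

```shell
# wait_until_ok MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it prints exactly "ok" or the tries run out.
wait_until_ok() {
  max_tries=$1
  delay=$2
  shift 2
  try=1
  status=unknown
  while [ "$try" -le "$max_tries" ]; do
    status=$("$@" 2>/dev/null)
    if [ "$status" = "ok" ]; then
      echo "ok after $try check(s)"
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$try" -lt "$max_tries" ]; then
      sleep "$delay"
    fi
    try=$((try + 1))
  done
  echo "still '$status' after $max_tries check(s)"
  return 1
}
```

A call might look like `wait_until_ok 10 30 aws ec2 describe-instance-status --instance-ids i-0abc1234567890def --query 'InstanceStatuses[0].InstanceStatus.Status' --output text`. The AWS CLI also ships a built-in waiter, `aws ec2 wait instance-status-ok --instance-ids ...`, which is simpler if you don't need custom retry counts or messages.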
Fix 2: Repair via EC2 Serial Console
SSH is down but the instance is still booting? The EC2 Serial Console connects you to the instance's serial port, so it needs neither the network nor SSH.
EC2 Console → Instance → Actions → Monitor and troubleshoot → EC2 Serial Console → Connect
Two catches: this only works on Nitro-based instances (t3, m5, c5, r5, and most current-gen types), and to log in at the serial prompt you need an OS user with a password set, since SSH keys don't apply there. Enable it for your account first:
aws ec2 enable-serial-console-access --region us-east-1
Once connected, check for the usual suspects:
# Check disk space
df -h
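# Check for filesystem errors read-only (-n); never repair a mounted filesystem
# On Nitro instances the root device is usually /dev/nvme0n1p1 (confirm with lsblk)
sudo fsck -n /dev/nvme0n1p1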
# Check /etc/fstab for bad entries
cat /etc/fstab
# Comment out any suspicious mount to test boot
sudo nano /etc/fstab
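A full disk is easy to miss in a long `df` listing. As a small sketch, the hypothetical helper below reads `df -P` output and prints only the mount points at or above a usage threshold. It assumes the POSIX column layout of `df -P` (capacity in column 5, mount point in column 6) and mount points without spaces.

```shell
# full_filesystems THRESHOLD_PERCENT
# Reads `df -P` output on stdin; prints "<mount point> <capacity>" for every
# filesystem whose usage is at or above the threshold.
full_filesystems() {
  awk -v limit="$1" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)                  # "97%" -> "97"
      if (use + 0 >= limit + 0) print $6 " " $5
    }
  '
}
```

On the instance you would run `df -P | full_filesystems 90`. No output means nothing is above 90 percent.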
Fix 3: Detach Root Volume and Repair from Another Instance
Instance completely dead and serial console isn't an option? Detach the root volume, mount it on a working instance, and fix it from outside. Sounds involved, but it's reliable, and you won't lose data.
1. Stop the broken instance (not terminate).
2. Detach its root volume:
aws ec2 detach-volume --volume-id vol-0abc123def456
3. Attach it to a rescue instance in the same Availability Zone as a secondary volume (e.g., /dev/xvdf):
aws ec2 attach-volume \
--volume-id vol-0abc123def456 \
--instance-id i-0rescue123456 \
--device /dev/xvdf
4. SSH into the rescue instance and mount the volume:
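# Device names vary: on Nitro rescue instances the volume may appear as /dev/nvme1n1p1 (check lsblk)
# Run fsck first, while the partition is still unmounted
sudo fsck -y /dev/xvdf1
# Then mount it
sudo mkdir -p /mnt/recovery
sudo mount /dev/xvdf1 /mnt/recovery
# Fix /etc/fstab
sudo nano /mnt/recovery/etc/fstab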
# Find what's eating disk space
df -h /mnt/recovery
sudo du -sh /mnt/recovery/var/log/* | sort -rh | head -20
5. Unmount, detach, reattach to the original instance under its original root device name (often /dev/xvda or /dev/sda1; check the instance's RootDeviceName), then start it.
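The first three steps above can be strung together so the waits between them aren't forgotten. This is a sketch under assumptions, not a tested runbook: `run` and `move_root_to_rescue` are hypothetical names, the IDs are the placeholders used above, and it defaults to a dry run that only prints the commands. Set `DRY_RUN=0` to actually execute them.

```shell
# run: print the command when DRY_RUN is 1 (the default here), else execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# move_root_to_rescue BROKEN_INSTANCE RESCUE_INSTANCE ROOT_VOLUME
# Stops the broken instance, detaches its root volume, and attaches the
# volume to the rescue instance as /dev/xvdf, waiting between each step.
move_root_to_rescue() {
  broken=$1
  rescue=$2
  volume=$3
  run aws ec2 stop-instances --instance-ids "$broken"
  run aws ec2 wait instance-stopped --instance-ids "$broken"
  run aws ec2 detach-volume --volume-id "$volume"
  run aws ec2 wait volume-available --volume-ids "$volume"
  run aws ec2 attach-volume --volume-id "$volume" --instance-id "$rescue" --device /dev/xvdf
  run aws ec2 wait volume-in-use --volume-ids "$volume"
}
```

Calling `move_root_to_rescue i-0abc1234567890def i-0rescue123456 vol-0abc123def456` prints the six commands in order. Reviewing that output before running with `DRY_RUN=0` is the point of the wrapper, since detaching the wrong volume is an easy mistake to make under pressure.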
Fix 4: Bad fstab Entry (Most Common After Adding Mounts)
Here's a classic trap. Add an EFS mount or extra EBS volume to /etc/fstab, skip the nofail option, and the next time that resource is unavailable the instance hangs at boot. The instance is effectively bricked over one missing flag.
Always use this format:
# EBS volume: nofail prevents boot hang if volume detaches
UUID=xxxx-xxxx /data ext4 defaults,nofail 0 2
# EFS mount: nofail + _netdev (wait for network before mounting)
fs-12345.efs.us-east-1.amazonaws.com:/ /mnt/efs efs defaults,_netdev,nofail 0 0
The _netdev flag tells systemd to wait until the network is up before attempting the mount. Both flags together make network mounts safe to put in fstab.
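A quick lint can catch a missing nofail before the next reboot does. The sketch below uses a hypothetical function name, `fstab_missing_nofail`, and makes a simplifying assumption: every entry other than the root filesystem and swap should carry nofail. It prints the mount point of each entry that lacks it.

```shell
# fstab_missing_nofail: read fstab content on stdin and print the mount point
# of every non-root, non-swap entry whose options do not include nofail.
fstab_missing_nofail() {
  awk '
    /^[[:space:]]*#/ || NF < 4 { next }    # skip comments, blank and short lines
    $2 == "/" || $3 == "swap"  { next }    # root and swap are out of scope
    {
      n = split($4, opts, ",")
      has_nofail = 0
      for (i = 1; i <= n; i++) if (opts[i] == "nofail") has_nofail = 1
      if (!has_nofail) print $2
    }
  '
}
```

Run it as `fstab_missing_nofail < /etc/fstab`, or against `/mnt/recovery/etc/fstab` when repairing a detached volume. Empty output means every checked entry has the flag. Entries such as /boot will also be reported, so treat the output as a list to review, not a list to change blindly.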
Fix 5: OOM / Memory Issues
Spot lines like Out of memory: Kill process 1234 (mysqld) in the log? The kernel ran out of RAM and started terminating processes, probably critical ones. Two options:
- Resize to a larger instance type (stop → change instance type → start)
- Add swap to buy time, then address the root cause properly
# Add 2GB swap after recovery
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
Swap isn't a substitute for enough RAM: under load it'll slow the instance significantly. Treat it as a buffer while you plan a proper resize.
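Before resizing, it helps to know which processes the OOM killer actually hit. The sketch below counts victims by process name from a saved log. `oom_victims` is a hypothetical name, and the pattern assumes kernel lines shaped like the example above, accepting both "Kill process" and the "Killed process" wording.

```shell
# oom_victims: read a kernel or console log on stdin and print
# "<count> <process name>" for each OOM-killed process, most frequent first.
oom_victims() {
  sed -n 's/.*Out of memory: Kill[a-z]* process [0-9][0-9]* (\([^)]*\)).*/\1/p' |
    awk '{ count[$0]++ } END { for (name in count) print count[name], name }' |
    sort -rn
}
```

For example, `sudo dmesg | oom_victims` on the recovered instance, or pipe in the console output saved earlier. If one process dominates the list, its memory settings are usually a better first target than the instance size.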
Verification
After the fix, confirm both checks pass:
aws ec2 describe-instance-status \
--instance-ids i-0abc1234567890def \
--query 'InstanceStatuses[0].{System:SystemStatus.Status,Instance:InstanceStatus.Status}'
Expected output:
{
"System": "ok",
"Instance": "ok"
}
Then verify SSH access and confirm your application is actually running, not just the instance.
Prevention
- Make nofail a team rule for every non-root volume and network mount in /etc/fstab, no exceptions
- Set a CloudWatch alarm on StatusCheckFailed_Instance so you know before your users do:
aws cloudwatch put-metric-alarm \
--alarm-name ec2-instance-check-i-0abc123 \
--namespace AWS/EC2 \
--metric-name StatusCheckFailed_Instance \
--dimensions Name=InstanceId,Value=i-0abc1234567890def \
--statistic Maximum \
--period 60 \
--evaluation-periods 2 \
--threshold 1 \
--comparison-operator GreaterThanOrEqualToThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789012:your-alert-topic
- Enable EC2 Auto Recovery for system-level failures. It automatically migrates the instance to healthy hardware when AWS-side hardware fails:
aws cloudwatch put-metric-alarm \
--alarm-name ec2-auto-recover-i-0abc123 \
--namespace AWS/EC2 \
--metric-name StatusCheckFailed_System \
--dimensions Name=InstanceId,Value=i-0abc1234567890def \
--statistic Minimum \
--period 60 \
--evaluation-periods 2 \
--threshold 1 \
--comparison-operator GreaterThanOrEqualToThreshold \
--alarm-actions arn:aws:automate:us-east-1:ec2:recover
- Snapshot your root EBS volume on a schedule, weekly at minimum. A snapshot from before a bad kernel upgrade cuts recovery from hours to minutes
- Test kernel upgrades in staging first. Many reachability failures show up on the first boot after a package update that swapped the running kernel

