Fix EC2 "Instance reachability check failed" (1/2 Status Checks Failed)

The Error

You open the EC2 console and see this:

Instance reachability check failed (1/2 status checks failed)

The instance shows a green "running" dot. But SSH times out. Your app stops responding. AWS is telling you the instance is alive but unreachable — like a computer that's powered on but won't respond to keyboard input.

The system status check passes — AWS hardware is fine. The instance-level check is what failed. That's on you to fix.

What Each Status Check Means

System status check (passed in this case): the underlying AWS host, its networking, and power. If this one fails, the fix is on the AWS side.
Instance status check (failed in this case): the OS and software running inside your instance. Your problem.

The culprit is almost always inside the OS: a bad kernel, corrupted filesystem, misconfigured network stack, or a boot failure that left the instance half-started.

Root Causes

Kernel panic or failed boot — often triggered by a yum/apt upgrade that swapped the kernel
Corrupted root filesystem — bad /etc/fstab entry, disk full at shutdown, or a forced stop mid-write
Misconfigured network interface — DHCP disabled, wrong MTU, missing default route
OOM killer fired — memory exhausted and critical system processes were terminated
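SSH daemon crashed or misconfigured (this alone does not fail the status check, but it often accompanies a half-finished boot)
Security group or NACL rules (these can block SSH but do not affect the status check itself, so suspect them only when the check passes and SSH still times out)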

Step 1 — Read the System Log First

Start here, before touching anything else. The EC2 system log captures serial console output from the last boot, and it usually tells you exactly what went wrong.

Via console:

EC2 Console → select instance → Actions → Monitor and troubleshoot → Get system log

Via CLI:

aws ec2 get-console-output \
  --instance-id i-0abc1234567890def \
  --region us-east-1 \
  --query Output \
  --output text

Look for these patterns:

Kernel panic - not syncing → kernel or initrd issue
UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY → filesystem corruption
Timed out waiting for device, or Dependency failed for a mount point → bad /etc/fstab entry
Out of memory: Kill process → OOM killer fired
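
If you check logs often, the pattern matching can be scripted. The following is a minimal sketch, not an AWS tool: classify_console_log is a hypothetical helper name, the category labels are made up for illustration, and it only covers the kernel panic, fsck, and OOM signatures listed above.

```shell
# Sketch: map known console-log signatures to a likely cause.
# Reads the log on stdin and prints one label per matched signature.
# classify_console_log and its labels are hypothetical names.
classify_console_log() (
  log=$(cat)
  found=0
  case "$log" in
    *'Kernel panic - not syncing'*) echo 'kernel-or-initrd'; found=1 ;;
  esac
  case "$log" in
    *'UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY'*) echo 'filesystem-corruption'; found=1 ;;
  esac
  case "$log" in
    # Matches both "Kill process" and the newer "Killed process" wording
    *'Out of memory: Kill'*) echo 'oom-killer'; found=1 ;;
  esac
  [ "$found" -eq 1 ] || echo 'no-known-pattern'
)
```

Usage: pipe the console output into it, for example aws ec2 get-console-output --instance-id i-0abc1234567890def --query Output --output text | classify_console_log. A result of no-known-pattern only means none of these strings appeared; read the log by hand in that case.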

Fix 1 — Stop and Start the Instance (Not Reboot)

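A stop + start usually moves the instance to a different underlying host. A reboot stays on the same one. Try this first: it resolves transient host or hypervisor issues without any OS-level work. Two caveats before you stop: data on instance store volumes is lost, and the public IPv4 address changes unless the instance has an Elastic IP.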

aws ec2 stop-instances --instance-ids i-0abc1234567890def
aws ec2 wait instance-stopped --instance-ids i-0abc1234567890def
aws ec2 start-instances --instance-ids i-0abc1234567890def

Wait 2–3 minutes, then check status:

aws ec2 describe-instance-status \
  --instance-ids i-0abc1234567890def \
  --query 'InstanceStatuses[0].InstanceStatus.Status'

Returns ok? Done. Still failing? Move to the next fix.
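
Instead of waiting a fixed few minutes, you can poll until the check settles. The AWS CLI ships a waiter for this (aws ec2 wait instance-status-ok). The sketch below shows the same idea as an explicit loop so you can control the timeout; wait_for_instance_ok is a hypothetical helper name, and the default attempt count and delay are arbitrary assumptions.

```shell
# Sketch: poll the instance status check until it reports "ok".
# Arguments: instance ID, max attempts (default 20), delay in seconds (default 15).
# wait_for_instance_ok is a hypothetical name, not part of the AWS CLI.
wait_for_instance_ok() (
  id=$1
  attempts=${2:-20}
  delay=${3:-15}
  i=0
  status=unknown
  while [ "$i" -lt "$attempts" ]; do
    status=$(aws ec2 describe-instance-status --instance-ids "$id" \
      --query 'InstanceStatuses[0].InstanceStatus.Status' --output text)
    if [ "$status" = "ok" ]; then
      echo "ok"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "instance check still '$status' after $attempts attempts" >&2
  return 1
)
```

Usage: wait_for_instance_ok i-0abc1234567890def 20 15 && echo "reachable again". A non-zero exit means the check never reached ok within the attempts given, which is the signal to move on to the next fix.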

Fix 2 — Repair via EC2 Serial Console

SSH is down but the OS still boots far enough to offer a login prompt? The EC2 Serial Console connects you to the instance's serial port, so it needs no network and no SSH.

EC2 Console → Instance → Actions → Monitor and troubleshoot → EC2 Serial Console → Connect

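Two catches: this only works on Nitro-based instances (t3, m5, c5, r5, and most current-gen types), and logging in requires an OS user that already has a password set, because SSH keys do not apply on the serial port. Enable it for your account first: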

aws ec2 enable-serial-console-access --region us-east-1

Once connected, check for the usual suspects:

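# Check disk space
df -h

# Identify the root device (on Nitro instances it has an NVMe name such as /dev/nvme0n1p1)
lsblk -f

# Read-only check: -n changes nothing, the only safe mode while the filesystem is mounted.
# Substitute the device lsblk reported. Do real repairs from a rescue instance (Fix 3).
sudo fsck -n /dev/nvme0n1p1

# Check /etc/fstab for bad entries
cat /etc/fstab

# Comment out any suspicious mount to test boot
sudo nano /etc/fstab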
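
On a box with many mounts, reading df output by eye is error-prone. As a rough sketch, the filter below prints only the mount points at or above a usage threshold. full_mounts is a hypothetical helper name, the 90% default is an arbitrary assumption, and it expects the POSIX layout produced by df -P (mount points containing spaces are not handled).

```shell
# Sketch: list mount points whose usage is at or above a threshold.
# Reads `df -P` output on stdin; first argument is the threshold in percent.
# full_mounts is a hypothetical name.
full_mounts() (
  threshold=${1:-90}
  awk -v t="$threshold" '
    NR > 1 {                      # skip the header line
      use = $5
      sub(/%/, "", use)           # "97%" -> "97"
      if (use + 0 >= t) print $6 " " $5
    }'
)
```

Usage: df -P | full_mounts 90. An empty result means no filesystem crossed the threshold, so disk space is probably not what stopped the boot.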

Fix 3 — Detach Root Volume and Repair from Another Instance

Instance completely dead and serial console isn't an option? Detach the root volume, mount it on a working instance, and fix it from outside. Sounds involved, but it's reliable — and you won't lose data.

1. Stop the broken instance (not terminate).

2. Detach its root volume:

aws ec2 detach-volume --volume-id vol-0abc123def456

3. Attach it to a rescue instance in the same Availability Zone as a secondary volume (e.g., /dev/xvdf):

aws ec2 attach-volume \
  --volume-id vol-0abc123def456 \
  --instance-id i-0rescue123456 \
  --device /dev/xvdf

4. SSH into the rescue instance and mount the volume:

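# Find the attached volume (on Nitro rescue instances /dev/xvdf may appear as e.g. /dev/nvme1n1)
lsblk -f

# Run fsck while the partition is still unmounted (ext2/3/4; for XFS use xfs_repair instead)
sudo fsck -y /dev/xvdf1

# Mount it (for an XFS volume built from the same AMI as the rescue instance, add -o nouuid)
sudo mkdir -p /mnt/recovery
sudo mount /dev/xvdf1 /mnt/recovery

# Fix /etc/fstab
sudo nano /mnt/recovery/etc/fstab

# Find what's eating disk space
df -h /mnt/recovery
sudo du -sh /mnt/recovery/var/log/* | sort -rh | head -20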

5. Unmount, detach, reattach to the original instance under its original root device name (/dev/xvda on many AMIs, /dev/sda1 on others), then start it.
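
The steps above need two facts you may not have to hand: the root volume's ID and whether the rescue instance sits in the same Availability Zone (a volume can only attach within its own zone). A possible sketch for looking both up follows. The function names root_volume_id, instance_az, and same_az are hypothetical; the describe-instances fields they query are standard.

```shell
# Sketch: look up the root EBS volume of an instance and compare AZs.
# All three function names are hypothetical helpers.

# Print the volume ID attached at the instance's root device name.
root_volume_id() (
  id=$1
  root_dev=$(aws ec2 describe-instances --instance-ids "$id" \
    --query 'Reservations[0].Instances[0].RootDeviceName' --output text)
  aws ec2 describe-instances --instance-ids "$id" \
    --query "Reservations[0].Instances[0].BlockDeviceMappings[?DeviceName=='$root_dev'].Ebs.VolumeId | [0]" \
    --output text
)

# Print the Availability Zone of an instance.
instance_az() (
  aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].Placement.AvailabilityZone' --output text
)

# Succeed only if both instances are in the same Availability Zone.
same_az() (
  [ "$(instance_az "$1")" = "$(instance_az "$2")" ]
)
```

Usage: same_az i-0abc1234567890def i-0rescue123456 || echo "pick a rescue instance in the same AZ", then pass the output of root_volume_id to detach-volume and attach-volume. The root device name printed in the first lookup is also the name to use when you reattach the volume in step 5.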

Fix 4 — Bad fstab Entry (Most Common After Adding Mounts)

Here's a classic trap. Add an EFS mount or extra EBS volume to /etc/fstab, skip the nofail option, and the next time that resource is unavailable the instance hangs at boot. Completely bricked — over one missing flag.

Always use this format:

# EBS volume — nofail prevents boot hang if volume detaches
UUID=xxxx-xxxx  /data  ext4  defaults,nofail  0  2

# EFS mount — nofail + _netdev (wait for network before mounting)
fs-12345.efs.us-east-1.amazonaws.com:/ /mnt/efs efs defaults,_netdev,nofail 0 0

The _netdev flag tells systemd to wait until the network is up before attempting the mount. Both flags together make network mounts safe to put in fstab.
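
To catch a missing nofail before it costs you a boot, you could lint the file. The sketch below prints the mount point of every entry that lacks nofail, skipping comments, the root filesystem, and swap. fstab_missing_nofail is a hypothetical helper name, and exempting only / and swap is an assumption; you may also want to exempt /boot on your images.

```shell
# Sketch: report fstab mount points that lack the nofail option.
# Argument: path to an fstab file (defaults to /etc/fstab).
# fstab_missing_nofail is a hypothetical name.
fstab_missing_nofail() (
  awk '
    /^[[:space:]]*#/ || NF < 4 { next }     # skip comments, blank and short lines
    $2 == "/" || $3 == "swap"  { next }     # root and swap are required at boot
    $4 !~ /(^|,)nofail(,|$)/   { print $2 } # options field has no nofail
  ' "${1:-/etc/fstab}"
)
```

Usage: run fstab_missing_nofail with no arguments on a live instance, or point it at /mnt/recovery/etc/fstab on a rescue instance. Any mount point it prints is a candidate for the boot hang described above.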

Fix 5 — OOM / Memory Issues

Spot lines like Out of memory: Kill process 1234 (mysqld) in the log? The kernel ran out of RAM and started terminating processes — probably critical ones. Two options:

Resize to a larger instance type (stop → change instance type → start)
Add swap to buy time, then address the root cause properly

# Add 2GB swap after recovery
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

Swap isn't a substitute for enough RAM — under load it'll slow the instance significantly. Treat it as a buffer while you plan a proper resize.
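
Before choosing between a resize and swap, it helps to know which processes the kernel actually killed. A rough sketch that tallies OOM victims from a log follows; oom_victims is a hypothetical helper name, and the regular expression assumes the usual "Out of memory: Kill process PID (name)" or "Killed process PID (name)" wording.

```shell
# Sketch: count OOM-killer victims by process name, most frequent first.
# Reads a console or kernel log on stdin. oom_victims is a hypothetical name.
oom_victims() (
  grep -oE 'Out of memory: Kill(ed)? process [0-9]+ \([^)]+\)' |
    sed -E 's/.*\(([^)]+)\)$/\1/' |   # keep only the process name
    sort | uniq -c | sort -rn
)
```

Usage: aws ec2 get-console-output --instance-id i-0abc1234567890def --query Output --output text | oom_victims. If one process dominates the tally, tuning or capping that process may be a better fix than a larger instance.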

Verification

After the fix, confirm both checks pass:

aws ec2 describe-instance-status \
  --instance-ids i-0abc1234567890def \
  --query 'InstanceStatuses[0].{System:SystemStatus.Status,Instance:InstanceStatus.Status}'

Expected output:

{
  "System": "ok",
  "Instance": "ok"
}

Then verify SSH access and confirm your application is actually running, not just the instance.

Prevention

Make nofail a team rule for every non-root volume and network mount in /etc/fstab — no exceptions
Set a CloudWatch alarm on StatusCheckFailed_Instance so you know before your users do:

aws cloudwatch put-metric-alarm \
  --alarm-name ec2-instance-check-i-0abc123 \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_Instance \
  --dimensions Name=InstanceId,Value=i-0abc1234567890def \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:your-alert-topic

Enable EC2 Auto Recovery for system-level failures — it automatically migrates the instance to healthy hardware when AWS-side hardware fails:

aws cloudwatch put-metric-alarm \
  --alarm-name ec2-auto-recover-i-0abc123 \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-0abc1234567890def \
  --statistic Minimum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:automate:us-east-1:ec2:recover

Snapshot your root EBS volume on a schedule, weekly at minimum. A snapshot from before a bad kernel upgrade cuts recovery from hours to minutes.
Test kernel upgrades in staging first, and include a reboot in the test. A new kernel only takes effect at the next boot, so a bad one tends to surface as a reachability failure on the first restart after a package update, which can be days after the update itself.
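
The snapshot item in the list above could be scripted along these lines. This is a minimal sketch, assuming you run it from cron or a similar scheduler with credentials that allow ec2:CreateSnapshot; snapshot_volume is a hypothetical helper name and the description text is arbitrary. For managed schedules and retention, Amazon Data Lifecycle Manager is the AWS-native alternative.

```shell
# Sketch: take a dated snapshot of one EBS volume and print the snapshot ID.
# Argument: volume ID. snapshot_volume is a hypothetical name.
snapshot_volume() (
  vol=$1
  aws ec2 create-snapshot \
    --volume-id "$vol" \
    --description "scheduled root snapshot $(date -u +%Y-%m-%d)" \
    --query SnapshotId \
    --output text
)
```

Usage: snapshot_volume vol-0abc123def456 before any kernel or package upgrade, in addition to the weekly schedule. Snapshots taken this way are never cleaned up automatically, so pair the script with a retention policy.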