Fix AWS ALB 503 Service Unavailable: No Healthy Targets in Target Group

The Error

Your ALB is up. Your EC2 instances are running. But every single request comes back with:

503 Service Temporarily Unavailable

The ALB has no healthy targets in the Target Group. Every target is either failing health checks, draining, or missing entirely. Traffic has nowhere to go.

Why This Happens

The ALB probes each target on a fixed interval using your health check settings. Fail N consecutive checks, and the target gets marked unhealthy — no more traffic. The usual suspects:

App server crashed or never started
Health check path returns a status outside the configured success codes, 200 by default (app returning 404 or 500 on /health)
Security group blocks the ALB from reaching the instance on the health check port
Wrong port configured in Target Group (app listens on 8080, TG says 80)
Instance just launched and hasn't passed the healthy threshold yet
Target Group is empty — no instances registered

Step 1: Check Target Health Status

Start here. The AWS CLI will tell you exactly what the ALB thinks of each target:

# Replace with your Target Group ARN
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:ap-northeast-1:123456789012:targetgroup/my-tg/abc123

Sample output:

{
  "TargetHealthDescriptions": [
    {
      "Target": { "Id": "i-0abc123def456", "Port": 80 },
      "TargetHealth": {
        "State": "unhealthy",
        "Reason": "Target.ResponseCodeMismatch",
        "Description": "Health checks failed with these codes: [502]"
      }
    }
  ]
}

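The Reason field is your first clue. Common ones:

Target.ResponseCodeMismatch: the app answered, but with a status code outside the configured success codes
Target.FailedHealthChecks: the connection to the target failed or the response was malformed
Target.Timeout: no response within the timeout window
Target.NotRegistered: instance isn't in the Target Group at all
Elb.InitialHealthChecking: still warming up; a new target needs roughly interval times healthy threshold to pass (up to about 150 seconds with the defaults)
Target.DeregistrationInProgress: draining in progress, wait it out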
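
The reason codes lend themselves to a small lookup. The sketch below is one way to do it, assuming the AWS CLI is installed and configured. The function names triage_reason and triage_target_group and the hint texts are illustrative, not part of any AWS tooling; the reason codes themselves are ones documented for describe-target-health.

```shell
# Map a describe-target-health reason code to a suggested next step.
# The hint wording is illustrative; adjust it to your own runbook.
triage_reason() {
  case $1 in
    Target.ResponseCodeMismatch)
      echo "check the health check path and the configured success codes" ;;
    Target.FailedHealthChecks)
      echo "check that the app is running and answering on the health check port" ;;
    Target.Timeout)
      echo "check security groups, then app response time" ;;
    Target.NotRegistered)
      echo "register the instance in the Target Group" ;;
    Elb.InitialHealthChecking|Elb.RegistrationInProgress)
      echo "wait for the initial health checks to finish" ;;
    Target.DeregistrationInProgress)
      echo "wait for draining to finish" ;;
    *)
      echo "see the Description field" ;;
  esac
}

# Print every target that is not healthy, with a hint for each.
# $1 is the Target Group ARN.
triage_target_group() {
  aws elbv2 describe-target-health --target-group-arn "$1" \
    --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].[Target.Id,TargetHealth.State,TargetHealth.Reason]" \
    --output text |
  while read -r id state reason; do
    echo "$id ($state, $reason): $(triage_reason "$reason")"
  done
}
```

Call it as triage_target_group followed by your Target Group ARN. No output means every registered target is healthy, or nothing is registered at all.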

Step 2: Verify Security Groups

This one kills people at 2 AM. The ALB needs a clear path to your instances on the health check port — and security groups are the most common roadblock.

# Get the security group attached to your instances
aws ec2 describe-instances --instance-ids i-0abc123def456 \
  --query 'Reservations[].Instances[].SecurityGroups'

# Check inbound rules on that security group
aws ec2 describe-security-groups --group-ids sg-0123456789 \
  --query 'SecurityGroups[].IpPermissions'

The instance's security group must allow inbound traffic from the ALB's security group on the health check port (TCP 80, 443, or whatever your app uses). If no rule covers that port, or the rules that cover it reference neither the ALB SG nor a CIDR range that includes the ALB's subnets, you found it.

Fix it via CLI:

aws ec2 authorize-security-group-ingress \
  --group-id sg-INSTANCE_SG \
  --protocol tcp \
  --port 80 \
  --source-group sg-ALB_SG

Or in the console: EC2 → Security Groups → your instance SG → Inbound rules → Add rule → Custom TCP, port 80, Source = ALB's security group ID.
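
To avoid reading the raw IpPermissions JSON by eye, the check can be narrowed to rules that name the ALB's security group as their source. This is a minimal sketch, assuming the AWS CLI is configured; rules_from_alb and covers_port are hypothetical helper names, and the sketch only looks at security-group-referencing rules, not CIDR rules.

```shell
# List inbound rules on the instance SG whose source is the ALB SG.
# Prints one "protocol from-port to-port" row per matching rule.
rules_from_alb() {
  local instance_sg=$1 alb_sg=$2
  aws ec2 describe-security-groups --group-ids "$instance_sg" \
    --query "SecurityGroups[].IpPermissions[] | [?UserIdGroupPairs[?GroupId=='$alb_sg']].[IpProtocol,FromPort,ToPort]" \
    --output text
}

# Read those rows on stdin and succeed if any of them covers the given port.
# Protocol "-1" means all traffic; otherwise the port must fall in a tcp range.
covers_port() {
  local port=$1 proto from to
  while read -r proto from to; do
    [ "$proto" = "-1" ] && return 0
    [ "$proto" = "tcp" ] && [ "$from" -le "$port" ] && [ "$to" -ge "$port" ] && return 0
  done
  return 1
}
```

Usage: rules_from_alb sg-INSTANCE_SG sg-ALB_SG | covers_port 80 && echo "ALB can reach port 80". If nothing is printed, add the rule with authorize-security-group-ingress.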

Step 3: Test the Health Check Endpoint Directly

SSH in and curl the exact URL the ALB is probing:

ssh ec2-user@your-instance-ip

# Test the health check endpoint
curl -v http://localhost:80/health

# If the app is on a different port
curl -v http://localhost:8080/health

The response must return HTTP 200, or whichever success codes you've configured in the Target Group. Got a 404? Wrong path. Got a 500? The app is broken. Connection refused? Nothing is listening on that port. Request just hangs? The app accepted the connection but isn't answering, which the ALB sees as a timeout.

Confirm what's actually listening:

ss -tlnp | grep LISTEN
# or
netstat -tlnp
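
The same distinction can be read from curl's exit code instead of its verbose output. The sketch below assumes curl is installed on the instance; probe_health is a hypothetical helper name, and the 5-second default mirrors the default ALB health check timeout. Exit codes 7 and 28 are curl's documented codes for a failed connection and a timeout.

```shell
# Probe a health check URL the way the ALB would and summarize the result.
# $1 is the URL, $2 an optional timeout in seconds (default 5).
probe_health() {
  local url=$1 timeout=${2:-5} code rc=0
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time "$timeout" "$url") || rc=$?
  case $rc in
    0)  echo "HTTP $code" ;;
    7)  echo "connection failed: nothing is listening on that port" ;;
    28) echo "timed out after ${timeout}s: the app is listening but not answering in time" ;;
    *)  echo "curl failed with exit code $rc" ;;
  esac
}
```

Run it as probe_health http://localhost:8080/health and compare the printed status with the success codes configured on the Target Group.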

Step 4: Review Health Check Settings in Target Group

Misconfigured health check settings are a sneaky cause — especially after someone tweaks the TG without updating the app. Check via CLI:

aws elbv2 describe-target-groups \
  --target-group-arns arn:aws:elasticloadbalancing:...

Match these against your actual app behavior:

Protocol: HTTP or HTTPS — must match what your app serves
Port: must match the port your app listens on
Path: e.g., /health or /ping; must return one of the configured success codes (200 by default)
Healthy threshold: consecutive successes needed to flip to healthy (default: 5)
Unhealthy threshold: consecutive failures before marked unhealthy (default: 2)
Timeout: seconds ALB waits for a response (default: 5s)
Interval: how often to probe (default: 30s)

Under load, apps sometimes take 6–8 seconds to respond to health checks. If your timeout is 5s, every probe times out and the target stays unhealthy. Bump the timeout. For faster recovery, drop the interval to 10–15s and the healthy threshold to 2.

aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --health-check-path /health \
  --health-check-interval-seconds 15 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --health-check-timeout-seconds 10
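
To see what those numbers buy you, multiply the interval by the relevant threshold. This is a rough estimate only (the first probe does not necessarily land on an interval boundary, and timed-out probes add their own wait); seconds_to_flip is a hypothetical helper name.

```shell
# Rough time for a target to change state, in seconds:
# interval between checks times the number of consecutive checks required.
seconds_to_flip() {
  local interval=$1 threshold=$2
  echo $(( interval * threshold ))
}

# Defaults: interval 30s, healthy threshold 5, unhealthy threshold 2
echo "defaults: healthy after ~$(seconds_to_flip 30 5)s, unhealthy after ~$(seconds_to_flip 30 2)s"
# Tuned values from the modify-target-group command: interval 15s, thresholds 2 and 3
echo "tuned:    healthy after ~$(seconds_to_flip 15 2)s, unhealthy after ~$(seconds_to_flip 15 3)s"
```

With the defaults a recovered target waits about 150 seconds before it receives traffic again; with the tuned values that drops to about 30 seconds, while a failing target is pulled after about 45 seconds instead of 60.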

Step 5: Check Registered Targets

Worth confirming the obvious — are any instances actually registered?

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --query 'TargetHealthDescriptions[].Target'

Empty list? Register your instances:

aws elbv2 register-targets \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --targets Id=i-0abc123def456,Port=80

Verify the Fix

Once you've made changes, poll the target health until you see it flip:

# Poll every 10 seconds
watch -n 10 'aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --query "TargetHealthDescriptions[].TargetHealth"'

The moment you see "State": "healthy" for at least one target, test the ALB:

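curl -I http://your-alb-dns-name.ap-northeast-1.elb.amazonaws.com/
# Expected: HTTP/1.1 200 OK
# For an HTTPS listener, request the domain name on your certificate instead;
# the certificate will not match the raw ALB DNS name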
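
If you would rather script this than watch a terminal, the AWS CLI has a built-in waiter for target health. The sketch below wraps it; wait_and_probe is a hypothetical helper name, and the URL you pass is whatever address your ALB answers on.

```shell
# Block until the targets report healthy, then send one request through the ALB.
# "aws elbv2 wait target-in-service" polls describe-target-health for you and
# exits non-zero if the targets do not become healthy in time.
wait_and_probe() {
  local tg_arn=$1 url=$2
  if aws elbv2 wait target-in-service --target-group-arn "$tg_arn"; then
    # Print only the status line of the response
    curl -sI "$url" | head -n 1
  else
    echo "targets did not become healthy; check describe-target-health for the reason" >&2
    return 1
  fi
}
```

One difference from the manual check: the waiter looks at every target it is given (the whole group unless you pass --targets), so it can keep waiting even though one healthy target is already enough to clear the 503.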

Useful Tips

ALB Access Logs: Enable them in ALB attributes. Every log line includes the target's response code — invaluable when you're chasing intermittent 503s that disappear before you can SSH in.
CloudWatch alarms: Add an alarm on HealthyHostCount < 1 in your Target Group metrics. You want to know about this before your users do.
Auto Scaling Groups: Make sure the ASG is attached to the right Target Group. Also check that instances can pass the health check before the warmup (or health check grace) period ends: a 300-second window with a 30-second check interval leaves room for about 10 probes, and the target needs 5 consecutive passes (the default healthy threshold) within them.
ECS services: The container must expose the correct port and pass its container health check first. ALB won't route to a task that ECS itself considers unhealthy.
Rolling deploys: If old instances are all draining at once, you can hit 503 briefly. Stagger your deployments or set minimum healthy percentage to 100% to prevent this.
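
The CloudWatch alarm from the tips above can be created from the CLI. This is a sketch under a few assumptions: the alarm name, the 60-second period, the two evaluation periods, and the SNS topic you pass in are all placeholders to adapt, and the helper names are hypothetical. The metric, namespace, and dimension formats are the ones CloudWatch uses for ALB Target Groups.

```shell
# CloudWatch identifies a Target Group by the tail of its ARN
# ("targetgroup/NAME/ID") and an ALB by "app/NAME/ID".
tg_dimension() { echo "${1##*:}"; }
lb_dimension() { echo "${1#*:loadbalancer/}"; }

# Alarm when the Target Group has no healthy hosts.
# $1 = Target Group ARN, $2 = load balancer ARN, $3 = SNS topic ARN to notify.
create_no_healthy_hosts_alarm() {
  local tg_arn=$1 lb_arn=$2 sns_topic_arn=$3
  aws cloudwatch put-metric-alarm \
    --alarm-name "no-healthy-targets" \
    --namespace AWS/ApplicationELB \
    --metric-name HealthyHostCount \
    --dimensions "Name=TargetGroup,Value=$(tg_dimension "$tg_arn")" \
                 "Name=LoadBalancer,Value=$(lb_dimension "$lb_arn")" \
    --statistic Minimum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --alarm-actions "$sns_topic_arn"
}
```

Using the Minimum statistic means a single minute with zero healthy hosts counts against the alarm, rather than being averaged away.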