The Error
Your ALB is up. Your EC2 instances are running. But every single request comes back with:
503 Service Temporarily Unavailable - No healthy upstream
The ALB has no healthy targets in the Target Group. Every target is either failing health checks, draining, or missing entirely. Traffic has nowhere to go.
Why This Happens
The ALB probes each target on a fixed interval using your health check settings. Fail N consecutive checks, and the target gets marked unhealthy and stops receiving traffic. The usual suspects:
- App server crashed or never started
- Health check path returns non-2xx (app returning 404 or 500 on /health)
- Security group blocks the ALB from reaching the instance on the health check port
- Wrong port configured in Target Group (app listens on 8080, TG says 80)
- Instance just launched and hasn't passed the healthy threshold yet
- Target Group is empty: no instances registered
Step 1: Check Target Health Status
Start here. The AWS CLI will tell you exactly what the ALB thinks of each target:
# Replace with your Target Group ARN
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:ap-northeast-1:123456789:targetgroup/my-tg/abc123
Sample output:
{
"TargetHealthDescriptions": [
{
"Target": { "Id": "i-0abc123def456", "Port": 80 },
"TargetHealth": {
"State": "unhealthy",
"Reason": "Target.FailedHealthChecks",
"Description": "Health checks failed with these codes: [502]"
}
}
]
}
The Reason field is your first clue. Common ones:
- Target.FailedHealthChecks: app is returning bad status codes
- Target.Timeout: no response within the timeout window
- Target.NotRegistered: instance isn't in the Target Group at all
- Elb.InitialHealthChecking: still warming up, give it 30 to 60 seconds
- Target.DeregistrationInProgress: draining in progress, wait it out
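As a rough aid, the reason codes above can be mapped to a next step in a small shell helper. This is only a sketch: the function name `next_step_for_reason` and the suggested actions are illustrative and are not produced by the AWS CLI.

```shell
# Map a describe-target-health Reason code to a suggested next step.
# The function name and the wording of each hint are hypothetical.
next_step_for_reason() {
  case "$1" in
    Target.FailedHealthChecks)       echo "curl the health check path on the instance and check the status code" ;;
    Target.Timeout)                  echo "check security groups, the port, and the app's response time" ;;
    Target.NotRegistered)            echo "register the instance with the Target Group" ;;
    Elb.InitialHealthChecking)       echo "wait for the initial health checks to finish" ;;
    Target.DeregistrationInProgress) echo "wait for draining to complete" ;;
    *)                               echo "unrecognized reason: $1" ;;
  esac
}
```

Calling `next_step_for_reason Target.Timeout` prints the hint for a timed-out target; any code not covered here falls through to the "unrecognized" branch.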
Step 2: Verify Security Groups
This one kills people at 2 AM. The ALB needs a clear path to your instances on the health check port, and security groups are the most common roadblock.
# Get the security group attached to your instances
aws ec2 describe-instances --instance-ids i-0abc123def456 \
--query 'Reservations[].Instances[].SecurityGroups'
# Check inbound rules on that security group
aws ec2 describe-security-groups --group-ids sg-0123456789 \
--query 'SecurityGroups[].IpPermissions'
The instance's security group must allow inbound traffic from the ALB's security group on the health check port: TCP 80, 443, or whatever your app uses. If no inbound rule covers that port with the ALB SG (or a CIDR range that includes the ALB's subnets) as its source, you found it.
Fix it via CLI:
aws ec2 authorize-security-group-ingress \
--group-id sg-INSTANCE_SG \
--protocol tcp \
--port 80 \
--source-group sg-ALB_SG
Or in the console: EC2 → Security Groups → your instance SG → Inbound rules → Add rule → Custom TCP, port 80, Source = ALB's security group ID.
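To confirm from a script that the instance SG references the ALB SG, something like the following could work. It is a minimal sketch: the function name `sg_allows_alb` is made up, both group IDs are placeholders you supply, and it checks only the rule's source, not the port range.

```shell
# Succeeds (exit 0) if the instance SG has any inbound rule whose source
# is the ALB SG. Hypothetical helper; it does NOT verify the port range.
sg_allows_alb() {
  instance_sg="$1"
  alb_sg="$2"
  matches=$(aws ec2 describe-security-groups \
    --group-ids "$instance_sg" \
    --query "SecurityGroups[].IpPermissions[].UserIdGroupPairs[] | [?GroupId=='$alb_sg'].GroupId" \
    --output text)
  # Treat empty output (and a literal "None") as "no matching rule".
  [ -n "$matches" ] && [ "$matches" != "None" ]
}
```

Usage: `sg_allows_alb sg-INSTANCE_SG sg-ALB_SG && echo "rule present" || echo "rule missing"`.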
Step 3: Test the Health Check Endpoint Directly
SSH in and curl the exact URL the ALB is probing:
ssh ec2-user@your-instance-ip
# Test the health check endpoint
curl -v http://localhost:80/health
# If the app is on a different port
curl -v http://localhost:8080/health
The response must be HTTP 200, or whichever success codes you've configured in the Target Group. Got a 404? Wrong path. Got a 500? The app is broken. Connection refused? Nothing is listening on that port. Request just hangs? The app accepted the connection but isn't answering in time.
Confirm what's actually listening:
ss -tlnp | grep LISTEN
# or
netstat -tlnp
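To mimic the ALB's probe more closely, you can give curl a hard timeout and look only at the status code. The sketch below assumes the default success matcher of 200 and a 5-second timeout; the function names and default URL are illustrative, so adjust them to your Target Group settings.

```shell
# Classify an HTTP status the way a success matcher of "200" would.
# curl reports 000 when it received no HTTP response at all.
classify_status() {
  case "$1" in
    200) echo "pass" ;;
    000) echo "no-response" ;;
    *)   echo "fail" ;;
  esac
}

# Probe a URL with a hard timeout and print "<status> <verdict>".
# The default URL and timeout are assumptions for this sketch.
probe_health() {
  url="${1:-http://localhost:80/health}"
  timeout="${2:-5}"
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time "$timeout" "$url")
  echo "$status $(classify_status "$status")"
}
```

Run `probe_health http://localhost:8080/health 5` on the instance. A `no-response` verdict covers both a refused connection and a request that ran past the timeout.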
Step 4: Review Health Check Settings in Target Group
Misconfigured health check settings are a sneaky cause, especially after someone tweaks the TG without updating the app. Check via CLI:
aws elbv2 describe-target-groups \
--target-group-arns arn:aws:elasticloadbalancing:...
Match these against your actual app behavior:
- Protocol: HTTP or HTTPS; must match what your app serves
- Port: must match the port your app listens on
- Path: e.g., /health or /ping; must return a success code the Target Group accepts (200 by default)
- Healthy threshold: consecutive successes needed to flip to healthy (default: 5)
- Unhealthy threshold: consecutive failures before marked unhealthy (default: 2)
- Timeout: seconds ALB waits for a response (default: 5s)
- Interval: how often to probe (default: 30s)
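The full describe-target-groups output is noisy, so it can help to pull out only the fields listed above. A possible sketch, where `show_health_check` is a hypothetical wrapper and the ARN is whatever you pass in:

```shell
# Print only the health check fields of one Target Group as a small table.
# Hypothetical helper; pass your own Target Group ARN as the first argument.
show_health_check() {
  aws elbv2 describe-target-groups \
    --target-group-arns "$1" \
    --query 'TargetGroups[0].{Protocol:HealthCheckProtocol,Port:HealthCheckPort,Path:HealthCheckPath,HealthyThreshold:HealthyThresholdCount,UnhealthyThreshold:UnhealthyThresholdCount,TimeoutSeconds:HealthCheckTimeoutSeconds,IntervalSeconds:HealthCheckIntervalSeconds,SuccessCodes:Matcher.HttpCode}' \
    --output table
}
```

The keys on the left of each pair are labels chosen for this sketch; the names on the right are the fields returned by the API.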
Under load, apps sometimes take 6 to 8 seconds to respond to health checks. If your timeout is 5s, every probe times out and the target stays unhealthy. Bump the timeout. For faster recovery, drop the interval to 10 to 15s and the healthy threshold to 2.
aws elbv2 modify-target-group \
--target-group-arn arn:aws:elasticloadbalancing:... \
--health-check-path /health \
--health-check-interval-seconds 15 \
--healthy-threshold-count 2 \
--unhealthy-threshold-count 3 \
--health-check-timeout-seconds 10
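To see what these numbers buy you, multiply the interval by the threshold. This is a best-case estimate that ignores probe latency and where in the interval the first probe lands; `seconds_to_flip` is just an illustrative name.

```shell
# Best-case seconds for a target to change state: interval x threshold.
seconds_to_flip() {
  interval="$1"
  threshold="$2"
  echo $(( interval * threshold ))
}

seconds_to_flip 30 5   # defaults, unhealthy -> healthy: prints 150
seconds_to_flip 15 2   # tuned settings, unhealthy -> healthy: prints 30
seconds_to_flip 15 3   # tuned settings, healthy -> unhealthy: prints 45
```

With the defaults, a recovered instance waits about two and a half minutes before it receives traffic again; the tuned values cut that to roughly 30 seconds.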
Step 5: Check Registered Targets
Worth confirming the obvious: are any instances actually registered?
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:... \
--query 'TargetHealthDescriptions[].Target'
Empty list? Register your instances:
aws elbv2 register-targets \
--target-group-arn arn:aws:elasticloadbalancing:... \
--targets Id=i-0abc123def456,Port=80
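If you want this check in a script, counting the registered targets is one option. A minimal sketch, with hypothetical function names and the ARN supplied by you:

```shell
# Print how many targets are registered in the Target Group.
registered_target_count() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query 'length(TargetHealthDescriptions)' \
    --output text
}

# Succeeds (exit 0) when the Target Group has no registered targets.
tg_is_empty() {
  [ "$(registered_target_count "$1")" = "0" ]
}
```

For example: `tg_is_empty "$TG_ARN" && echo "nothing registered"`, where `TG_ARN` is a variable you set yourself.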
Verify the Fix
Once you've made changes, poll the target health until you see it flip:
# Poll every 10 seconds
watch -n 10 'aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:... \
--query "TargetHealthDescriptions[].TargetHealth"'
The moment you see "State": "healthy" for at least one target, test the ALB:
curl -I https://your-alb-dns-name.ap-northeast-1.elb.amazonaws.com/
# Expected: HTTP/1.1 200 OK
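In a deploy script, a bounded polling loop may be more convenient than watch, because it exits on its own. The following is a sketch under assumptions: the function name, the 30-attempt limit, and the 10-second delay are arbitrary choices, not AWS defaults.

```shell
# Poll until at least one target is healthy, or give up after max_attempts.
# Arguments: Target Group ARN, max attempts (default 30), delay in seconds (default 10).
wait_for_healthy() {
  tg_arn="$1"
  max_attempts="${2:-30}"
  delay="${3:-10}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    healthy=$(aws elbv2 describe-target-health \
      --target-group-arn "$tg_arn" \
      --query "length(TargetHealthDescriptions[?TargetHealth.State=='healthy'])" \
      --output text)
    # Non-numeric output makes the test fail quietly and counts as "not yet".
    if [ "${healthy:-0}" -ge 1 ] 2>/dev/null; then
      echo "healthy targets: $healthy (attempt $attempt)"
      return 0
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
  echo "no healthy targets after $max_attempts attempts" >&2
  return 1
}
```

Usage: `wait_for_healthy "$TG_ARN" 30 10 && curl -I https://your-alb-dns-name.ap-northeast-1.elb.amazonaws.com/`.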
Useful Tips
- ALB Access Logs: Enable them in ALB attributes. Every log line includes the target's response code, which is invaluable when you're chasing intermittent 503s that disappear before you can SSH in.
- CloudWatch alarms: Add an alarm on HealthyHostCount < 1 in your Target Group metrics. You want to know about this before your users do.
- Auto Scaling Groups: Make sure the ASG is attached to the right Target Group. Also check that instances pass the health check before the warmup period ends: a 300-second warmup with a 30-second check interval leaves room for only about 10 probes.
- ECS services: The container must expose the correct port and pass its container health check first. ALB won't route to a task that ECS itself considers unhealthy.
- Rolling deploys: If old instances are all draining at once, you can hit 503 briefly. Stagger your deployments or set minimum healthy percentage to 100% to prevent this.
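The CloudWatch alarm tip can be set up from the CLI as well. The sketch below makes several assumptions: the alarm name, the 60-second period, and the single evaluation period are arbitrary, and the third argument is an SNS topic ARN you provide. The helper derives the dimension values from the Target Group and load balancer ARNs.

```shell
# Derive a CloudWatch dimension value from an ELBv2 ARN: everything after
# the last colon, minus a leading "loadbalancer/" if present.
cw_dimension_from_arn() {
  suffix="${1##*:}"
  echo "${suffix#loadbalancer/}"
}

# Alarm when the minimum HealthyHostCount drops below 1.
# Arguments: Target Group ARN, load balancer ARN, SNS topic ARN (all yours).
create_no_healthy_hosts_alarm() {
  tg_dim=$(cw_dimension_from_arn "$1")
  lb_dim=$(cw_dimension_from_arn "$2")
  aws cloudwatch put-metric-alarm \
    --alarm-name "no-healthy-hosts" \
    --namespace AWS/ApplicationELB \
    --metric-name HealthyHostCount \
    --dimensions "Name=TargetGroup,Value=$tg_dim" "Name=LoadBalancer,Value=$lb_dim" \
    --statistic Minimum \
    --period 60 \
    --evaluation-periods 1 \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --alarm-actions "$3"
}
```

Using the Minimum statistic means a single datapoint with zero healthy hosts is enough to trip the alarm; raise the evaluation periods if that turns out to be too noisy during deploys.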

