Fix AWS ECS ResourceInitializationError: unable to retrieve ecr registry auth

The situation

Deployed a new ECS task definition pointing to an ECR image. The task failed immediately during initialization — never even started the container. Checked the ECS console:

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed

No container logs. Nothing in CloudWatch. The task kept cycling into STOPPED state with exit code None.

Two things usually cause this: missing IAM permissions on the task execution role, or a network issue blocking the ECS agent from reaching ECR. Both surface under the same ResourceInitializationError, which is exactly what makes this a pain to debug.

Debug process

Step 1: Check the task execution role

The task execution role is separate from the task role. It's what the ECS agent uses to pull the image and fetch secrets — not what your application code uses at runtime.

Find the Task execution role in your task definition, then inspect it:

aws iam list-attached-role-policies --role-name ecsTaskExecutionRole

You need to see AmazonECSTaskExecutionRolePolicy in the output. If it's missing, that's your problem. This managed policy covers the bare minimum: ECR pulls and CloudWatch Logs writes.
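
If you'd rather not guess the role name, the lookup and the policy check can be chained. This is a minimal sketch: role_name_from_arn and has_execution_policy are hypothetical helper names, and my-task is a placeholder.

```shell
# Hypothetical helpers; the names are illustrative, not part of the AWS CLI.

# Strip everything up to the last "/" of a role ARN, leaving the role name.
role_name_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Read attached policy names on stdin (tab- or newline-separated, as produced
# by --output text) and succeed only if the managed policy is among them.
has_execution_policy() {
  tr '\t' '\n' | grep -qx 'AmazonECSTaskExecutionRolePolicy'
}

# Usage (needs AWS credentials; "my-task" is a placeholder):
#   arn=$(aws ecs describe-task-definition --task-definition my-task \
#           --query 'taskDefinition.executionRoleArn' --output text)
#   aws iam list-attached-role-policies \
#       --role-name "$(role_name_from_arn "$arn")" \
#       --query 'AttachedPolicies[].PolicyName' --output text \
#     | has_execution_policy && echo "policy attached" || echo "policy missing"
```

An empty or "None" value for executionRoleArn means the task definition has no execution role at all, which fails the same way.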

Next, check whether the trust relationship actually lets ECS assume this role:

aws iam get-role --role-name ecsTaskExecutionRole --query 'Role.AssumeRolePolicyDocument'

Expected output (the Statement array should contain an entry like this):

{
  "Effect": "Allow",
  "Principal": {
    "Service": "ecs-tasks.amazonaws.com"
  },
  "Action": "sts:AssumeRole"
}

A wrong or missing trust policy means ECS can't assume the role at all — and you never get past initialization.
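
For a scripted version of the same check, a plain text match on the trust policy is usually enough. A rough sketch, assuming the hypothetical helper name trusts_ecs_tasks; it only looks for the service principal and does not validate Effect or Action.

```shell
# Hypothetical helper: succeed if the trust policy JSON on stdin names the
# ECS tasks service principal. This is a text match only, so it does not
# confirm that the statement's Effect is Allow or its Action is sts:AssumeRole.
trusts_ecs_tasks() {
  grep -q '"ecs-tasks\.amazonaws\.com"'
}

# Usage (needs AWS credentials):
#   aws iam get-role --role-name ecsTaskExecutionRole \
#       --query 'Role.AssumeRolePolicyDocument' --output json \
#     | trusts_ecs_tasks || echo "ECS cannot assume this role"
```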

Step 2: Check network access (Fargate in private subnets)

This is the sneaky one. If the execution role looks fine, the culprit is almost certainly network connectivity. Fargate tasks in private subnets need to reach several AWS endpoints just to start:

api.ecr.<region>.amazonaws.com - ECR API (auth token)
<account-id>.dkr.ecr.<region>.amazonaws.com - Docker registry endpoint (image pulls)
s3.<region>.amazonaws.com - ECR stores image layers in S3
logs.<region>.amazonaws.com - CloudWatch Logs
secretsmanager.<region>.amazonaws.com or ssm.<region>.amazonaws.com - if using Secrets Manager or SSM Parameter Store

A private subnet with no NAT Gateway and no VPC endpoints can't route to any of these. The ECS agent silently times out — and you get exactly this error.
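
One way to see the timeout for yourself is to probe the same hostnames from a machine that shares the task's subnet and security group (an EC2 instance, for example; you can't run this inside the task that fails to start). A sketch under those assumptions: the ECR API host is api.ecr.<region>.amazonaws.com, the registry host is <account-id>.dkr.ecr.<region>.amazonaws.com, the function names are hypothetical, and curl is assumed to be installed.

```shell
# Print the hostnames a Fargate task must reach to pull from ECR and log.
# $1 = region, $2 = AWS account ID that owns the ECR repository.
required_hosts() {
  printf '%s\n' \
    "api.ecr.$1.amazonaws.com" \
    "$2.dkr.ecr.$1.amazonaws.com" \
    "s3.$1.amazonaws.com" \
    "logs.$1.amazonaws.com"
}

# Attempt an HTTPS connection to each host with a 5-second connect timeout.
# Any HTTP response (even 403 or 404) counts as reachable; only connection
# failures and timeouts are reported as FAILED.
check_hosts() {
  required_hosts "$1" "$2" | while read -r host; do
    if curl -s -o /dev/null --connect-timeout 5 "https://$host/"; then
      echo "ok      $host"
    else
      echo "FAILED  $host"
    fi
  done
}

# Usage (run from inside the VPC; the account ID is a placeholder):
#   check_hosts ap-northeast-1 123456789012
```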

Check what VPC endpoints already exist:

aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=vpc-xxxxxxxx" \
  --query 'VpcEndpoints[*].{Service:ServiceName,State:State}'
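
To turn that listing into a to-do list, compare it against the services the tasks need. A minimal sketch, assuming the hypothetical helper name missing_endpoints and covering only ECR, S3, and CloudWatch Logs (add secretsmanager or ssm if you use them).

```shell
# Hypothetical helper: read existing endpoint service names on stdin
# (tab- or newline-separated) and print the required ones that are absent.
# $1 = region. Runs in a subshell so its variables don't leak.
missing_endpoints() (
  existing=$(tr '\t' '\n')
  for svc in ecr.api ecr.dkr s3 logs; do
    name="com.amazonaws.$1.$svc"
    printf '%s\n' "$existing" | grep -qxF "$name" || echo "$name"
  done
)

# Usage (needs AWS credentials; the VPC ID is a placeholder):
#   aws ec2 describe-vpc-endpoints \
#       --filters "Name=vpc-id,Values=vpc-xxxxxxxx" \
#       --query "VpcEndpoints[?State=='available'].ServiceName" --output text \
#     | missing_endpoints ap-northeast-1
```

No output means all four endpoints exist and are available; it says nothing about their security groups or route tables.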

Step 3: Verify the ECR image URI

Less common, but worth a quick check. A wrong account ID or region in the URI causes auth failures too. This happens often when copying task definitions between regions — say, from us-east-1 to ap-northeast-1 — and forgetting to update the URI.

aws ecs describe-task-definition --task-definition my-task \
  --query 'taskDefinition.containerDefinitions[*].image'

Correct format: 123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/my-repo:tag
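
A quick format and region check catches the copy-paste case. This is a sketch, not a full validator: check_ecr_uri is a hypothetical name, and the pattern assumes the standard aws partition (amazonaws.com), so it would reject China-region URIs.

```shell
# Hypothetical helper: succeed if $1 looks like a private ECR image URI
# (12-digit account, region, repository, optional tag or digest) and, when
# $2 is given, if the region embedded in the URI equals $2.
check_ecr_uri() {
  _pattern='^[0-9]{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com/[a-z0-9._/-]+(:[A-Za-z0-9._-]+|@sha256:[0-9a-f]{64})?$'
  if ! printf '%s\n' "$1" | grep -Eq "$_pattern"; then
    echo "not a valid ECR URI: $1" >&2
    return 1
  fi
  _region=${1#*.dkr.ecr.}     # drop "<account>.dkr.ecr."
  _region=${_region%%.*}      # keep everything up to the next dot
  if [ -n "${2:-}" ] && [ "$_region" != "$2" ]; then
    echo "URI region $_region does not match $2" >&2
    return 1
  fi
}

# Usage:
#   check_ecr_uri 123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/my-repo:tag ap-northeast-1
```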

The fixes

Fix 1: Attach the correct managed policy to the execution role

aws iam attach-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

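Using Secrets Manager or SSM to inject environment variables? The execution role needs these additional permissions. kms:Decrypt only matters when the secret or parameter is encrypted with a customer managed key, and "Resource": "*" is here to keep the example short; in production, scope it to the specific secret, parameter, and key ARNs: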

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue",
        "ssm:GetParameters",
        "kms:Decrypt"
      ],
      "Resource": "*"
    }
  ]
}
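
For anything beyond a quick test, "Resource": "*" is broader than it needs to be. One possible way to narrow it, sketched with a hypothetical helper (scoped_secrets_policy) and a hypothetical inline policy name (ecs-secrets-access); the three ARNs are whatever your task definition actually references.

```shell
# Hypothetical helper: print an execution-role policy limited to one secret
# ($1), one SSM parameter ($2), and one KMS key ($3). All ARNs are supplied
# by the caller; nothing here is looked up from AWS.
scoped_secrets_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": "secretsmanager:GetSecretValue", "Resource": "$1" },
    { "Effect": "Allow", "Action": "ssm:GetParameters", "Resource": "$2" },
    { "Effect": "Allow", "Action": "kms:Decrypt", "Resource": "$3" }
  ]
}
EOF
}

# Usage (needs AWS credentials; the variables hold placeholder ARNs):
#   scoped_secrets_policy "$SECRET_ARN" "$PARAM_ARN" "$KEY_ARN" > policy.json
#   aws iam put-role-policy --role-name ecsTaskExecutionRole \
#       --policy-name ecs-secrets-access --policy-document file://policy.json
```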

Fix 2: Add VPC endpoints for private subnets (no NAT Gateway)

Create interface endpoints for ECR and CloudWatch Logs, plus a gateway endpoint for S3:

# ECR API endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxxxxxx \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.ap-northeast-1.ecr.api \
  --subnet-ids subnet-xxxxxxxx \
  --security-group-ids sg-xxxxxxxx \
  --private-dns-enabled

# ECR DKR endpoint (image layers)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxxxxxx \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.ap-northeast-1.ecr.dkr \
  --subnet-ids subnet-xxxxxxxx \
  --security-group-ids sg-xxxxxxxx \
  --private-dns-enabled

# S3 gateway endpoint (ECR stores layers here)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxxxxxx \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.ap-northeast-1.s3 \
  --route-table-ids rtb-xxxxxxxx

# CloudWatch Logs endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxxxxxx \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.ap-northeast-1.logs \
  --subnet-ids subnet-xxxxxxxx \
  --security-group-ids sg-xxxxxxxx \
  --private-dns-enabled

The security group on each VPC endpoint must allow inbound HTTPS (port 443) from the task subnets. Without this, the endpoints exist but traffic still can't get through.
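
The ingress rule itself might look like the sketch below. It references the tasks' security group as the source rather than a subnet CIDR, which is an assumption about your setup (a --cidr with the subnet range would work as well). allow_https_from_tasks is a hypothetical wrapper, and the AWS variable exists only so the command can be previewed with AWS=echo.

```shell
# Hypothetical wrapper: allow inbound HTTPS on the endpoint security group
# ($1) from the security group attached to the tasks ($2).
# Set AWS=echo to print the command instead of running it.
allow_https_from_tasks() {
  ${AWS:-aws} ec2 authorize-security-group-ingress \
    --group-id "$1" \
    --protocol tcp \
    --port 443 \
    --source-group "$2"
}

# Usage (both IDs are placeholders):
#   allow_https_from_tasks sg-endpoint0000 sg-tasks0000
```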

Fix 3: Alternative — assign a public IP (quick test only)

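Want to isolate whether the problem is network-related before setting up endpoints? Enable auto-assign public IP for a one-off test run. This only proves something in a subnet whose route table points at an internet gateway; a public IP in a subnet with no such route still can't get out: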

aws ecs run-task \
  --cluster my-cluster \
  --task-definition my-task \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-xxxxxxxx],securityGroups=[sg-xxxxxxxx],assignPublicIp=ENABLED}"

Task starts with a public IP but fails without one? Network confirmed. Set up VPC endpoints — don't ship this config to production.

Verification

Once the fix is in, rerun the task and watch the events:

aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks <task-arn> \
  --query 'tasks[0].{Status:lastStatus,StoppedReason:stoppedReason,Containers:containers[*].{Name:name,Reason:reason}}'

A clean pull moves the task through PENDING → ACTIVATING → RUNNING. Still failing? Check stoppedReason — it often contains more detail than the ECS console surfaces.
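
Rather than rerunning describe-tasks by hand, the status can be polled until it settles. A minimal sketch: wait_for_task, MAX_POLLS, and POLL_INTERVAL are hypothetical names, and the AWS variable is there so a stub can stand in for the real CLI.

```shell
# Hypothetical helper: poll a task until it reaches RUNNING or STOPPED.
# $1 = cluster, $2 = task ARN or ID.
# Exit status: 0 = RUNNING, 1 = STOPPED, 2 = gave up after MAX_POLLS tries.
# Runs in a subshell, so "exit" ends the function, not your shell.
wait_for_task() (
  i=0
  status=unknown
  while [ "$i" -lt "${MAX_POLLS:-30}" ]; do
    status=$(${AWS:-aws} ecs describe-tasks --cluster "$1" --tasks "$2" \
               --query 'tasks[0].lastStatus' --output text)
    case "$status" in
      RUNNING) echo "RUNNING"; exit 0 ;;
      STOPPED) echo "STOPPED"; exit 1 ;;
    esac
    sleep "${POLL_INTERVAL:-5}"
    i=$((i + 1))
  done
  echo "gave up in state: $status"
  exit 2
)

# Usage (needs AWS credentials):
#   wait_for_task my-cluster <task-arn> || echo "check stoppedReason"
```

The CLI also has a built-in waiter, aws ecs wait tasks-running, if you only care about the success case.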

Once it's running, tail the logs:

aws logs tail /ecs/my-task --follow

Lessons learned

Always bootstrap ECS task execution roles from AmazonECSTaskExecutionRolePolicy — don't write the policy from scratch. You'll miss something.
When standing up ECS in a new VPC, provision the ECR/S3/Logs VPC endpoints before deploying any tasks. Retrofitting them while a deployment is broken is painful.
Read the tail of the error. "RequestError: send request failed" after three retries points to a network timeout, not an auth rejection; an IAM denial typically shows up as AccessDeniedException instead. Both arrive under the same ResourceInitializationError prefix, so check IAM and network together.
The task execution role and the task role serve different purposes. Developers routinely configure one and forget the other exists.
Secrets Manager references in task definitions require secretsmanager:GetSecretValue on the execution role. Running in a private subnet? You'll also need a Secrets Manager VPC endpoint — it's a separate endpoint from ECR.
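
The point about telling a timeout from an auth rejection can be roughed out as a triage helper. The patterns below are heuristics based on commonly seen messages, not an exhaustive list, and classify_stopped_reason is a hypothetical name.

```shell
# Hypothetical helper: guess the failure class from a stoppedReason string.
# Heuristic only: "iam" for access-denied wording, "network" for request
# failures and timeouts, "unknown" for everything else.
classify_stopped_reason() {
  case "$1" in
    *AccessDenied*|*"not authorized"*)   echo "iam" ;;
    *"send request failed"*|*timeout*)   echo "network" ;;
    *)                                   echo "unknown" ;;
  esac
}

# Usage (needs AWS credentials):
#   classify_stopped_reason "$(aws ecs describe-tasks --cluster my-cluster \
#       --tasks <task-arn> --query 'tasks[0].stoppedReason' --output text)"
```

Treat "unknown" as a prompt to read the full message, not as a clean bill of health.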