Fix CloudFormation "The stack is in an UPDATE_ROLLBACK_FAILED state and can not be updated"

The Error

The stack is in an UPDATE_ROLLBACK_FAILED state and can not be updated.

You pushed a CloudFormation update. It failed midway. CloudFormation tried to roll back, and that rollback also failed. Now the stack is frozen: you can't update it, and every deployment attempt throws the same error. Either production is partially broken, or you're staring at a half-updated stack at 2 AM wondering what to do next.

Why This Happens

When a stack update fails, CloudFormation automatically attempts rollback. That rollback can itself fail when:

A resource was manually modified outside of CloudFormation — the stack wants to restore it, but the state no longer matches
A resource was deleted outside the stack before rollback tried to restore it
An IAM permission was removed mid-operation, leaving CloudFormation unable to revert a resource
A dependent resource in another service is in a transitional state (e.g., an RDS instance still in modifying status)

The root issue is always the same: CloudFormation cannot safely restore the stack to its previous state without your help.

Step-by-Step Fix

Step 1 — Identify the Failing Resource

Start by pinpointing which resource caused the rollback to fail. Open the AWS Console → CloudFormation → your stack → Events tab. The stack itself shows UPDATE_ROLLBACK_FAILED; the resources that blocked the rollback show UPDATE_FAILED just before it. Find those events and note the Logical ID of each resource.

Prefer the CLI? Run this:

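aws cloudformation describe-stack-events \
  --stack-name YOUR_STACK_NAME \
  --query 'StackEvents[?ResourceStatus==`UPDATE_FAILED`].[LogicalResourceId,ResourceStatusReason]' \
  --output table

Resources that block a rollback are recorded as UPDATE_FAILED; UPDATE_ROLLBACK_FAILED appears only on the stack's own event. This gives you each failed resource and the reason it failed, newest first, including the failure that triggered the rollback in the first place. Typical output looks like: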

MyBucket   | Resource does not exist
MyFunction | IAM role not found
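
If you want the logical IDs as a plain list to feed into the next step, something like the following may help. It is a minimal sketch: the function name failed_resources is made up for illustration, and it assumes the stack's event history contains no unrelated UPDATE_FAILED events from older deployments, so check the list before using it.

```shell
# Hypothetical helper (not part of the AWS CLI): print the unique logical IDs
# of resources that have an UPDATE_FAILED event in the stack's history.
# Usage: failed_resources YOUR_STACK_NAME
failed_resources() {
  aws cloudformation describe-stack-events \
    --stack-name "$1" \
    --query 'StackEvents[?ResourceStatus==`UPDATE_FAILED`].LogicalResourceId' \
    --output text |
    tr '\t' '\n' |   # text output is tab-separated; put one ID per line
    sed '/^$/d' |    # drop empty lines
    sort -u          # a resource can fail more than once; list it once
}
```

Old events stay in the history, so treat the result as a list of candidates rather than a final answer.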

Step 2 — Use ContinueUpdateRollback with Skip Resources

ContinueUpdateRollback resumes the stalled rollback. The trick is the --resources-to-skip flag — it tells CloudFormation to skip over the broken resource and let everything else roll back cleanly.

aws cloudformation continue-update-rollback \
  --stack-name YOUR_STACK_NAME \
  --resources-to-skip LogicalResourceId1 LogicalResourceId2

Use the logical ID(s) from Step 1. Multiple resources are space-separated.

Real-world example:

aws cloudformation continue-update-rollback \
  --stack-name my-prod-stack \
  --resources-to-skip MyS3Bucket MyLambdaFunction

Not sure what to skip? Try running without the flag first:

aws cloudformation continue-update-rollback \
  --stack-name YOUR_STACK_NAME

This retries the rollback without skipping anything. It only works if you've already fixed the underlying issue, for example by restoring a manually deleted resource before running the command.
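
continue-update-rollback returns as soon as the request is accepted, so it can be handy to wait for the outcome in the same step. A minimal sketch, assuming a hypothetical wrapper named resume_rollback; the wait stack-rollback-complete subcommand is the AWS CLI's built-in waiter for UPDATE_ROLLBACK_COMPLETE.

```shell
# Hypothetical wrapper: resume the rollback, optionally skipping resources,
# then block until the stack reaches UPDATE_ROLLBACK_COMPLETE.
# Usage: resume_rollback YOUR_STACK_NAME [LogicalId ...]
resume_rollback() {
  stack="$1"
  shift
  if [ "$#" -gt 0 ]; then
    # Remaining arguments are the logical IDs to skip
    aws cloudformation continue-update-rollback \
      --stack-name "$stack" \
      --resources-to-skip "$@"
  else
    aws cloudformation continue-update-rollback \
      --stack-name "$stack"
  fi || return 1
  # The waiter exits non-zero if the rollback fails again or times out
  aws cloudformation wait stack-rollback-complete --stack-name "$stack"
}
```

For example, resume_rollback my-prod-stack MyS3Bucket MyLambdaFunction mirrors the real-world example above and returns a non-zero exit code if the rollback gets stuck again.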

Step 3 — Fix the Skipped Resources Manually

Skipping a resource doesn't restore it. CloudFormation just sets its status to UPDATE_COMPLETE and moves on. The resource's real state in AWS may be completely out of sync with your template, and reconciling it is now your job.

What to do depends on what happened:

Resource was deleted outside the stack: Recreate it manually in AWS, then bring it back under management with a resource import. Drift detection will report the mismatch, but it won't repair it.
Resource was modified outside the stack: Either revert the manual change, or update your CloudFormation template to reflect the current real-world state and push a fresh stack update.
Resource is no longer needed: Remove it from the template in your next update.
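
Before deciding which of these cases applies, it can help to see what CloudFormation currently believes about each skipped resource. A minimal sketch, assuming a hypothetical helper named show_resources; it prints the logical ID, type, physical ID, and recorded status so you can compare them with what actually exists in the account.

```shell
# Hypothetical helper: show CloudFormation's recorded view of each resource.
# Usage: show_resources YOUR_STACK_NAME LogicalId [LogicalId ...]
show_resources() {
  stack="$1"
  shift
  for id in "$@"; do
    aws cloudformation describe-stack-resource \
      --stack-name "$stack" \
      --logical-resource-id "$id" \
      --query 'StackResourceDetail.[LogicalResourceId,ResourceType,PhysicalResourceId,ResourceStatus]' \
      --output text
  done
}
```

A skipped resource will usually report UPDATE_COMPLETE here even if the physical resource is missing or changed, which is exactly the mismatch you need to reconcile.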

Step 4 — Check for Drift After Recovery

Once rollback completes, run drift detection before touching anything else:

aws cloudformation detect-stack-drift \
  --stack-name YOUR_STACK_NAME

Then poll the result, using the StackDriftDetectionId returned by the previous command (how long detection takes depends on how many resources the stack has):

aws cloudformation describe-stack-drift-detection-status \
  --stack-drift-detection-id DETECTION_ID

Reconcile any drifted resources before your next deploy. Skipping this step is how you end up in the same failure loop a week later.
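
The detect-then-poll sequence can be wrapped into one step. The sketch below is illustrative only: check_drift and the POLL_SECONDS variable are made-up names, and it lists only resources reported as MODIFIED or DELETED.

```shell
# Hypothetical helper: start drift detection, wait for it to finish,
# then list the resources that have drifted.
# Usage: check_drift YOUR_STACK_NAME
check_drift() {
  stack="$1"
  # detect-stack-drift returns the ID used to poll for the result
  id=$(aws cloudformation detect-stack-drift \
    --stack-name "$stack" \
    --query 'StackDriftDetectionId' --output text) || return 1
  while :; do
    status=$(aws cloudformation describe-stack-drift-detection-status \
      --stack-drift-detection-id "$id" \
      --query 'DetectionStatus' --output text) || return 1
    if [ "$status" != "DETECTION_IN_PROGRESS" ]; then
      break
    fi
    sleep "${POLL_SECONDS:-5}"   # POLL_SECONDS is an assumed tuning knob
  done
  # Anything other than DETECTION_COMPLETE means detection itself failed
  [ "$status" = "DETECTION_COMPLETE" ] || return 1
  aws cloudformation describe-stack-resource-drifts \
    --stack-name "$stack" \
    --stack-resource-drift-status-filters MODIFIED DELETED \
    --query 'StackResourceDrifts[].[LogicalResourceId,StackResourceDriftStatus]' \
    --output text
}
```

Empty output means no resource was reported as modified or deleted. Note that not every resource type supports drift detection.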

If ContinueUpdateRollback Keeps Failing

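Some stacks are too corrupted to recover incrementally. At that point, delete and redeploy. If the stack owns resources you want to keep (an S3 bucket, an RDS instance), protect them first. Resources that already have DeletionPolicy: Retain in the deployed template survive the deletion; for anything else, take a snapshot or backup before you delete. The --retain-resources flag is accepted only once the stack is in DELETE_FAILED state, so it helps when a first delete attempt gets stuck on a resource:

# After a failed delete: keep these resources and remove the rest of the stack
aws cloudformation delete-stack \
  --stack-name YOUR_STACK_NAME \
  --retain-resources MyS3Bucket MyRDSInstance

Resources listed in --retain-resources are left in place and decoupled from the stack. Redeploy from a clean template, then import the retained resources into the new stack.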

Verify the Fix

After ContinueUpdateRollback runs, confirm the stack status:

aws cloudformation describe-stacks \
  --stack-name YOUR_STACK_NAME \
  --query 'Stacks[0].StackStatus'

UPDATE_ROLLBACK_COMPLETE means success — the stack is back in its stable pre-update state and ready for a new deployment.

Still showing UPDATE_ROLLBACK_FAILED? Go back to Step 1. There's likely a second resource in the events log that also needs skipping.
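
In a script, the same check can become a pass/fail gate. A minimal sketch, with rollback_ok as a made-up function name:

```shell
# Hypothetical helper: print the stack status and succeed only if the
# rollback finished cleanly.
# Usage: rollback_ok YOUR_STACK_NAME
rollback_ok() {
  status=$(aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query 'Stacks[0].StackStatus' --output text) || return 2
  echo "$status"
  [ "$status" = "UPDATE_ROLLBACK_COMPLETE" ]
}
```

For example, rollback_ok my-prod-stack || echo "still stuck, go back to Step 1" keeps a recovery script from moving on to a new deployment too early.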

Prevention Tips

Don't touch CloudFormation-managed resources in the Console. Manual changes are the single biggest cause of this error. If you need to test something, go through the stack.
Enable termination protection on production stacks: aws cloudformation update-termination-protection --stack-name YOUR_STACK --enable-termination-protection
Use change sets before every update. aws cloudformation create-change-set followed by describe-change-set shows you exactly what will change (replacements, deletions, modifications) before anything actually happens.
Add stack policies to block accidental replacement of stateful resources like RDS instances and S3 buckets.
Run detect-stack-drift in CI/CD before each deploy. Catching a drifted resource during a pipeline check takes minutes. Catching it during a failed rollback at 2 AM takes hours.
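
As an illustration of the change set tip, the sketch below creates a change set, waits for it, and prints the planned actions. The function name preview_changes and the preview-<timestamp> naming scheme are assumptions, and templates that create IAM resources would also need a --capabilities flag.

```shell
# Hypothetical helper: preview what an update would change without applying it.
# Usage: preview_changes YOUR_STACK_NAME path/to/template.yaml
preview_changes() {
  stack="$1"
  template="$2"
  name="preview-$(date +%s)"   # change set names must start with a letter
  aws cloudformation create-change-set \
    --stack-name "$stack" \
    --change-set-name "$name" \
    --template-body "file://$template" >/dev/null || return 1
  # The waiter also fails when the change set contains no changes
  aws cloudformation wait change-set-create-complete \
    --stack-name "$stack" \
    --change-set-name "$name" || return 1
  aws cloudformation describe-change-set \
    --stack-name "$stack" \
    --change-set-name "$name" \
    --query 'Changes[].ResourceChange.[Action,LogicalResourceId,Replacement]' \
    --output table
}
```

A Replacement value of True on a stateful resource is the signal to stop and rethink the update. The change set is not executed here, and the sketch leaves it behind, so delete it once you've reviewed it.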