Handling Unexpected AWS IAM Changes

The cloud is tricky! You might think that the rules determining which IAM permissions are required for which actions will keep applying the same way over time. You might think they’d apply the same way across different AWS accounts. Or that, if neither of those things is true, AWS will at least let you know. (I did.) You’d be wrong!

Coiled operates in our users’ AWS (or Google Cloud) accounts, with IAM permissions they assign to us. We have a CLI that lets users get set up by assigning us a minimal set of IAM permissions to do stuff in their account (like run EC2 instances), and we spin up Dask clusters when they ask us to. Or they can set things up manually by creating roles with the permissions we document needing.
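For a rough sense of what that grant looks like, here’s a minimal sketch using boto3 to attach an inline policy to the IAM user Coiled acts as. The action list, user name, and policy name are illustrative placeholders, not our actual documented permission set:

import json
import boto3

# Hypothetical, heavily abbreviated policy -- the real documented list is longer.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:RunInstances",      # launch cluster VMs
                "logs:CreateLogGroup",   # create the cluster log group
                "ecr:CreateRepository",  # hold software environment images
            ],
            "Resource": "*",
        }
    ],
}

iam = boto3.client("iam")
iam.put_user_policy(
    UserName="coiled",                   # placeholder IAM user name
    PolicyName="coiled-example-policy",  # placeholder policy name
    PolicyDocument=json.dumps(policy),
)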

We also “tag” the resources we create, to help our users track down what Coiled is up to (for cost tracking or other auditing).
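As a sketch of what that tagging looks like when launching instances with boto3 (the tag keys, AMI, and instance type here are placeholders, not necessarily what Coiled actually uses):

import boto3

ec2 = boto3.client("ec2")

# Tag instances at creation time so they're attributable for cost tracking
# and auditing (illustrative values throughout).
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t3.medium",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[
        {
            "ResourceType": "instance",
            "Tags": [
                {"Key": "owner", "Value": "coiled"},
                {"Key": "coiled-cluster-id", "Value": "example-cluster"},
            ],
        }
    ],
)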

Someone seems to be on a rampage to fix some bugs in how IAM permissions apply to tags, and it’s been a bit rough on some of our users.

Act 1: a logs:TagLogGroup problem in November

When a user runs coiled setup aws, we create a CloudWatch log group in their account, where we’ll send the logs from their Dask cluster.
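Concretely, that creation is roughly the following boto3 call (the log group name and tag are placeholders). Note that it creates and tags the log group in a single request, which is the path that turned out to matter:

import boto3

logs = boto3.client("logs")

# Create the log group and tag it in one call. This "create with tags" path
# is what started requiring extra permissions beyond logs:CreateLogGroup.
logs.create_log_group(
    logGroupName="coiled-example",  # placeholder name
    tags={"owner": "coiled"},       # placeholder tag
)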

Back in November, a new user in a small AWS region reported a problem setting up Coiled:

User with account ID: <redacted> is not authorized
to perform CreateLogGroup with Tags.

We’d never seen this before and couldn’t reproduce it in any of our larger regions, but we could reproduce it in the user’s region.

This was a bit of a mystery, but not that much of one. I guess someone realized that they hadn’t been requiring the logs:TagLogGroup permission to create a log group with tags (only to tag an existing log group?), and fixed it. Maybe the “fix” appeared first in this small, alphabetically-early region, and broke us there.

We added logs:TagLogGroup to the list of permissions we ask users to give us, and all was well.
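Concretely, the relevant statement in the policy we ask for grew by one action, something like this sketch (resource scoping omitted):

# Sketch of the updated log-group statement in the policy we ask users to grant.
log_group_statement = {
    "Effect": "Allow",
    "Action": [
        "logs:CreateLogGroup",
        "logs:TagLogGroup",  # newly required to create a log group with tags
    ],
    "Resource": "*",
}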

Act 2: a warning for January!

On December 12, I received an email (well, many copies of it, in several languages) that seemed related:

As of October 30, 2022, tagging is supported for the “Destination” resource. Previously, CloudWatch Logs supported tagging only for the “Log Group” resource. We recommend that, for your IAM policies that are used to access the CreateLogGroup API, you add logs:TagResource permission to your IAM policies by January 31, 2023. The new logs:TagResource permission will not be required for existing accounts that previously used CreateLogGroup API with tags.

In order to tag new log groups using the CreateLogGroup API, we recommend you add logs:TagResource permission to your IAM policies …

At first I read this as a belated warning about the behavior change from November. But no, it’s actually about another change: going from needing logs:TagLogGroup to needing logs:TagResource. Thanks for the warning! One like it would have been nice back in November. (I still haven’t seen any acknowledgement of the behavior change that started requiring logs:TagLogGroup.)

Act 3: an ecr:TagResource problem in December

This morning a new user reported another IAM problem. They were trying to create a Coiled “software environment” (just a Docker image we build and push to ECR for them), and saw:

botocore.exceptions.ClientError: An error occurred (AccessDeniedException)
when calling the CreateRepository operation: User: arn:aws:iam::<redacted>:user/coiled
is not authorized to perform: ecr:TagResource on resource: <redacted>
because no identity-based policy allows the ecr:TagResource action

I quickly reproduced the problem and helped the user past it by granting us the ecr:TagResource permission. We’ve seen this before! It seems like AWS should have been requiring that permission all along, and broke us by starting to enforce it now.
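The failing call is the ECR analogue of Act 1: creating and tagging the repository in a single request. A boto3 sketch (repository name and tag are placeholders):

import boto3

ecr = boto3.client("ecr")

# In an account that has never created a tagged ECR repository, this now fails
# with AccessDeniedException unless the caller also has ecr:TagResource.
ecr.create_repository(
    repositoryName="coiled/example-software-env",  # placeholder name
    tags=[{"Key": "owner", "Value": "coiled"}],    # placeholder tag
)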

My next question was: Why isn’t this affecting more users?

Assisted by the hint in AWS’s December 12th email, I realized that accounts which had previously created a tagged ECR repo might be unaffected. After confirming this hypothesis (my first repro was lucky to use a disposable account, which had never created an ECR repo, or at least not in that region), I exited “emergency” mode, since I now knew that existing users’ workflows weren’t at risk.

I haven’t found any public discussion of this change.

We deployed a fix and set up notifications to alert us if this happens again, so we can help affected users individually (and it might happen again: a few users set up Coiled with the old permissions but haven’t yet tried to create a software environment).

Act 4: Test in new accounts…?

It seems like AWS’s strategy is to make these changes in a way that will only affect accounts which haven’t used those permissions before.

That keeps these changes from breaking existing users, and helps folks choosing a collection of IAM permissions for the first time. But we’re asking for a fixed collection of permissions from AWS accounts that may have never taken any of these actions before.

We already do testing in accounts that are wiped clean nightly, but that’s not enough to catch these issues.

I don’t know how much more of this to expect (or how much will be announced to us ahead of time). If this is going to keep happening, it would probably pay off to set up a process to programmatically create completely brand new accounts in which to test Coiled.
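One way to do that is with AWS Organizations, sketched below with boto3. The account name and email are placeholders, and this assumes the test harness runs with Organizations permissions in a management account:

import time
import boto3

orgs = boto3.client("organizations")

# Kick off creation of a brand-new member account (asynchronous).
status = orgs.create_account(
    Email="coiled-iam-test@example.com",  # placeholder address
    AccountName="coiled-iam-canary",      # placeholder name
)["CreateAccountStatus"]

# Poll until the account exists, then run the normal `coiled setup aws`
# flow against it to catch newly required permissions.
while status["State"] == "IN_PROGRESS":
    time.sleep(10)
    status = orgs.describe_create_account_status(
        CreateAccountRequestId=status["Id"]
    )["CreateAccountStatus"]

assert status["State"] == "SUCCEEDED", status.get("FailureReason")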

Obligatory pitch: Use Coiled!

At Coiled we love these absurd details, and we love dealing with them so you don’t have to. We pay close attention to strange cloud behaviors and have fixes in place before most (or often any) users are aware.