One Engineer’s Simple Mistake…

After 20 years of working with production systems for major worldwide brands, I find that stories like this really hit home.

In case you missed it, many major websites (including one client I work with) were knocked offline on Tuesday when Amazon S3, a storage service similar to Azure Blob Storage, went down in its main US region for about four hours.

The incident highlighted two things that we architects and cloud developers really need to pay attention to.

One is, of course, our increasing dependence on technology outside of our direct control as we move to the cloud. The ops teams at some billion-dollar companies could not do anything once the images stopped loading on their company websites, because they had trusted that Amazon S3 would always be available. Most did not have a backup copy of their assets in another hosting location that they could switch to temporarily. Relying on Amazon makes sense when Amazon is much better at keeping servers available than you are, but it is frustrating in the moment when something like this happens, because there’s not much you can do quickly.

And two is that so many websites do not properly use multi-region capabilities for maximum availability. For whatever reason, most companies did not have backup buckets in other regions that they could manually or automatically swap to when the main region went down. For companies using the cloud, this was a failure of application architecture.
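
To make that swap a little more concrete, here is a minimal sketch in Python of choosing between a primary and a backup asset location. Everything in it is an assumption for illustration: the hostnames are placeholders, and `pick_asset_base`, `http_probe`, and the `/health.txt` object are hypothetical names, not part of any AWS API. It also assumes the backup location already holds an up-to-date copy of your assets, which is the harder part in practice.

```python
from typing import Callable, Iterable, Optional
import urllib.request

# Hypothetical asset locations, in order of preference. These hostnames are
# placeholders for illustration, not real buckets.
ASSET_BASES = [
    "https://assets-primary.example.com",
    "https://assets-backup.example.com",
]


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a HEAD request to `url` comes back with a 2xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def pick_asset_base(
    bases: Iterable[str],
    probe: Callable[[str], bool] = http_probe,
    health_path: str = "/health.txt",  # assumed small object stored in every location
) -> Optional[str]:
    """Return the first base URL whose health object is reachable, else None."""
    for base in bases:
        if probe(base + health_path):
            return base
    return None


def asset_url(base: str, key: str) -> str:
    """Join a base URL and an asset key with exactly one slash between them."""
    return f"{base.rstrip('/')}/{key.lstrip('/')}"
```

An application could call `pick_asset_base(ASSET_BASES)` periodically, cache the result, and build its image links with `asset_url`. The probe is passed in as a parameter so the selection logic can be tested without touching the network. DNS-level failover is another way to get the same effect without changing application code.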

Amazon, of course, is going to bear the brunt of the responsibility for the downtime. But companies that use Amazon S3 could and should do more to keep their applications running when bad things happen.

The sad part of the story, for me, is the single engineer who is responsible for the error through a typo of some kind.

“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

Man. Bringing some percentage of the Internet down by mistyping a parameter on the command line. Poor guy. That could be any one of us.