Unless you were completely off the grid on February 28th, 2017, you likely noticed that Amazon S3 suffered a big outage that affected pretty much all AWS services. It happened in AWS' biggest region, N. Virginia, and it impacted a big portion of the internet.
Here are some key takeaways for the rest of us, as application and business owners, which will help us build reliable systems on AWS.
Good news first: all other AWS regions were fine.
This outage only affected the N. Virginia (us-east-1) region. For the whole duration of the incident, the other AWS regions were working just fine. There were 0 incidents reported in Ohio, Oregon or N. California. The same is true for Montreal and other regions in Europe and Asia. AWS has made a big effort to design its regions so they work in isolation and operate independently. At least in this event, AWS’ promise of delivering truly independent regions passed the test.
The AWS cloud is not perfect, but none of its alternatives are either.
So, say you have an application running on AWS. What are the alternatives? A server rack in your garage or office, hiring a hosting provider, running your own data center, running your own private cloud, or using another public cloud such as Azure, Google Cloud or SoftLayer.
It’s obviously a serious issue when you have a 4+ hour outage, and Amazon has published a summary of why this incident happened. But until now, S3 had a pretty decent track record regarding availability. When bad things happen in AWS, at least I know there is a team of experienced and talented engineers working 24/7 to solve any issue. I also know there is strong leadership that will ensure whatever went wrong doesn’t happen again.
The truth is, any alternative to AWS will fail eventually, and probably also for 4 hours or more.
AWS has obviously built a complex network of interdependencies among their services.
When I looked at all the AWS services that were affected by this S3 outage, it made me think: maybe AWS has taken the number of interdependencies among their services too far. Many AWS services depend on S3. For example, Amazon Machine Images, which are required for EC2 instance launches, are stored in S3. Many services use S3 to archive data or store configurations. EBS snapshots live in S3, and EMR has many integrations with S3. A service such as EC2 is used as the backbone for many other services, such as RDS, Lambda, EMR or ECS.
So, if S3 doesn’t work, then EC2 doesn’t work, and then a lot of other services stop working. This creates a bad domino effect that AWS has to somehow mitigate. There are other services, such as IAM or DynamoDB, that are essential dependencies for other AWS services and could cause a similar situation in the future.
There will always be human error, even as you automate procedures.
As Amazon explained, the whole issue was caused by someone executing a command with an incorrect parameter. Even though it’s clear that Amazon uses automated tools to run operational procedures, those tools still need to follow instructions from a human. And humans make mistakes. Humans also build automation tools, and can make mistakes when building them.
My point is, automated tools will save you time and make your team more efficient. They will reduce the possibility of human error, but will not eliminate it. That’s why we always have to be prepared for failure and design our systems accordingly.
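Amazon’s own summary mentions adding safety checks so their tooling removes capacity more slowly and refuses to take too much offline at once. As a hedged sketch of that idea (the function and limit here are hypothetical, not AWS’s actual tooling), a guardrail on a destructive command might look like:

```python
# Illustrative guardrail: refuse to remove more than a safe fraction of
# capacity in a single command. All names and limits are hypothetical.

MAX_REMOVAL_FRACTION = 0.05  # never take down more than 5% at once

def validate_removal(requested: int, total_capacity: int) -> int:
    """Validate a capacity-removal request before executing it."""
    if requested <= 0:
        raise ValueError("requested removal must be positive")
    limit = int(total_capacity * MAX_REMOVAL_FRACTION)
    if requested > limit:
        raise ValueError(
            f"refusing to remove {requested} of {total_capacity} servers; "
            f"limit is {limit} per command"
        )
    return requested
```

With a check like this, a fat-fingered parameter (say, 500 servers instead of 5) fails fast with an error instead of taking a large subsystem offline.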
You are responsible for delivering a good experience to your customers.
AWS has a commitment to you, as their customer. And you have a commitment to your customers. In other words, it’s also your responsibility to mitigate the effects of a bad AWS outage. Netflix, for example, was unaffected by the whole incident. Why? Because for many years, they’ve built backup and failover mechanisms in multiple AWS regions. If something goes wrong in one region, they can quickly route their traffic to a healthy one. Since Oregon and Ohio were just fine, most likely Netflix redirected traffic there and its customers didn’t notice a thing.
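The core of a failover mechanism like Netflix’s is simple to state: prefer your primary region, but route traffic to the next healthy one when it fails. A minimal sketch of that decision logic, assuming you already replicate data to the standby regions (the health checks here are placeholders you would back with real probes):

```python
# Minimal region-failover selection. Region names are real AWS regions;
# the health data is a placeholder for real health-check results.

REGIONS = ["us-east-1", "us-west-2", "us-east-2"]  # primary first

def pick_region(health):
    """Return the first healthy region in preference order."""
    for region in REGIONS:
        if health.get(region, False):
            return region
    raise RuntimeError("no healthy region available")
```

During the S3 incident, us-east-1 would report unhealthy and traffic would shift to us-west-2 (Oregon), which was unaffected. In practice you would delegate this routing to something like Route53 DNS failover rather than application code.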
My point is, there are things you can do to mitigate the impact of large-scale events like this one.
Disaster Recovery and Failover mechanisms are expensive. They require you to duplicate your data and some infrastructure in at least one additional region. You also need to build, test and maintain automatic backup and failover mechanisms. This not only will increase your AWS bill significantly, but it will also require a lot of effort from your team. That’s engineering time that you could invest in other things.
So, how do you decide what to do? Think about how much an outage would cost your business. Quantify your business flows and the revenue you would lose as a result of 30 minutes, 1 hour, 4 hours or 8 hours of downtime. Also think about customer dissatisfaction and how much it would cost you.
Then estimate the cost of additional EC2 servers, RDS instances, DynamoDB tables or S3 buckets in a backup region. Think about regular data transfer costs from region A to region B and the additional infrastructure you’ll have to build for failover automation. See if there are AWS built-in mechanisms to help with multi-region backups (e.g. RDS cross-region read replicas, S3 Events, Route53 DNS failover, or DynamoDB Streams, to name a few). These are options that can save you a lot of time. Calculate the total cost of the required engineering time too.
Once you have a number, just do the math and see if there are strategic areas where it makes sense to invest in multi-region coverage.
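As a back-of-the-envelope sketch of that math (every figure below is a made-up placeholder; plug in your own numbers):

```python
# Hypothetical comparison: expected yearly outage loss vs. yearly
# failover cost. All numbers are illustrative placeholders.

revenue_per_hour = 10_000           # USD lost per hour of downtime
expected_outage_hours_per_year = 4  # e.g. one 4-hour event like this one
expected_loss = revenue_per_hour * expected_outage_hours_per_year

standby_infra_per_year = 18_000  # duplicated EC2/RDS/S3 in a second region
cross_region_transfer = 3_000    # replicating data from region A to B
engineering_time = 15_000        # building and testing failover automation
failover_cost = standby_infra_per_year + cross_region_transfer + engineering_time

# 40,000 in expected losses vs. 36,000 in failover cost: in this made-up
# scenario, multi-region coverage is at least worth a serious look.
print(expected_loss, failover_cost)
```

With real numbers, the answer often differs per business flow, which is why it pays to run this calculation for each strategic area separately rather than for the whole system at once.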
Communicate with your customers.
Even if you decide not to implement a multi-region failover mechanism, the least you can do is stay on top of customer communications. Put metrics in place that warn you in advance when things are not looking good, and be ready to let your customers know when the situation affects them.
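One lightweight way to know in advance is to track your error rate against a threshold and treat crossing it as the trigger to update your status page. A minimal sketch (the 2% threshold is illustrative, not a standard):

```python
# Minimal error-rate check: decide when it's time to notify customers.
# The 2% default threshold is an illustrative placeholder.

def should_alert(errors, requests, threshold=0.02):
    """Return True when the error rate over a window exceeds the threshold."""
    if requests == 0:
        return False
    return errors / requests > threshold
```

For example, 150 failed calls out of 4,000 in the last window is a 3.75% error rate, well over the 2% threshold, so it’s time to post an update before the angry tweets arrive.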
A clear status page is great. Or use your mailing list and social media platforms. Do whatever works for you, but it’s extremely important that your customers know you’re aware of what’s happening and that you keep them informed. The last thing you want is to find out something’s wrong from angry tweets.
Preferably, make sure your status page is not hosted on AWS (directly or indirectly). For a period of time, AWS’ own status page didn’t work properly because it had a dependency on S3!
To conclude, this widespread event is a reminder that things can and will go wrong eventually, whether you use the AWS cloud or any other solution. But as business and application owners, there are things we can do. It all starts with making informed decisions and having a well-planned strategy for dealing with such events.