Why Isn’t AWS More Transparent About Outages?
A common complaint about AWS is that while it usually provides detailed explanations of why an outage occurred days later, it often doesn’t communicate until hours after the outage has begun. Sometimes, alleged outages occur without Amazon providing any explanation whatsoever. AWS customers invest a lot into AWS’s serverless compute platform, often paying a premium for the tools provided and the convenience of a full-featured platform. But downtime is frustrating, and AWS’s clients must face the wrath of their own customers if services go down. So why isn’t AWS more transparent about outages? We’ve got a few ideas.
The Illusion of the Cloud and Serverless
The cloud isn’t magic. It’s literally hosting and running your app or site on someone else’s computer. And serverless isn’t really a thing: servers are still needed to run a site, even when it is billed like a utility rather than at a fixed monthly price. But ever since cloud computing took off, there has been a mysticism about it among IT professionals and users alike. Even if we intuitively know that Amazon is running our apps on servers similar to those a provider like ZebraHost would use, or those a company could co-locate itself, we like to believe a Fortune 500 tech giant like Amazon has some sort of magic sauce unavailable to anyone else except maybe Microsoft, Alibaba, or Google.
And think about it. If somewhere in your head you believe in a magic cloud that can never fail and will always keep your data safe, wouldn’t it shatter that illusion if you knew about every outage the moment it happened? This may be part of Amazon’s logic in not disclosing outages sooner, and in only disclosing big outages that affect large companies. If everyone believes the cloud has 100% uptime, or that outages only happen every couple of years, the illusion is maintained.
Support Crowd Control
At ZebraHost, if clients notice an outage before we do, we typically see a large influx of tickets. If a site is down, we don’t expect our clients to tolerate it or wait for an answer; we get in contact with the customer and fix it. Sometimes the problem is isolated to a single customer, and other times it is more widespread. But ZebraHost is small compared to AWS.
If AWS sounded the alarm for its millions of customers the moment an outage happened, even a small and isolated one, the response would likely overwhelm the support team. Many customers don’t receive support at AWS unless they pay for higher tiers. By relegating most users to the forums or to places like Reddit (the r/aws thread listed under Resources is one example), AWS can cut down on potential panic and keep its support team from being overwhelmed. This gives AWS time to research and write a sophisticated mass response about the issue days later.
Cost Cutting
Everyone likes to cut costs where it makes sense, and alerting customers early about issues can increase costs. AWS would likely have to employ more customer relations specialists (support) to communicate with clients earlier. AWS would also pay in time diverted from other departments, since more people would need to be consulted on communications and approve statements. Putting out one grand statement towards the end, and minimizing communication that could potentially create panic, can be seen as a cost-saving measure.
They May Not Even Know About It
AWS is a bear of a platform. Managing thousands of servers, thousands of services, and millions of customers is no doubt a challenge. There are times at ZebraHost when we don’t know about an outage until a customer informs us, and AWS very likely has the same experience. AWS also only provides dedicated support to large paying customers, so the percentage of customers that have a means to get hold of AWS when an issue occurs is very small. On top of that, many customers with direct support are big companies in their own right that likely need to test their own systems before they can determine that the issue lies with AWS infrastructure. Together, this means there is likely a very long delay between systems going down and people becoming informed.
AWS Assumes Customers Won’t Remember Outages
Think about a time you’ve screwed up. Wouldn’t you want everyone to just forget, and only remember the good times when your site or app is up? There’s a chance AWS hopes that’s how it goes. AWS got off to a rocky start and has certainly reduced its downtime since. But outages do happen, and because AWS runs a large portion of the biggest sites on the internet, people usually find out in the moment.
AWS, like all cloud providers, does experience downtime. It typically publishes detailed explanations of what the problem was and vows to implement solutions to make sure it doesn’t happen again. A great example is the Kinesis outage that happened not long ago. AWS was transparent about what occurred, but service took about 17 hours to fully restore.
How to Prepare for an AWS Outage
As AWS has expanded, it has increased redundancy, added regions, and found resolutions for major outage-causing problems. But just as with every cloud provider, outages can (and will) happen. Here’s some advice on LinkedIn about how larger businesses that can afford redundancy (such as the author’s example, Reddit) should take extra steps to avoid downtime in the case of an AWS outage.
There are two main ways to prepare for an AWS outage. The first is to choose highly available storage tiers in the event an instance goes down. The second is to make sure you are purchasing plans that include geo-redundancy, so that in the (highly unlikely) event that an entire data center goes down, you are prepared.
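To make the geo-redundancy idea concrete, here is a minimal Python sketch of client-side failover between two copies of the same service. It is only an illustration under stated assumptions, not how AWS or any particular tool does it: the function names `http_probe` and `pick_endpoint` are made up for this example, and it assumes you already run your app in more than one location and expose a health URL at each.

```python
import http.client
import urllib.request
from typing import Callable, Optional, Sequence


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # URLError, HTTPError, and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def pick_endpoint(
    endpoints: Sequence[str],
    probe: Callable[[str], bool] = http_probe,
) -> Optional[str]:
    """Return the first endpoint that passes the probe, in priority order.

    Returns None when every endpoint fails, so the caller can decide
    what to do (show a maintenance page, alert someone, and so on).
    """
    for url in endpoints:
        if probe(url):
            return url
    return None
```

You would call `pick_endpoint` with your own health URLs listed in priority order, for example a primary region first and a secondary region second. In practice most teams push this decision into DNS or a load balancer rather than application code, but the logic is the same: check the primary, and if it does not answer, send traffic to the copy somewhere else.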
Conclusion
There is a reason that AWS has become the leading hyperscaler. It offers a platform as a service that has every tool needed to build an app and host it from start to finish. But due to the sheer size of AWS and, frankly, the level of effort AWS feels it needs to put in to keep each individual customer, transparency leaves much to be desired. It is best to prepare yourself for AWS outages so that even if Amazon doesn’t inform you about the issue and resolve it quickly, you minimize your risk of downtime.
Resources
AWS Says It’s Never Seen a Whole Data Center Go Down | Data Center Knowledge
AWS Apologizes For Cloud Outage From Amazon Kinesis (crn.com)
Amazon: Here’s what caused the major AWS outage last week | ZDNet
AWS Support Plans (amazon.com)
AWS Service Health Dashboard - Jun 18, 2021 PDT (amazon.com)
Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region
AWS Network outage? : aws (reddit.com)
Reddit Didn’t Need To Have An Outage When AWS Went Down | LinkedIn