Fingers should be pointed at AWS

The recent outage at Amazon Web Services, caused by the failure of something-or-other during storms in Virginia, has created yet another round of discussions about availability in the public cloud.

Update: The report from AWS on the cause and ramifications of the outage is here.

While there has been some of the usual commentary about how this outage reminds us of the risks of public cloud computing, there have been many articles and posts on how AWS customers are simply doing it wrong. The general consensus is that the applications that went down were architected incorrectly and should have been built with geographic redundancy in mind. I fully agree with that as a principle of cloud-based architectures and posted as much last year when there was another outage (and also when it was less fashionable to blame the customers).

Yes, you should build for better geographic redundancy if you need higher availability, but the availability of AWS is, quite frankly, not acceptable. The AWS SLA promises 99.95% uptime on EC2, and although they may technically be reaching that, or handing out measly 10% credits when they don't, anecdotally I don't believe that AWS is getting anywhere near it in US-East. 99.95% translates to 4.38 hours a year, or about 22 minutes a month, and I don't believe that they are meeting those targets. (If someone from AWS can provide a link with actual figures, I'll gladly update this post to reflect as much.) The x-nines measure of availability is all we have, even if it is a bit meaningless, and by business measures of availability (the application must be available when it is needed) AWS falls far short of expectations.
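As a quick back-of-the-envelope check of those numbers, here is a minimal sketch of the arithmetic; it is illustrative only and says nothing about how the actual AWS SLA defines its measurement windows:

```python
# Downtime budget implied by an availability percentage.
# Illustrative arithmetic only, not the AWS SLA's own measurement rules.

def allowed_downtime(sla_percent: float) -> dict:
    """Return the downtime allowance implied by an availability percentage."""
    unavailable_fraction = 1 - sla_percent / 100
    hours_per_year = 365.25 * 24
    return {
        "hours_per_year": unavailable_fraction * hours_per_year,
        "minutes_per_month": unavailable_fraction * hours_per_year * 60 / 12,
    }

if __name__ == "__main__":
    budget = allowed_downtime(99.95)
    print(f"99.95% allows ~{budget['hours_per_year']:.2f} hours/year "
          f"or ~{budget['minutes_per_month']:.0f} minutes/month")
    # Prints roughly 4.38 hours/year, or 22 minutes/month.
```

A couple of hours of downtime every few weeks blows through that budget many times over.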

I am all for using geographic replication/redundancy/resilience when you want to build an architecture that pushes towards 100% availability on top of lower-availability infrastructure, but it should not be required just to cope with infrastructure that has outages for a couple of hours every few weeks or months. Individual AWS fans defending AWS and pointing fingers at architectures that are not geographically distributed is to be expected, but an article on ZDNet calling AWS customers 'cheapskates' is a bit unfair to those customers. If AWS can't keep a data centre running when there is a power failure in the area, and can't remember to keep the generator filled with diesel (or whatever), blaming customers for single-building, single-zone architectures isn't the answer.

Yes, I know that there are availability zones, and applications that spanned availability zones may not have been affected, but building an application where data is distributed across multiple AZs is not trivial either. It also seems that, quite frequently, an outage in one AZ has an impact on the others (it overloads the EBS control plane, there is insufficient capacity in the healthy AZs, and so on), so the multi-AZ approach is a little risky too.
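To give a feel for why spanning AZs is not trivial, here is a minimal, purely illustrative sketch of the kind of write path involved. The stores, AZ names and replication worker are hypothetical stand-ins, not AWS services or APIs:

```python
import queue
import threading

# Illustrative only: write synchronously to a primary AZ, copy asynchronously
# to a standby AZ. The dicts and AZ names are hypothetical stand-ins.
replication_queue = queue.Queue()

def write(primary_store: dict, key: str, value: str) -> None:
    primary_store[key] = value            # synchronous write in, say, us-east-1a
    replication_queue.put((key, value))   # queue a copy for the standby AZ

def replication_worker(standby_store: dict) -> None:
    while True:
        key, value = replication_queue.get()
        try:
            standby_store[key] = value    # copy to, say, us-east-1b
        finally:
            replication_queue.task_done()

if __name__ == "__main__":
    primary, standby = {}, {}
    threading.Thread(target=replication_worker, args=(standby,), daemon=True).start()
    write(primary, "order:42", "accepted")
    replication_queue.join()
    print(primary, standby)
```

The hard parts this sketch glosses over are exactly the ones that bite during an outage: what happens when the standby AZ is unreachable, how far behind it is allowed to fall, and how you fail over reads without serving stale data.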

We application developers and architects get that running a highly available data centre is hard, but so is building a geographically distributed application. So we are expected to build these complicated architectures because the infrastructure is less stable than expected? Why should we (and the customers paying us) take on the extra effort and cost just because AWS is unreliable? How about this for an idea: fix AWS. Tear down US East and start again… or something. How is AWS making it easier to build geographically distributed applications? No, white papers aren't good enough. If you want your customers to wallpaper over the cracks in AWS, make services available that make geographic distribution easier: data synchronisation services, cross-region health monitoring and autoscaling, pub-sub messaging services, lower data egress costs to AWS data centres.
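This is the sort of thing customers end up hand-rolling today. A minimal, assumption-laden sketch of a cross-region health check with failover follows; the endpoint URLs are placeholders and the functions are hypothetical, not an AWS offering:

```python
import urllib.request

# Illustrative only: the kind of cross-region health check and failover logic
# customers currently write themselves. URLs are hypothetical placeholders.
REGIONS = {
    "us-east-1": "https://app-us-east.example.com/health",
    "eu-west-1": "https://app-eu-west.example.com/health",
}

def healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the region's health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False

def pick_region(preferred: str = "us-east-1") -> str:
    """Prefer the primary region; fall back to any other healthy one."""
    if healthy(REGIONS[preferred]):
        return preferred
    for region, url in REGIONS.items():
        if region != preferred and healthy(url):
            return region
    raise RuntimeError("no healthy region available")
```

Wiring this up properly (DNS or load-balancer switching, data synchronisation between the regions, avoiding flapping) is exactly the undifferentiated heavy lifting a platform provider should be taking off our hands.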

Regardless of how customers may feel, if you Google 'AWS outage' you get way, way too many results. This isn't good for anybody. It isn't good for people like me who are fans of the public cloud, it isn't good for AWS obviously, and it isn't even good for AWS competitors (who are seen as inferior to AWS). If I see another AWS outage in the next few months, in any region, for any reason, I will be seriously fucking pissed off.

Simon Munro

@simonmunro
