Getting Real About Distributed System Reliability, by Jay Kreps, is an interesting post about the perception that distributed systems (and distributed databases) increase reliability because they are horizontally scalable. The flaw in that reasoning, he points out, is ‘the assumption that failures are independent’.
His observation is that failures tend to come from bugs in the software (or in the homogeneous infrastructure), which hit every node at once, so adding redundant nodes does not reduce the likelihood of failure by much. We see this continuously with cloud outages: the recent leap day bug that crashed Windows Azure is a good example.
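To make the independence point concrete, here is a rough back-of-the-envelope sketch in Python. It is my own illustration, not something from Jay's post: the function names and the probabilities (a 1% chance that any single node is down, a 0.1% chance of a shared bug that takes out every node together) are made-up assumptions, and the model is deliberately simplistic.

```python
def p_outage_independent(p_node: float, n: int) -> float:
    """Chance that all n replicas are down at once, assuming failures are independent."""
    return p_node ** n


def p_outage_with_common_fault(p_node: float, n: int, p_common: float) -> float:
    """Chance of a full outage when a shared fault (e.g. a software bug present
    on every node) can take out all replicas together with probability p_common.
    Otherwise nodes fail independently, as above."""
    return p_common + (1.0 - p_common) * p_node ** n


if __name__ == "__main__":
    P_NODE = 0.01     # assumed: chance a single node is down (illustrative only)
    P_COMMON = 0.001  # assumed: chance of a common-mode bug (illustrative only)
    for n in (1, 2, 3, 5):
        independent = p_outage_independent(P_NODE, n)
        correlated = p_outage_with_common_fault(P_NODE, n, P_COMMON)
        print(f"nodes={n}  independent={independent:.2e}  with common fault={correlated:.2e}")
```

Under the independence assumption every extra node multiplies the outage probability by another factor of p_node. Once a common fault is in the picture, the outage probability can never drop below p_common, no matter how many nodes are added, which is why improving quality (shrinking p_common) matters more than adding redundancy.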
I have been doing some work on availability recently, and for me the first influence on availability is quality, followed by fault tolerance (resilience). Redundancy is relevant at the hardware level and matters more for scalability than for availability. So yes, for availability: quality first, then resilience, with redundancy near the bottom of the list.
I have also been doing work on cloud operations and was intrigued to see that his post highlights operations, not architecture or design, as the core difficulty. I think he is downplaying architecture, but the ability to operate a complex (distributed) system is a big part of keeping it running. He singles out AWS's DynamoDB:
“This is why people should be excited about things like Amazon’s DynamoDB. When DynamoDB was released, the company DataStax that supports and leads development on Cassandra released a feature comparison checklist. The checklist was unfair in many ways (as these kinds of vendor comparisons usually are), but the biggest thing missing in the comparison is that you don’t run DynamoDB, Amazon does. That is a huge, huge difference. Amazon is good at this stuff, and has shown that they can (usually) support massively multi-tenant operations with reasonable SLAs, in practice.”
I tend to agree with that. Rolling your own highly available platform is going to be hard, and providers of cloud services, such as Amazon or Microsoft, have more mature operational processes for keeping things available. It also casts a shadow over self-operated cloud platforms (such as CloudFoundry), which have all of the bugs and none of the operational chops needed to keep availability high.
Go and read Jay’s post. It is required reading for people building cloud applications.