Archive for category Architecture

Free eBook on Designing Cloud Applications

Too often we see cloud projects fail, not because of the platforms or a lack of enthusiasm, but because of a general lack of skills in cloud computing principles and architectures. At the beginning of last year I looked at how to address this problem and realised that some guidance was needed on what is different about cloud applications and how to address those differences.

The result was a book that I wrote and published, “CALM – Cloud ALM with Microsoft Windows Azure”, which takes the approach that implementation teams know how to do development, they just don’t know how to do it for the cloud, and need to adopt cloud thinking into their existing development practices.

The “with Windows Azure” means that the book has been written with specific examples of how problems are solved with Windows Azure, but it is not necessarily a book about Windows Azure — it applies as much to AWS (except that you would have to figure out the applicable technologies yourself).

CALM works through a set of models and encourages filling in the detail of each model in order to come up with the design. The models include the lifecycle model, which looks at load and traffic over time, the availability model, the data model, the test model and others. In looking at the full breadth of ALM (not just development), some models apply to earlier stages (qualify and prove), and there are post-delivery models too, such as the deployment, health and operational models.

CALM is licensed as open source, which also means that it is free to download, read and use. It is available on GitHub, with pdf, mobi (Kindle) and raw html versions available for download. A print version of the book is also available for purchase on Lulu.

I encourage you to have a look at CALM, let others know about it, ask any questions, and give me some feedback on how it can be made better.


Simon Munro



AWS and high performance commodity

One of the primary influencers on cloud application architectures is the lack of high performance infrastructure — particularly infrastructure that satisfies the I/O demands of databases. Databases running on public cloud infrastructure have never had access to the custom-built high I/O infrastructure of their on-premise counterparts. This has led to the well-known idea that “SQL doesn’t scale”, and the rise of distributed databases has been on the back of the performance bottleneck of SQL. Ask any Oracle sales rep and they will tell you that SQL scales very well, and will point to an impressive list of references. The truth about SQL scalability is that it should rather be worded as ‘SQL doesn’t scale on commodity infrastructure’. There are enough stories of poor and unreliable performance of EBS-backed EC2 instances to lend credibility to that statement.

Given high performance infrastructure, dedicated network backbones, Fusion-IO cards on the bus, silly amounts of RAM, and other tweaks, SQL databases will run very well for most needs. The desire for running databases on commodity hardware comes largely down to cost (with some influence from availability). Why run your database on hardware that costs a million dollars, licences that cost about the same, and support agreements that cost even more, when you can run it on commodity hardware, with open-source software, for a fraction of the cost?

That’s all fine and well until high performance becomes commodity. When it does, cloud architectures can, and should, adapt. High performance services such as DynamoDB do change things, but such proprietary APIs won’t be universally accepted. The AWS announcement of the new High I/O EC2 Instance Type, which deals specifically with I/O performance by offering 10Gb ethernet and SSD-backed storage, makes high(er) performance I/O commodity.

How this impacts cloud application architectures will depend on the markets that use it. AWS talks specifically about the instances being ‘an exceptionally good host for NoSQL databases such as Cassandra and MongoDB’. That may be true, but there are not many applications that need that kind of performance from their distributed NoSQL databases — most run fine (for now) on the existing definition of commodity. I’m more interested to see how this matches up with AWS’s enterprise play. When migrating to the cloud, enterprises need good I/O to run their SQL databases (and other legacy software), and these instances at least make it possible to get closer to what is possible in on-premise data centres, at commodity prices. That, in turn, makes them ripe for accepting more of the cloud into their architectures.

The immediate architectural significance is small, after all, good cloud architects have assumed that better stuff would become commodity (@swardley’s kittens keep shouting that out), so the idea of being able to do more with less is built into existing approaches. The medium term market impact will be higher. IaaS competitors will be forced to bring their own high performance I/O plans forward as people start running benchmarks. Existing co-lo hosters are going to see one of their last competitive bastions (offering hand-assembled high performance infrastructure) broken and will struggle to differentiate themselves from the competition.

Down with latency! Up with IOPS! Bring on commodity performance!

Simon Munro



Feature Shaping

One of the key concepts in scalability is the ability to allow for service degradation when an application is under load. But service degradation can be difficult to explain (and relate back to the term), and ‘degrade’ has negative connotations.

The networking people overcame the bad press of degradation by calling it ‘traffic shaping’ or ‘packet shaping’. Traffic shaping, as we see it on the edge of the network on our home broadband connections, allows some data packets to be given a lower priority (such as online gaming) than others (such as web browsing). The idea is that a saturated network can handle the load by changing the profile or shape of priority traffic. Key to traffic shaping is that most users don’t notice that it is happening.

So, along a similar vein, I am starting to talk about feature shaping, which is the ability of an application, when under load, to shape the profile of features that get priority, or to shape the result into one that is less costly (in terms of resources) to produce. This is best explained by examples.

  • A popular post on High Scalability talked about how Farmville degraded services when under load by dropping some of the in-game features that required a lot of back-end processing — shaping the richness of in-game functionality.
  • Email confirmations can be delayed to reduce load. The deferred load can either be the generation of the email itself or the sending of the email.
  • Encoding of videos on Facebook is not immediate and is shaped by the capacity that is available for encoding. During peak usage, the feature will take longer.
  • A different search index that produces less accurate results, but for a lower cost, may be used during heavy load — shaping the search result.
  • Real-time analytics for personalised in-page advertising can be switched off when under load — shaping the adverts to those that are more general.

So my quick definition of feature shaping is:

  • Feature shaping allows some parts of an application to degrade their normal performance or accuracy service levels in response to load.
  • Feature shaping is not fault tolerance — it is not a mechanism to cope when all hell breaks loose.
  • Feature shaping is for exceptional behaviour; features should not be shaped under normal conditions.
  • Shaped features will generally be unnoticeable to most users. The application seems to behave as expected.
  • Feature shaping can be automated or manual.
  • Feature shaping can be applied differently to different sets of users at the same time (e.g. registered users don’t get features shaped).
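To make the definition concrete, below is a hypothetical sketch (every name in it is invented) of one way a feature might be shaped in code: under load, anonymous users get a cheaper search path, while registered users are exempt, per the last point above.

```python
# Hypothetical sketch of feature shaping. current_load() stands in for a
# real load signal (CPU, queue depth, response times).
import random

def current_load() -> float:
    return random.random()  # placeholder for a real measurement

def shaped(threshold, fallback):
    """Route to `fallback` when load exceeds `threshold`, unless the
    caller is in a set of users whose features are never shaped."""
    def decorator(feature):
        def wrapper(*args, registered=False, **kwargs):
            if not registered and current_load() > threshold:
                return fallback(*args, **kwargs)
            return feature(*args, **kwargs)
        return wrapper
    return decorator

def cheap_search(query):
    return f"approximate results for {query!r}"  # less accurate, lower cost

@shaped(threshold=0.8, fallback=cheap_search)
def search(query):
    return f"full results for {query!r}"  # accurate, resource-hungry

print(search("cloud computing"))                   # may be shaped under load
print(search("cloud computing", registered=True))  # never shaped
```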

So, does the terminology of feature shaping make sense to you?

Simon Munro


Qualifying and quantifying “spiky” and “bursty” workloads as candidates for Cloud

Enterprises are looking to migrate applications to the cloud. Enterprises with thousands of applications require a fast, consistent and repeatable process to identify which applications could stand to benefit. One of the benefits of cloud is that the on-demand elasticity of seemingly infinite resources suits “spiky” or “bursty” workloads. But as I mentioned in a recent post, people may have very different views on what constitutes “spiky”. This could yield very unpredictable results when trying to identify those cloud candidates.

It would be useful to have a consistent and repeatable way to determine whether an application was spiky.

We normally translate spiky and bursty to “high variability”, which is a good term to use as it points us to the statistical methods by which we can assess whether our utilization patterns match this description and hence benefit from the cloud.

Consider the four utilization profiles described below (the hourly numbers are in the table at the end of the post). The measure could equally be transactions per second – just don’t mix and match.

  1. “Very High” line shows a single, short-lived spike.
  2. “High” shows two slightly longer-lived spikes.
  3. “Mild” shows the same two spikes, less exaggerated and decaying to a non-zero utilization.
  4. “None” shows small fluctuations but no real spikes – essentially constant utilization.

2, 3 and 4 actually have the same average utilization of c. 25%. This means that they consume the same number of compute cycles during the day, irrespective of their pattern; however, we can see that to cater for “High” we actually need 3x the capacity required for “None” in order to service the peak – leaving resources underutilized most of the time. The curse of the data centre.

Clearly the Average utilization isn’t the whole picture. We need to look at the Standard Deviation to see into the distribution of utilization:

  • High = 42.5
  • Mild = 21.2
  • None = 5.0

Excellent, so the high standard deviation is starting to show which ones are variable and which are not. But what is the standard deviation of the spikiest load of all, “Very High”? Only 20.0? Much the same as the “Mild” line! The final step is to look at the Coefficient of Variation, which is the ratio of the Standard Deviation to the Mean:

  • Very High = 4.00
  • High = 1.59
  • Mild = 0.82
  • None = 0.20

@grapesfrog asked me to describe something like AWS Elastic MapReduce in these terms. Think of EMR as utilisation of {0,0,0,0,5000,0,0,0,0,0,0,0,0,0,0,0…} where 5000% utilisation is 50 machines at 100% utilisation. So if you used 50 machines @ 100% for one hour every day, your CV would be 4.8. If you used 50 machines for one hour every month, your CV would rise to 27.2.
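If you want to reproduce the arithmetic, here is a minimal Python sketch using the hourly figures from the table at the end of the post. The population standard deviation is what matches the quoted numbers; the EMR series is the daily one described above:

```python
from statistics import mean, pstdev

# Hourly utilization series, transcribed from the table below.
workloads = {
    "Very High": [0]*13 + [10, 100, 10] + [0]*8,
    "High":      [0]*6 + [10, 100, 100, 100, 10] + [0]*6
                 + [10, 100, 100, 100, 10] + [0]*2,
    "Mild":      [10]*5 + [15, 20, 50, 80, 50, 20] + [15]*6
                 + [20, 50, 80, 50, 20] + [15, 10],
    "None":      [30, 20] * 12,
    "EMR daily": [5000] + [0]*23,  # 50 machines @ 100% for one hour a day
}

for name, series in workloads.items():
    mu, sigma = mean(series), pstdev(series)
    print(f"{name:>9}: mean={mu:7.2f}  sd={sigma:7.2f}  cv={sigma/mu:5.2f}")
```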


Comparing the mean utilization allows us to compare the relative amount of resource used over a period of time. This showed that nearly all of them consumed the same amount of resources, with the very noticeable exception of the spikiest of them all, which actually consumed very little resource in total.

Comparing the coefficient of variation reveals the spikiest workload. In this example, the spikiest would require 3x the resources of the least spiky to service the demand BUT would actually consume only 20% of the resources consumed by the least! Sometimes this is the point: the spikiest loads require the largest amount of resources to be deployed but can actually consume the least.

Further work:

Our assessment could state that “any workload showing a CV > 0.5” is a candidate for cloud – revealing applications with spiky intra-day behaviour as well as the month-end classics.

Workloads that oscillate with a high frequency between extremes may show a CV > 0.5, but here we begin to trespass on topics such as deadbeat control within control theory, and will start to challenge the time resolution of cloud monitoring and control. I’ll leave it there for the time being though…


Hour                      Very High   High   Mild   None
1                                 0      0     10     30
2                                 0      0     10     20
3                                 0      0     10     30
4                                 0      0     10     20
5                                 0      0     10     30
6                                 0      0     15     20
7                                 0     10     20     30
8                                 0    100     50     20
9                                 0    100     80     30
10                                0    100     50     20
11                                0     10     20     30
12                                0      0     15     20
13                                0      0     15     30
14                               10      0     15     20
15                              100      0     15     30
16                               10      0     15     20
17                                0      0     15     30
18                                0     10     20     20
19                                0    100     50     30
20                                0    100     80     20
21                                0    100     50     30
22                                0     10     20     20
23                                0      0     15     30
24                                0      0     10     20
Standard Deviation             20.0   42.5   21.2    5.0
Average                           5     27     26     25
Coefficient of Variation       4.00   1.59   0.82   0.20
Median                            0      0     15     25
Minimum                           0      0     10     20
Maximum                         100    100     80     30




Issues and benefits conflated with spiky, bursty applications moving to the cloud

Fellow poster Simon Munro made some excellent follow-up comments to my recent post about enterprise applications qualifying for the cloud. I’ve tried to mangle the offline conversation into a post:

Simon was remarking that there are some other qualities conflated with spikiness that are perhaps easily ignored:

Performance – you cannot assume that the required performance is fixed. During peak periods users may tolerate slower performance (and not even notice it). You would have to include something like ApDex in order to get a better sense of the impact of the spike.

I think this is a very good point and, ironically, I think it cuts both ways. Some apps just aren’t worth the extra resources needed to maintain peak performance. However, for some apps the rise in demand might indicate that you should make the performance even better, as this is the critical moment; the moment that really counts. For example: customer loyalty, serving really stressed users who only care about this system RIGHT NOW, or even responding to a crisis.

Base Cost – particularly with commercial enterprise software you have base situations where the cost and load relationship is non-linear. Think of the cost of a base Oracle setup, with backup licences, installation costs, DBAs – where the cost is fixed, regardless of whether you are putting n or 3n transactions through.

Another excellent point, although hopefully this will change over time in response to the cloud. Even now we have Oracle on AWS RDS / AWS DevPay, or SQL Server via SQL Azure.

Simon then introduces us to a new term:

Business Oriented Service Degradation. I did remark that this concept is sort of covered in some SLAs and SLOs, but this is way cooler ;) Simon’s point is that when an accounting system does its end-of-month run (the spike), the ability to process other transactions is irrelevant because the system is ‘closed’ for new transactions by the business anyway.

Sometimes I wonder if the constraints of the past are cast into the DNA of those operating models. This month is closed, but plenty of people could be entering new transactions for the next quarter. Is it possible that resource constraints meant this was historically a better solution?

The point remains though: if the spike is huge and infrequent, there is a mismatch between the total resources deployed (cost) and the level of utilisation. That means waste.

Interestingly, if the utilisation sits at 100% for extended periods there is also a case for giving the system access to more resources. Clearly, with more resources, the system could complete the run far sooner. Would there be benefit in that? Enabling “closing” the month even later than usual and capturing more transactions? A better compliance rating by never being late, despite failed runs and reruns?



Just because your application would never earn a place on High Scalability doesn’t mean it won’t qualify for cloud.

Working with large enterprises with 1000s of applications, it is useful to assess hundreds of applications at a time to determine which ones will benefit from moving to the cloud. The first iteration, to reveal the “low hanging fruit”, requires a fairly swift “big handfuls” approach. One of the most commonly stated economic benefits of the cloud is elastically coping with “spiky” or “bursty” demand, so whilst we scan through the application portfolio one of the things we have to spot is applications expressing this characteristic.

I first want to tackle one of the, ever so slightly, frustrating consequences of the original cloud use cases and case studies. You know – the one where some image processing service goes bananas on Sunday night when everyone uploads their pictures, and they scale up with an extra 4000 nodes for a few hours to cope with the queue. The consequence is that people now perceive only such skewed scenarios as cloud candidates along the “spiky” dimension. Hence my parting note in my recent post on Evernote, Zynga and Netflix – even normal business systems express spiky behaviour when you consider the usage cycle over a day, month or year. This is one of the reasons, despite all the virtualisation magic, for the still low average utilisation figures in the data centre.

Some classic mitigations for this (excluding cloud) are to distribute the load more evenly across time and resources e.g.:

  • run maintenance tasks such as index rebuilds, backups and consistency checks during the quiet periods
  • build decision support data structures overnight and prepare the most common or resource intensive reports before people turn up to work
  • make the application more asynchronous and use queues to defer processing to quieter periods or on to other resources (sketched after this list)
  • use virtualisation to consolidate VMs on to fewer physical hosts during quiet periods so you can power off hosts (those hosts are now idle assets returning no business value, but at least they aren’t consuming power/cooling)
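As a hypothetical illustration of the queueing option above (in a real system the queue would be something durable like SQS or MSMQ rather than in-process, and the names here are invented):

```python
import queue

deferred = queue.Queue()  # stand-in for a durable message queue

def handle_signup(email_address):
    # Peak path: acknowledge immediately, defer the expensive work.
    deferred.put(("send_confirmation", email_address))
    return "signed up"

def overnight_worker():
    # Quiet-period path: drain everything deferred during the day.
    while not deferred.empty():
        task, payload = deferred.get()
        print(f"processing {task} for {payload}")

handle_signup("user@example.com")
overnight_worker()
```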

Despite all this, many applications continue to exhibit spiky behaviour – just not in the extreme headline sense that we see on High Scalability.

In a cloud assessment you just want to identify this behaviour as one reason (among many) to put an application on your list of candidates for further study and business justification. In a cloud migration, some applications may be immune to the bulleted options above for a number of reasons. Anyway, those techniques will work nicely in the cloud too, of course.



Cloud costs are an engineering problem

Head over to the new Google App Engine pricing that will come into effect when App Engine comes out of preview later in the year and you will see a list of prices similar, in format at least, to pricing for AWS, Azure and other cloud providers. That seems fairly straightforward until you look at the FAQ that describes the pricing in more detail; while it answers a lot of questions, its explanations give rise to even more.

It seems that Google is switching from a CPU based pricing model to an instance based one, but there are differences between frameworks – Java handles concurrent requests while Python and Go do not (yet). In addition, the FAQ observes that the change in pricing will affect current apps that are memory heavy because they were designed to optimise for the CPU pricing, and they may land up being more expensive under the new model. Then there are reserved instances, API charges, bandwidth, premier accounts and a whole lot of other considerations to add to the confusion. Even if you are not interested in App Engine it is a worthwhile read.

I have done and seen a few spreadsheets that try to work out hosting costs for cloud computing, and they reach a point of complexity, with so many unknowns, where it becomes very difficult to go to the business with a definitive statement on how much it will cost to run an application. This is particularly difficult when development hasn’t even started, so there is no indication of the architectural choices (say, memory over CPU) that affect the estimates. While AWS may be easier in some sense, because the instance is a familiar unit (a machine yay big that we put stuff on), there are still many considerations that affect the cost of hosting. Grace and I struggled with a particular piece of SOLR availability and avoided using a load balancer for internal traffic until we ran the numbers, worked out that it would cost pennies per day in bandwidth, and decided to use ELB after all – and that is one of the simpler pricing-driven architectural decisions. Trying to build a scalable architecture out of loosely coupled components that makes optimal use of the resources available is very difficult to do.
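That ELB calculation was back-of-envelope arithmetic of roughly the following shape. Every number below is an assumed placeholder rather than a real AWS price; the point is the shape of the sum, not the figures:

```python
# Back-of-envelope load balancer costing. All rates and volumes are
# assumptions for illustration; check current AWS pricing before use.
HOURLY_RATE = 0.025        # $/hour per load balancer (assumed)
PER_GB_RATE = 0.008        # $/GB of traffic processed (assumed)
internal_gb_per_day = 5    # internal SOLR traffic volume (assumed)

bandwidth_cost = internal_gb_per_day * PER_GB_RATE   # the "pennies per day"
daily_total = 24 * HOURLY_RATE + bandwidth_cost
print(f"bandwidth ${bandwidth_cost:.2f}/day, total ${daily_total:.2f}/day")
```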

We could ask vendors for better or more flexible pricing models. We could have estimating tools that allow us to estimate costs based on a choice of ‘similar’ application models. We could trade SLAs for cost, as S3 reduced redundancy does. We could hedge our costs using reserved instances. We could run simulations (given on-demand availability this is relatively easy). We could have better tools to analyse our bills (as Quest has for Azure). We need all of this, but ultimately the pricing of cloud computing is going to remain complex, and will increase in complexity in future, leaving the big decisions up to the technical people doing the implementation.

Cloud expertise needs to extend beyond knowing your IaaS from your SaaS, and experts need to have a handle on all aspects of cloud computing architectures, for a specific platform, in order to realise the benefits that cloud computing promises. In the context of developers being the new kingmakers, it is developers, software architects and DevOps who are the only ones close enough to the metal to make the decisions that ultimately affect the cost. Where currently developers optimise at the cost of development time (which is largely discouraged), in future we may want developers to optimise CPU against memory against latency against bandwidth against engineering effort, and even, at a push, against environmental friendliness. Let’s not even get into having to adapt to providers changing pricing models periodically. It is going to take some serious skill to pull that together – from the entire team.

So while the cloud computing marketers make it sound easy to put our apps onto the cloud, there is a long road ahead in developing the necessary skills to ensure that it is done optimally and at a cost that is reasonable across the life of the application. There are business cases that could collapse under spiralling cloud costs if we pull one lever incorrectly.

Simon Munro




The Importance of Custom Monitoring Metrics

Part of the cloud computing business case is the optimal use of compute resources, and in order to optimise that use you need to monitor how they are being used. While technical people may be interested in CPU load or memory utilisation, meaningful translation of those numbers into something that supports the business case is virtually impossible.

Business needs to know how the system is performing according to business needs, and is interested in page response times, the number of ‘things’ processed per minute or second, the number of fallbacks to degraded service, and so on. There should be a direct correlation between spinning up a new instance, or making an application change, and user satisfaction (even using something like the ApDex standard).
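For reference, the ApDex score is simple to compute: responses at or under a target time T count as satisfied, responses up to 4T as tolerating, and the score weights them accordingly. A quick sketch:

```python
def apdex(response_times, target=0.5):
    """Apdex score: (satisfied + tolerating/2) / total samples."""
    satisfied = sum(1 for t in response_times if t <= target)
    tolerating = sum(1 for t in response_times if target < t <= 4 * target)
    return (satisfied + tolerating / 2) / len(response_times)

print(apdex([0.2, 0.4, 0.9, 1.8, 2.5]))  # 0.6 — two satisfied, two tolerating
```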

I have been using New Relic and think that the product and the company are great. New Relic gives so much valuable and well presented data for no engineering cost that it is, in my opinion, a no-brainer for any web based application. New Relic’s roots are in web app monitoring and that is where they excel, but when I was evaluating it I needed to be able to monitor back end services (queue readers) and cron jobs. After months of hounding for the inclusion of custom metric monitoring in the .NET API for New Relic, I finally had a chance to put it to use after its release a few weeks ago. The New Relic API allows custom metrics to be added in the application, which are easily graphed and viewable.

Yesterday AWS announced Custom Metrics for CloudWatch, adding the ability to record custom application metrics that are collected via a very simple API. While New Relic has the advantage of an agent that sends the data using a separate process, the CloudWatch metrics have the advantage of the endpoint being in the same data centre, which may negate the need for a separate process. There would also still be work for application developers to do to monitor web pages in CloudWatch in a way that comes for free with New Relic – you would have to track the beginning and end of a request in CloudWatch to monitor page rendering time.
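To give a flavour of how simple the API is, here is a sketch using the present-day boto3 SDK (which postdates this post); the namespace and metric name are invented:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Push one data point for a custom application metric.
cloudwatch.put_metric_data(
    Namespace="MyApp",               # hypothetical namespace
    MetricData=[{
        "MetricName": "QueueDepth",  # hypothetical back end metric
        "Value": 42,
        "Unit": "Count",
    }],
)
```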

The big advantage of CloudWatch custom monitoring is the ability to use the data for auto scaling, something that you could never get with a third party such as New Relic. For example, if you can directly correlate, say, the number of instances with page response times, you can set an auto scale rule that adds and removes instances as necessary.
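A sketch of such a rule, again with boto3 and invented names: the alarm watches the custom metric and, when it stays above the threshold, fires a pre-created scaling policy (the ARN here is a placeholder):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="PageResponseTimeHigh",   # hypothetical alarm name
    Namespace="MyApp",
    MetricName="PageResponseTime",      # hypothetical custom metric
    Statistic="Average",
    Period=60,                          # seconds per datapoint
    EvaluationPeriods=5,                # sustained for five minutes
    Threshold=1.5,                      # assumed response time target (s)
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:autoscaling:..."],  # placeholder scale-out policy ARN
)
```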

In order to automate applications as much as possible, which is one of the things we strive for, monitoring data that has a context closer to the business is crucial, and is an important component of first line support. As much as I like New Relic, the tie-up between monitoring and the rest of the infrastructure (such as auto scaling) is so important that AWS, if they continue to add these sorts of features to CloudWatch, will win eventually.


Simon Munro



AWS Architecture Guidance Is Hard

Hot on the heels of the 21 April AWS outage comes the AWS Architecture Centre, where AWS, it seems, is trying to fill in the gaps on the best way to architect applications to avoid the impact of similar failures. Simple things like, you know, not putting everything in a single region, because Availability Zones, despite what is implied and what we have seen, do share common infrastructure.

The AWS Architecture Center is designed to provide you with the necessary guidance and best practices to build highly scalable and reliable applications in the AWS Cloud. These resources will help you understand the AWS platform, its services and features, and will provide architectural guidance for design and implementation of systems that run on the AWS infrastructure.

The collection of documents is somewhat anemic given how feature rich AWS is, and the webinars, one of which I sat in on today, are at a fairly basic level where the really hard bits, like data, are skirted. To be fair to AWS, teaching customers how to build cloud applications on top of AWS is an enormous task. Not only do developers and architects need to learn about AWS, they first have to learn about building cloud applications in general. Not many people know how to build applications that are highly scalable, available and cost effective (e.g. by using commodity services and infrastructure) on any platform, never mind AWS.

One of the things that the outage showed is that we need to take more responsibility for the availability (or recoverability) of the systems that we build on top of AWS and we cannot rely too heavily on the underlying infrastructure. Responsibility for availability is moving higher up the stack. In cloud computing we cannot buy expensive products and spend a fortune on networking, configuration and a lucrative support contract to make sure that an application, with its data, is replicated across regions – we need to build some of it into the application.

While this may be easy enough conceptually, it becomes difficult when considering all available options. With virtually unlimited choice of operating systems, application stacks and data stores, how do you write guidance documentation that covers all scenarios? Even if you are able to come up with the best technical recommendations, the financial ‘it depends’ throws it into disarray. Helping people figure out what the considerations are at the application level is a lot more difficult than selling products.

Microsoft, with Windows Azure, seems to have an easier time. The stack is simpler; customers are better understood and Microsoft has a closer relationship with them; Microsoft developers tend to follow Microsoft’s (and each others’) guidance; Microsoft has a bigger community and more money to spend on people producing documentation and examples; and of course the Windows Azure PaaS product is (by design) more restrictive.

Although AWS may not be able to, erm, scale to the requirement of putting out architectural guidance, the collective capability of their customers is huge. Vendors are an important source of guidance on particular products; for example, 10gen puts out a lot of guidance on getting MongoDB running on AWS. Customers that didn’t fail in the outage have been describing some of what they do to ensure availability, which provides great insight into the real implementation of architectural principles. Even small customers, who simply have an efficient recovery plan, have something to add to the conversation.

Amazon needs to continue more aggressively with the architecture guidance that they have started, but also needs to help collate guidance and become a source of links for credible AWS guidance scattered in forums and low visibility blogs. I hope that they put some effort and resources into this area, because architectural guidance for AWS is hard, and the fallout from the outage proved that there is high demand.

Simon Munro



How AWS learns from a mistake

I have just read the detailed and very long post mortem from Amazon regarding their outage in the US-East region. It turns out that it was a combination of operator error and software configuration issues that caused the problem. The post mortem details what happened, what they will do to prevent the issue happening again, a couple of promised upgrades (VPC across multiple AZs, for starters) and a new architecture centre to help their customers.

The winners here are the AWS customers, with the promised improvements in Availability Zones, better communication and a whole slew of new webinars to help with architecting for the AWS cloud. All of which, you know, will be delivered in a timely manner, and will yet again put AWS even further out in front of the chasing pack.

In addition, Amazon took the opportunity with this post mortem to respond to their critics with a description of how EBS actually works, a promise to improve communication, and apologies. They also explained that everyone right up the chain was involved (hence the near silence during the outage).

Humbled but not down, more chasing for the pack…


