Archive for category Cloud Economics

AWS doesn’t want you to cloud burst

AWS doesn’t want cloud bursting, where customers come to the platform only to retreat to their on-premises infrastructure when the work is done. AWS is keen to get you to commit to longer-term reserved instances and is adjusting its pricing accordingly.

There is nothing like a price drop in Jeff Barr’s midnight AWS announcements to get everybody excited in the morning. What interests me is not the price drop per se, nor the inevitable demand for competitors to follow, nor the back-and-forth comparison with traditional managed hosting that will get underway (again). What is interesting is the widening differential between the price drops for on demand versus reserved instances.

As James has pointed out, reserved instance pricing is important when developing a cost model for cloud applications over time, but gathering the data and doing the analysis can get a bit tricky. The EC2 reserved instance pricing page reckons that reserved instances will save between 50% and 70% of your EC2 costs (RDS offers similar savings for reserved databases), which is a compelling proposition.

Anecdotal evidence (read “I may or may not have heard it somewhere”) suggests that a lot of AWS business is cloud bursting – where AWS is used not as the primary platform, but as one used occasionally. It would also seem that the ‘occasionally’ refers to development and test capacity, rather than an architecture engineered to use AWS as a true cloud bursting platform for a production system. By creating a huge differential between on demand and reserved instances, AWS may be presenting the teaser to convert those ‘cloud burst’ uses into long term deployments. After all, if it is cheap enough to use on demand (which it must be, otherwise customers would build their own), then being able to drop the price of something that customers know works (technically) by an additional 50% may push the decision makers in favour of AWS. But those savings only come with long-term commitments (three years), which is enough time to ensure that the particular application is committed to AWS, with others to follow.
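To make the shape of that trade-off concrete, here is a minimal sketch of the on-demand versus reserved comparison. The rates, upfront fee and utilisation levels below are hypothetical placeholders rather than actual AWS prices; the point is simply that the reservation pays off more the more hours you run.

    # Hypothetical figures for illustration only -- not actual AWS prices.
    HOURS_PER_YEAR = 8766  # average year, including leap years

    def on_demand_cost(hourly_rate, hours_used):
        """Pay-as-you-go: you only pay for the hours you actually run."""
        return hourly_rate * hours_used

    def reserved_cost(upfront_fee, hourly_rate, hours_used, term_years=3):
        """Reserved: the upfront fee amortised over the term plus a lower hourly rate."""
        return upfront_fee / term_years + hourly_rate * hours_used

    # A hypothetical $0.38/hr on-demand instance versus a 3-year reservation
    # at a hypothetical $1,200 upfront and $0.13/hr.
    for utilisation in (0.25, 0.50, 1.00):
        hours = HOURS_PER_YEAR * utilisation
        od = on_demand_cost(0.38, hours)
        ri = reserved_cost(1200, 0.13, hours)
        print(f"{utilisation:.0%} utilisation: on demand ${od:,.0f}/yr, reserved ${ri:,.0f}/yr")

The higher the utilisation, the more heavily the sums favour the reservation, which is exactly the behaviour a cloud-bursting customer is being nudged towards.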

While reductions in cloud costs are welcome, the fundamental problem of figuring out the costs, cost benefits and ‘in the box’ comparisons continues to be difficult. I have discussed this as an engineering problem, and people always wade dangerously into the debate, as Jeff Barr did recently; there is some tricky analysis required. Pricing is still far too complex, and the models too immature to be convincing in front of the CFO, or whoever controls the budget, and private cloud advocates, who have been doing pricing for a while, can almost always bulldoze a public cloud pricing model. Rather than a few more pennies shaved off EC2, I would like to see some rich, capable pricing tools and models emerge.

Simon Munro

@simonmunro


More Cloud Computing Economics

I have just come across a site dedicated to in-depth studies of the economics of cloud computing: http://cloud-computing-economics.com/

One of their authors touched on a pet subject of mine (Spot pricing) with a paper on how to deliver a robust SaaS whilst using IaaS and Spot instances to maximise revenue: http://cloud-computing-economics.com/business-benefits-applications/delivering-reliable-services-spot-instances/

@jamessaull


Doing BI to hunt for Spot Price trends: only for low margin high scale tasks

A couple of hours in the evening meddling with Python and Excel, using the AWS Spot price history as a dataset, and I have concluded that a little data mining is probably only worth doing if you are going to consume at a scale where a few cents per hour could justify the cost of the analysis. Continuously updating the data and observing trends along the dimensions of region, zone, day of week, hour of day and instance type could yield reasonable savings; and whilst it might not matter much now, who is to say more significant fluctuations and trends won’t emerge in the future?
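For anyone who wants to repeat the exercise, here is a minimal sketch of that sort of aggregation using pandas (my original meddling was Python and Excel). It assumes the price history has been exported to a CSV; the file name and column names (Timestamp, SpotPrice, AvailabilityZone, InstanceType) are assumptions for illustration, not a fixed AWS format.

    import pandas as pd

    # Assumed CSV export of the Spot price history; adjust names to your export.
    history = pd.read_csv("spot_price_history.csv", parse_dates=["Timestamp"])

    history["DayOfWeek"] = history["Timestamp"].dt.day_name()
    history["HourOfDay"] = history["Timestamp"].dt.hour

    # Mean price along the dimensions of interest: instance type, zone,
    # day of week and hour of day.
    by_day_hour = (
        history.groupby(["InstanceType", "AvailabilityZone", "DayOfWeek", "HourOfDay"])
               ["SpotPrice"]
               .mean()
               .unstack("HourOfDay")
    )
    print(by_day_hour.round(4))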

My analysis didn’t reveal any gobsmacking trends – maybe AWS suppresses that sort of thing through the way it operates Spot pricing. Sure, we can see that Tuesday 2AM UTC in the EU generally yields a poor price but that by 10AM it is generally the best price. Overall, these variations are only worth tracking when amplified by scale.

The chart below shows spot instance pricing for m2.xlarge in the EU, by day of the week, between 9th Feb 2011 and 9th May 2011. There is nothing immediately to suggest that AWS, for example, has huge amounts of excess capacity on Sundays (which would be evident from an absence of plots at the top right), and nothing to suggest constrained capacity on Mondays (which would show as an absence of plots at the bottom left). Most instance types show this pattern. The big price void in the middle is discussed in more detail in a previous post. In the grand scheme of things it looks as if AWS Spot prices flip between a low price and a high price – but one pretty much always better than the On Demand price.

[Chart: Spot price by day of week, m2.xlarge, EU, 9 Feb – 9 May 2011]

Looking at the Spot prices by hour of the day also does not immediately show any long-term pattern of abundance or scarcity during the day.

[Chart: Spot price by hour of day, m2.xlarge, EU, 9 Feb – 9 May 2011]

Transforming the Spot data from Amazon and feeding it into a pivot table chart, we can overlay some trend lines. It would seem that Monday in general gets more expensive as the day progresses and Tuesday declines in price. Therefore, in general, the better prices would be found on Monday mornings and Tuesday afternoons. In reality this may only account for a few cents per instance-hour, so it matters only to those consuming at scale – supplementing an EMR cluster with an extra 50 Spot instances for 8 hours per day over a year is 146,000 instance hours. Whilst this would represent thousands of dollars of savings over On Demand instances, you’d need to be working at much larger scale to justify fussing over which part of which day typically gives the best prices.
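Working that example through with placeholder prices (the rates below are made up for illustration, not quoted AWS prices):

    # 50 extra Spot instances, 8 hours per day, every day for a year.
    instance_hours = 50 * 8 * 365          # 146,000 instance-hours

    # Hypothetical rates in $ per instance-hour, for illustration only.
    on_demand_rate = 0.34
    typical_spot_rate = 0.12
    timing_gain = 0.02                     # extra saving from picking the 'best' day/hour

    spot_saving = instance_hours * (on_demand_rate - typical_spot_rate)
    timing_saving = instance_hours * timing_gain

    print(f"{instance_hours:,} instance-hours")
    print(f"Spot versus On Demand saving: ~${spot_saving:,.0f}")
    print(f"Extra saving from timing the purchases: ~${timing_saving:,.0f}")

With those placeholder numbers the timing saving is an order of magnitude smaller than the saving from simply using Spot at all, which is why it only matters at serious scale.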

The chart below shows the effective aggregated cost per hour per day of the week over the sample period – e.g. the aggregate cost of running an instance from 9AM – 10AM every Monday. The blue trend line is Monday, and the red trend line is Tuesday.

[Chart: aggregated cost per hour by day of week over the sample period, with trend lines for Monday (blue) and Tuesday (red)]

 

@jamessaull


Qualifying and quantifying “spiky” and “bursty” workloads as candidates for Cloud

Enterprises are looking to migrate applications to the cloud, and those with thousands of applications require a fast, consistent and repeatable process to identify which ones stand to benefit. One of the benefits of cloud is that on-demand elasticity of seemingly infinite resources can be an advantage for “spiky” or “bursty” workloads. But as I mentioned in a recent post, people may have very different views of what constitutes “spiky”, which could yield very unpredictable results when trying to identify those cloud candidates.

It would be useful to have a consistent and repeatable way to determine whether an application was spiky.

We normally translate spiky and bursty to “high variability”, which is a useful term as it points to the statistical methods by which we can assess whether our utilization patterns match this description and hence could benefit from the cloud.

Consider the utilization graphs below (numbers displayed beneath). It could equally be transactions per second – just don’t mix and match.

  1. “Very High” line shows a single, short lived spike.
  2. “High” shows two slightly longer lived spikes.
  3. “Mild” shows the same two spikes, less exaggerated and decaying to a non-zero utilization.
  4. “None” shows small fluctuations but no real spikes – essentially constant utilization.

2, 3 and 4 actually have the same average utilization of c. 25%. This means they consume the same amount of compute cycles during the day irrespective of their pattern; however, to cater for “High” we have to provision roughly 3x the capacity needed to service the peak of “None” – leaving resources underutilized most of the time. The curse of the data centre.

Clearly the Average utilization isn’t the whole picture. We need to look at the Standard Deviation to see into the distribution of utilization:

  • High = 42.5
  • Mild = 21.2
  • None = 5.0

Excellent, so the Standard Deviation is starting to show which ones are variable and which are not. But what is the Standard Deviation of the spikiest load of all, “Very High”? Only 20.0? Much the same as the “Mild” line! The final step is to look at the Coefficient of Variation, which is the ratio of the Standard Deviation to the Mean. The Coefficient of Variation is:

  • Very High = 4.00
  • High = 1.59
  • Mild = 0.82
  • None = 0.20

@grapesfrog asked me to describe something like AWS Elastic MapReduce in these terms. Think of EMR as utilisation of {0,0,0,0,5000,0,0,0,0,0,0,0,0,0,0,0…}, where 5000% utilisation is 50 machines each at 100% utilisation. So if you used 50 machines at 100% for one hour every day your CV would be 4.8. If you used them for one hour every month your CV would rise to 27.2.
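For anyone who wants to check the arithmetic, here is a minimal sketch that reproduces the figures above from the hourly series in the table at the end of this post, using the population standard deviation. The monthly EMR figure assumes a 31-day month, so it lands a whisker above the 27.2 quoted.

    from statistics import mean, pstdev

    def cv(samples):
        """Coefficient of Variation = population standard deviation / mean."""
        return pstdev(samples) / mean(samples)

    # Hourly utilisation series from the table below.
    very_high = [0] * 13 + [10, 100, 10] + [0] * 8
    high = [0] * 6 + [10, 100, 100, 100, 10] + [0] * 6 + [10, 100, 100, 100, 10] + [0, 0]
    mild = [10, 10, 10, 10, 10, 15, 20, 50, 80, 50, 20, 15,
            15, 15, 15, 15, 15, 20, 50, 80, 50, 20, 15, 10]
    none = [30, 20] * 12

    for name, series in [("Very High", very_high), ("High", high),
                         ("Mild", mild), ("None", none)]:
        print(f"{name:9s} mean={mean(series):5.1f} sd={pstdev(series):5.1f} cv={cv(series):.2f}")

    # EMR-style burst: 50 machines at 100% (i.e. 5000%) for one hour, idle otherwise.
    daily_burst = [5000] + [0] * 23
    monthly_burst = [5000] + [0] * (31 * 24 - 1)   # assuming a 31-day month
    print(f"daily burst cv={cv(daily_burst):.1f}, monthly burst cv={cv(monthly_burst):.1f}")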

Conclusions:

Comparing the mean utilization allows us to compare the relative amount of resource used over a period of time. It showed that nearly all of them consumed the same amount of resources, with the very noticeable exception of the spikiest of them all, which actually consumed very little resource in total.

Comparing the coefficient of variation reveals the spikiest workload. In this example, the spikiest would require 3x the resources of the least spiky to service the demand, BUT would only actually consume 20% of the resources consumed by the least! Sometimes this is the point: the spikiest loads require the largest amount of resources to be deployed but can actually consume the least.

Further work:

Our assessment could state that “any workload showing a CV > 0.5” is a candidate for cloud – revealing applications with spiky intra-day behaviour as well as the month-end classics.
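In practice the portfolio scan then reduces to a simple filter over per-application CVs; the application names and numbers below are made up for illustration.

    # Hypothetical per-application CVs gathered from utilisation monitoring.
    portfolio_cv = {
        "month-end-billing": 3.1,
        "intranet-wiki": 0.3,
        "order-capture": 0.7,
        "reporting-cube-build": 1.4,
    }

    CANDIDATE_THRESHOLD = 0.5  # "any workload showing a CV > 0.5"
    candidates = {app: v for app, v in portfolio_cv.items() if v > CANDIDATE_THRESHOLD}
    print(candidates)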

Workloads that oscillate at high frequency between extremes may show a CV > 0.5, but there we begin to trespass on topics such as deadbeat control within control theory, and will start to challenge the time resolution of cloud monitoring and control. I’ll leave it there for the time being though…

[Chart: hourly utilisation for the “Very High”, “High”, “Mild” and “None” workloads – data in the table below]

Hour                        Very High   High   Mild   None
1                                   0      0     10     30
2                                   0      0     10     20
3                                   0      0     10     30
4                                   0      0     10     20
5                                   0      0     10     30
6                                   0      0     15     20
7                                   0     10     20     30
8                                   0    100     50     20
9                                   0    100     80     30
10                                  0    100     50     20
11                                  0     10     20     30
12                                  0      0     15     20
13                                  0      0     15     30
14                                 10      0     15     20
15                                100      0     15     30
16                                 10      0     15     20
17                                  0      0     15     30
18                                  0     10     20     20
19                                  0    100     50     30
20                                  0    100     80     20
21                                  0    100     50     30
22                                  0     10     20     20
23                                  0      0     15     30
24                                  0      0     10     20

Standard Deviation               20.0   42.5   21.2    5.0
Average                             5     27     26     25
Coefficient of Variation         4.00   1.59   0.82   0.20
Median                              0      0     15     25
Minimum                             0      0     10     20
Maximum                           100    100     80     30

 

@jamessaull


Brave New World or?

Google’s recent announcement that they would be discontinuing support for older browsers could be seen either as a brave move in an ever-changing world or as a rather risky one. Brave, because it reduces their support overhead and hopefully encourages people to migrate away from older browsers in a timely manner, which is good for everyone involved, right? A safer browsing experience, consistency, compatibility with the latest standards – I could go on. BUT it’s also very risky, because the guys with the bucks do not move that quickly. I was recently reminded of this when catching up with some old colleagues and was stunned to hear that they are sticking with Windows XP, as their internal apps do not work properly with the newer operating systems – or, come to mention it, the newer browsers; IE6 will do nicely, thank you very much. (According to recent figures IE6 still has 10% of the browser market and IE7 has around 7%, so Google is removing support for about 17% of the market.) It’s not that they are Luddites – far from it, they are a thriving business and seen as leaders in their field – but their field isn’t keeping up with the latest innovations, it’s making money in their business of choice; IT for internal users is there to support the business. The overhead of rolling out new versions of software, and the testing involved when you have hundreds of apps and thousands of users to support, is a daunting prospect, so sticking with what works is a nice safe option.

Now, I’m not sure how many medium to large corporates use Google Apps in anger, but Google are clearly trying to drive forward into the brave new world regardless. I guess they will have done their research and figured that it’s a risk worth taking and that the eventual rewards will justify it. I’m sure Microsoft, IBM and the like are happy that Google are the ones to have made this move, so they don’t take the flak. They want you all to move forward too, but it would hit their bottom line faster than it would hit Google’s – and Google, being a fairly new boy on the block, can afford to make radical decisions like this.

Grace Mollison


Issues and benefits conflated with spiky, bursty applications moving to the cloud

Fellow cloudcomments.net poster Simon Munro made some excellent follow-up comments to my recent posts about enterprise applications qualifying for the cloud. I’ve tried to mangle the offline conversation into a post:

Simon was remarking that there are some other qualities conflated with spikiness that are perhaps easily ignored:

Performance – you cannot assume that the required performance is fixed. During peak periods users may tolerate slower performance (and not even notice it). You would have to include something like Apdex in order to get a better sense of the impact of the spike.

I think this is a very good point and, ironically, it cuts both ways. Some apps just aren’t worth the extra resources to maintain peak performance. However, for some apps, the rise in demand might indicate you should make the performance even better, as this is a critical moment – the moment that really counts. For example: customer loyalty, serving really stressed users who only care about this system RIGHT NOW, or even responding to a crisis.
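For reference, here is a minimal sketch of the Apdex calculation Simon alludes to: responses at or under a target time T count as satisfied, responses up to 4T as tolerating, and the rest as frustrated. The sample response times below are made up.

    def apdex(response_times, target):
        """Apdex score = (satisfied + tolerating / 2) / total samples."""
        satisfied = sum(1 for t in response_times if t <= target)
        tolerating = sum(1 for t in response_times if target < t <= 4 * target)
        return (satisfied + tolerating / 2) / len(response_times)

    # Made-up samples (seconds): the same service in a quiet period and during a spike.
    quiet = [0.3, 0.4, 0.35, 0.5, 0.45]
    spike = [0.9, 1.6, 2.1, 0.8, 3.5]
    print(apdex(quiet, target=0.5), apdex(spike, target=0.5))   # 1.0 vs 0.3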

Base Cost – particularly with commercial enterprise software you have situations where the cost and load relationship is non-linear. Think of the cost of a base Oracle setup, with backup licences, installation costs and DBAs – where the cost is fixed regardless of whether you are putting n or 3n transactions through.

Another excellent point, although hopefully this will change over time in response to the cloud. Even now we have Oracle on AWS RDS or AWS DevPay, and SQL Server via SQL Azure.

Simon then introduces us to a new term:

Business-Oriented Service Degradation. I did remark that this concept is sort of covered in some SLAs and SLOs, but this is way cooler ;) Simon’s point is that when an accounting system does its end-of-month run (the spike), the ability to process other transactions is irrelevant because the system is ‘closed’ for new transactions by the business anyway.

Sometimes I wonder if the constraints of the past are cast into the DNA of those operating models. This month is closed but plenty of people could be entering new transactions for the next quarter. Is it possible the resource constraints meant that this was historically a better solution?

The point remains, though: if the spike is huge and infrequent, there is a mismatch between the total resources deployed (cost) and the level of utilisation. That means waste.

Interestingly, if utilisation sits at 100% for extended periods there is also a case for giving the system access to more resources. Clearly, with more resources, it could complete the run far sooner. Would there be benefit in that? Enabling the month to be “closed” even later than usual, capturing more transactions? A better compliance rating by never being late despite failed runs and reruns?

@jamessaull


Just because your application would never earn a place on highscalability.com doesn’t mean it won’t qualify for cloud.

Working with large enterprises with thousands of applications, it is useful to assess hundreds of applications at a time to determine which ones will benefit from moving to the cloud. The first iteration, to reveal the “low hanging fruit”, requires a fairly swift “big handfuls” approach. One of the most commonly stated economic benefits of the cloud is elastically coping with “spiky” or “bursty” demand, so whilst we scan through the application portfolio one of the things we have to spot is applications exhibiting this characteristic.

I first want to tackle one of the ever-so-slightly frustrating consequences of the original cloud use cases and case studies. You know – the one where some image processing service goes bananas on Sunday night when everyone uploads their pictures, and they scale up with an extra 4,000 nodes for a few hours to cope with the queue. The consequence is that people now perceive only such skewed scenarios as cloud candidates along the “spiky” dimension. Hence my parting note in my recent post on Evernote, Zynga and Netflix – even normal business systems exhibit spiky behaviour when you consider the usage cycle over a day, month or year. This is one of the reasons, despite all the virtualisation magic, that average utilisation figures in the data centre are still so low.

Some classic mitigations for this (excluding cloud) are to distribute the load more evenly across time and resources e.g.:

  • run maintenance tasks such as index rebuilds, backups, consistency checks during the quiet periods
  • build decision support data structures overnight and prepare the most common or resource-intensive reports before people turn up to work
  • make the application more asynchronous and use queues to defer processing to quieter periods or on to other resources
  • use virtualisation to consolidate VMs onto fewer physical hosts during quiet periods so you can power off hosts (those hosts are now idle assets returning no business value, but at least they aren’t consuming power/cooling)

Despite all this, many applications continue to exhibit spiky behaviour – just not in the extreme headline sense that we see on highscalability.com.

In a cloud assessment you just want to identify this behaviour as one reason (among many) to put an application on your list of candidates for further study and business justification. In a cloud migration, some applications may not be amenable to the bulleted options above for a number of reasons. Anyway, those techniques will of course work nicely in the cloud too.

@jamessaull


To cloud or not to cloud? Evernote, Zynga and Netflix

Recently I read a couple of articles on highscalability.com – one about Evernote and one about Zynga – and I’d love to see their TCO models. Evernote have chosen not to use a cloud, instead opting to architect, build, operate and maintain the infrastructure themselves. Zynga have chosen to use the AWS cloud initially and then, once a game’s usage growth and pattern stabilises, move it onto their own cloud that mimics AWS (they don’t name Eucalyptus or anything).

Both these offerings are from relatively new companies and are pure web, high scale, green field applications. No legacy. So a couple of thoughts occurred to me about Evernote:

  • This is an advanced engineering team with a “not invented here” attitude. They want to piece together all the excellent open source components and control every last element. Perhaps they are “controlling costs” but not really thinking about Total Cost of Ownership.
  • Actually, they are totally dispassionate about technology but maniacally obsessed with cost. They crunched the numbers and found that they are significantly better off building, operating and maintaining all the hardware, software and people themselves, even after weighing up all the qualitative benefits of cloud.

With Zynga I thought:

  • Once you have a sustained high utilisation pattern for a game you shift it to on-premise fixed infrastructure that you can provision like crazy. Cool. What happens when the game goes completely out of favour at the whim of the web and declines at web-pace?

At the same time we hear plenty from Netflix about their commitment to the cloud.

I’d love to see more detail on the TCO models these companies have built around the cloud and how the variables tipped each of them towards such different conclusions.

As an aside, I often hear the statement “we don’t have spiky usage”, as if spikiness were the main economic driver for cloud adoption. What do people mean by that? Do they mean an extra 1,000 VMs for a couple of hours? An extra petabyte of storage for a week? It’d be useful if people started to qualify these statements.

Even regular enterprise apps (perceived to have non-spiky usage) may only be in use for eight hours per day, with a few intra-day surges. That means for nearly 66% of each day the systems are significantly underutilised – maybe not totally idle. Yet this is not really seen as “spiky” by most. However, if I can conservatively turn off half of my servers for half of the time, that could make a significant cost saving when amplified by the scale of a large enterprise.

@jamessaull

