James Saull


Homepage: http://jamessaull.wordpress.com

Simple Static Web Hosting: EC2 vs. S3

The question: should you use S3 to host a high-volume static web site, or should you configure and operate a load-balanced, auto-scaling, multi-AZ EC2 cluster?

Assuming:

  • Ignore the cost of storage. You get plenty bundled with EC2, and a couple of GB costs only a few cents on S3. This is not true for big media streaming, of course.
  • Bandwidth costs are the same for S3 and EC2 and can be excluded from the comparison.
  • Management costs are 1:1 with VM costs. This in turn assumes existing management infrastructure and people are in place and that this website is an incremental requirement.
  • EC2 will require two instances running, ideally one in each Availability Zone, to achieve a vaguely similar availability target to S3.
  • A home page could require circa 100 GET requests (perhaps overdoing it a little).
  • A UK-only web site may only be truly busy for 12 hours per day.
  • The cost of a “Heavy Utilisation 1 Year Reserved Instance Small Linux Instance in EU” is $45.65 per month. Two instances: $91.30 per month. Applying the 1:1 management assumption, call the total managed cost $200 per month.

You would have to make 200,000,000 GET requests per month to reach $200. That is 2,000,000 page views per month. Spread over 12 busy hours per day, that is roughly 91 page views per minute – about 1.5 per second, or around 150 GET requests per second. This is a small load shared between two web servers serving only static content. Surely within the reach of two Small Linux instances – in fact, shouldn’t they serve 10x that volume for the same price?
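The break-even arithmetic is easy to sanity-check. A minimal Python sketch, assuming the era’s S3 GET price of $0.01 per 10,000 requests (the EC2 figures are the ones quoted above):

```python
# Sanity check of the S3 vs EC2 break-even point.
S3_PRICE_PER_GET = 0.01 / 10000   # assumed 2011 rate: $0.01 per 10,000 GETs
EC2_MONTHLY_COST = 200.0          # two reserved Smalls plus 1:1 management
GETS_PER_PAGE = 100               # generous estimate from the assumptions
BUSY_MINUTES = 30.5 * 12 * 60     # 12 busy hours per day over a month

break_even_gets = EC2_MONTHLY_COST / S3_PRICE_PER_GET   # 200,000,000
break_even_pages = break_even_gets / GETS_PER_PAGE      # 2,000,000
pages_per_minute = break_even_pages / BUSY_MINUTES      # ~91

print("GETs/month at break-even: %.0f" % break_even_gets)
print("Page views/month:         %.0f" % break_even_pages)
print("Page views/minute (busy): %.1f" % pages_per_minute)
```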

Conclusion:

Because S3 sites are so incredibly simple to set up, with high availability, scalability and performance baked in, you can’t possibly justify building EC2-based web servers at low page volumes. However, the primary cost of S3 is GET requests, and there are no price/volume breaks in the pricing of GET requests. Costs scale linearly with request volume – and much faster than they would if you were to build your own EC2 fleet.

If you don’t already have a management function in place for EC2 then the level of scale needed to justify this expense would be considerably higher. The big benefit of S3 sites is in the “static web site as a service” element – i.e. a highly scalable, available, high performance, simple and managed environment.

The linear relationship between scale and cost, whilst a disadvantage in one way, can also be seen as an advantage: an S3 site scales from zero to massive and back again instantly. It can remain dormant for days and then fend off massive media focus.

However, I was surprised to see that this wasn’t as cut and dried in favour of S3 as I’d assumed or hoped.

S3 sites still reward optimising pages to make as few GET requests as possible – caching in the browser, combining JavaScript files and so on. The more you do this, the longer you can delay (perhaps indefinitely) the crossover point at which an EC2 solution becomes cheaper.


More Cloud Computing Economics

I have just come across a site dedicated to in-depth studies of the economics of cloud computing: http://cloud-computing-economics.com/

One of their authors touched on a pet subject of mine (Spot pricing) with a paper on how to deliver a robust SaaS whilst using IaaS and Spot instances to maximise revenue: http://cloud-computing-economics.com/business-benefits-applications/delivering-reliable-services-spot-instances/

@jamessaull


Doing BI to hunt for Spot Price trends: only for low margin high scale tasks

A couple of hours in the evening meddling with Python and Excel, using the AWS Spot price history as a dataset, has led me to conclude that a little data mining is probably worth doing if you are going to consume at scale, where a few cents per hour could justify the cost of the analysis. Continuously updating the data and observing trends along the dimensions of Region, Zone, day of week, hour of day and instance type could yield reasonable savings; and whilst it might not matter much now, who is to say more significant fluctuations and trends won’t emerge in the future?
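The original evening hack is long gone, but a minimal sketch of the same exercise using today’s boto3 and pandas would look something like this (the 2011 date range mirrors the post; note the live API only returns roughly the most recent 90 days of history):

```python
# Sketch: pull Spot price history and pivot by day of week / hour of day.
# Assumes AWS credentials are configured; pandas stands in for Excel.
import boto3
import pandas as pd
from datetime import datetime

ec2 = boto3.client("ec2", region_name="eu-west-1")
paginator = ec2.get_paginator("describe_spot_price_history")

rows = []
for page in paginator.paginate(
        InstanceTypes=["m2.xlarge"],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime(2011, 2, 9),
        EndTime=datetime(2011, 5, 9)):
    for p in page["SpotPriceHistory"]:
        rows.append({"ts": p["Timestamp"],
                     "zone": p["AvailabilityZone"],
                     "price": float(p["SpotPrice"])})

df = pd.DataFrame(rows)
df["ts"] = pd.to_datetime(df["ts"], utc=True)
df["weekday"] = df["ts"].dt.day_name()
df["hour"] = df["ts"].dt.hour

# Average price along the day-of-week and hour-of-day dimensions
print(df.pivot_table(values="price", index="hour",
                     columns="weekday", aggfunc="mean").round(4))
```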

My analysis didn’t reveal any gob-smacking trends – maybe AWS suppress that sort of thing by the way they operate Spots. Sure, we can see that 2AM UTC on a Tuesday in the EU generally yields a poor price, but by 10AM it is generally the best price. Overall, these variations can only be worth tracking when amplified by scale.

The chart below shows each day of the week and Spot instance pricing for m2.xlarge in the EU between 9th Feb 2011 and 9th May 2011. There is nothing immediate to suggest that AWS, for example, have huge amounts of excess capacity on Sundays (which would be evident from no plots in the top right), and nothing to suggest constrained capacity on Mondays (which would show as no plots in the bottom left). Most instance types show this pattern. The big price void in the middle is discussed in more detail in a previous post. In the grand scheme of things it looks as if AWS Spot prices flip between a low price and a high price – but pretty much always a better one than the On Demand price.

[Chart: m2.xlarge EU Spot prices plotted by day of week, 9th Feb – 9th May 2011]

Looking at the Spot prices by hour of the day also does not immediately show any long-term pattern of abundance or scarcity during the day.

[Chart: m2.xlarge EU Spot prices plotted by hour of day]

Transforming the Spot data from Amazon and feeding it into a pivot table chart, we can overlay some trend lines. It would seem that Monday in general gets more expensive as the day progresses, while Tuesday declines in price; in general, then, the better prices would be found on Monday mornings and Tuesday afternoons. In reality this may only amount to a few cents per instance-hour, so it only matters for those consuming at scale – supplementing an EMR cluster with an extra 50 Spot instances for 8 hours per day over a year is 146,000 instance-hours. Whilst this would represent thousands of dollars of savings over On Demand instances, you’d need to be working at much larger scale to justify fussing over which part of which day typically gives the best prices.

The chart below shows the effective aggregated cost per hour per day of the week over the sample period – e.g. the aggregate cost of running an instance from 9AM – 10AM every Monday. The blue trend line is Monday, and the red trend line is Tuesday.

[Chart: aggregated hourly cost by day of week with trend lines]

 

@jamessaull


Google App Engine and instant-on for hibernating applications

At the #awssummit @Werner mentioned the micro instances and how they were useful to support use cases such as simple monitoring roles or to host services that need to always be on and listen for occasional events. @simonmunro cynically suggested it was a way of squeezing life out of old infrastructure.

I would prefer applications to hibernate, release resources and stop billing me, but come back instantly on demand.

Virtual-machine-centric compute clouds (whether PaaS or IaaS oriented) exhibit this annoying “irreducible minimum” issue, whereby you have to leave some lights on. Not so for storage, cloud databases and queues – S3, SQS, SimpleDB, SES, SNS, Azure Table Storage and the like are always alive and will respond to a request irrespective of volume or frequency. They scale from 0 upwards. Compute typically scales from 1 or 2 upwards (depending on your SLA view of availability).

This is one feature I really like about Google’s App Engine. The App Engine fabric brokers incoming requests, resolves it to the serving application and launches instances of the application if none are already running. An application can be idle for weeks consuming no compute resources and then near-instantly burst into life and scale up fast and furiously before returning to 0 when everything goes quiet. This is how elasticity of compute should behave.

My own personal application on App Engine remains idle most of the time. I am unable to detect that my first request has forced App Engine to instantiate an instance of my application first. My application is simple, but a new instance of my Python application will get a memcached cache-miss for all objects and proceed to the datastore to query for my data, put this data into the cache, and then pass the view-model objects to Django for rendering. Brilliantly fast. I can picture a pool of Python EXEs idling and suddenly the fabric picks one, hands it a pointer to my code base and an active request – bam – instant-on application.
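The request path described above is easy to picture in the App Engine Python runtime of the day (webapp, memcache, the datastore and Django templates). A minimal sketch – the model, cache key and template name are illustrative, not taken from my actual application:

```python
# Sketch of the cache-miss -> datastore -> memcache -> render path.
from google.appengine.api import memcache
from google.appengine.ext import db, webapp
from google.appengine.ext.webapp import template, util

class Entry(db.Model):
    title = db.StringProperty()
    posted = db.DateTimeProperty(auto_now_add=True)

class HomePage(webapp.RequestHandler):
    def get(self):
        # On a cold start the cache is empty, so the first request falls
        # through to the datastore, then primes memcache for the next one.
        entries = memcache.get("recent_entries")
        if entries is None:
            entries = Entry.all().order("-posted").fetch(10)
            memcache.set("recent_entries", entries, time=600)
        self.response.out.write(
            template.render("home.html", {"entries": entries}))

application = webapp.WSGIApplication([("/", HomePage)])

if __name__ == "__main__":
    util.run_wsgi_app(application)
```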

For those applications that cannot give good performance from a cold start, App Engine supports the notion of “Always On” by forcing instances to stay alive with caches all loaded and ready for action: http://code.google.com/appengine/docs/adminconsole/instances.html

The screen shots below show my App Engine dashboard before my first request, how it spins up a single instance to cope with demand followed by termination after ten minutes of being idle.

Stage 1: View of the dashboard – no instances running – no activity in the past 24 hours


Stage 2: A request has created a single instance of the application and the average latency is 147ms. The page appeared PDQ in my browser.


Stage 3: 17 Requests later, and the average latency has dropped. One instance is clearly sufficient to support one user poking around.


Stage 4: I left the application alone for nearly ten minutes. My instance is still alive, but nothing happening.


Stage 5: After about ten minutes of being idle my application instance vanishes. App Engine has reclaimed the resources.


@jamessaull


Passing the private cloud duck test

Clouds need Self Service Portals (SSP). I often wonder to whom “self” refers, and I think it would help a lot if vendors clarified that when describing their products. Is it a systems administrator, a software developer, a sales manager?

I have just read the Forrester report “Market Overview: Private Cloud Solutions, Q2 2011” by James Staten and Lauren E Nelson, which is actually pretty good. They cover IaaS private cloud solutions from the likes of Eucalyptus Systems, Dell, HP, IBM, Microsoft, VMware etc. What I particularly liked is the way they interviewed the vendors and asked them to demonstrate their clouds from a user-centred perspective: “as a cloud administrator do…”, “as an engineering user perform…”, “logged in as a marketing user show…”. This moves the conversation away from rhetoric and techy details about hypervisors to the benefits realised by the consumers.

If it doesn’t look like a duck or quack like a duck, it probably isn’t a duck.

Forrester have also tried to be quite strict in narrowing down the vendors included in this report because, frankly, some things weren’t passing the duck test. They also asked vendors to supply solid enterprise customer references where the solution was being used as a private cloud, and they found: “Sadly, even some of those that stepped up to our requests failed in this last category”.

Good. Let’s get tough on cloud-washing.

@jamessaull


Qualifying and quantifying “spiky” and “bursty” workloads as candidates for Cloud

Enterprises are looking to migrate applications to the cloud, and those with thousands of applications require a fast, consistent and repeatable process to identify which ones stand to benefit. One of the benefits of cloud is how on-demand elasticity of seemingly infinite resources suits “spiky” or “bursty” workloads. But as I mentioned in a recent post, people may have very different views on what constitutes “spiky”, which could yield very unpredictable results when trying to identify those cloud candidates.

It would be useful to have a consistent and repeatable way to determine whether an application was spiky.

We normally translate spiky and bursty to “high variability”, which is a good term to use as it points to the statistical methods by which we can assess whether our utilization patterns match this description and hence would benefit from the cloud.

Consider the utilization graphs below (the numbers are displayed beneath). The units could equally be transactions per second – just don’t mix and match.

  1. “Very High” line shows a single, short lived spike.
  2. “High” shows two slightly longer lived spikes.
  3. “Mild” shows the same two spikes, less exaggerated and decaying to a non-zero utilization.
  4. “None” shows small fluctuations but no real spikes – essentially constant utilization.

Profiles 2, 3 and 4 actually have the same average utilization of c. 25%. This means that they consume the same amount of compute cycles during the day irrespective of their pattern; however, to cater for “High” we need over 3x the capacity at peak that “None” requires – leaving resources underutilized most of the time. The curse of the data centre.

Clearly the Average utilization isn’t the whole picture. We need to look at the Standard Deviation to see into the distribution of utilization:

  • High = 42.5
  • Mild = 21.2
  • None = 5.0

Excellent – so a high standard deviation is starting to separate the variable profiles from the steady ones. But what is the standard deviation of the spikiest load of all, “Very High”? Only 20.0 – much the same as the “Mild” line! The final step is to look at the Coefficient of Variation, which is the ratio of the Standard Deviation to the Mean. The Coefficient of Variation is:

  • Very High = 4.00
  • High = 1.59
  • Mild = 0.82
  • None = 0.20

@grapesfrog asked me to describe something like AWS Elastic MapReduce in these terms. Think of EMR as utilisation of {0,0,0,0,5000,0,0,0,0,0,0,0,0,0,0,0…}, where 5000% utilisation represents 50 machines at 100% utilisation. So if you used 50 machines at 100% for one hour every day your CV would be 4.8; if you used them for one hour every month it would rise to 27.2.
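These figures are easy to reproduce from the hourly profiles tabled at the end of this post. A minimal sketch using Python’s statistics module (population standard deviation, matching the figures above; the EMR series are my own reconstruction of the burst example):

```python
# Mean, standard deviation and coefficient of variation for the four
# utilisation profiles tabled below, plus the EMR-style burst examples.
from statistics import mean, pstdev

profiles = {
    "Very High": [0]*13 + [10, 100, 10] + [0]*8,
    "High": [0]*6 + [10, 100, 100, 100, 10] + [0]*6
            + [10, 100, 100, 100, 10] + [0]*2,
    "Mild": [10, 10, 10, 10, 10, 15, 20, 50, 80, 50, 20, 15,
             15, 15, 15, 15, 15, 20, 50, 80, 50, 20, 15, 10],
    "None": [30, 20] * 12,
    # 50 machines flat out for one hour per day (5000% for 1h in 24)
    "EMR daily": [5000] + [0]*23,
    # ...or for one hour per month (31 days x 24 hours)
    "EMR monthly": [5000] + [0]*(31*24 - 1),
}

for name, series in profiles.items():
    m, sd = mean(series), pstdev(series)
    print("%-11s mean=%7.1f  sd=%7.1f  cv=%5.2f" % (name, m, sd, sd / m))
```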

Conclusions:

Comparing the mean utilization allows us to compare the relative amount of resource used over a period of time. This showed that they nearly all consumed the same amount of resources with the very noticeable exception of the spikiest of them all. It actually consumed very little resource in total.

Comparing the coefficient of variation reveals the spikiest workload. In this example, the spikiest would require over 3x the resources of the least spiky to service its demand, BUT would actually consume only 20% of the resources consumed by the least! Sometimes this is the point: the spikiest loads require the largest amount of resources to be deployed but can actually consume the least.

Further work:

Our assessment could state that “any workload showing a CV > 0.5” is a candidate for cloud – revealing applications with spiky intra-day behaviour as well as the month-end classics.

Workloads that oscillate at high frequency between extremes may show a CV > 0.5, but there we begin to trespass on topics such as deadbeat control within control theory, and will start to challenge the time resolution of cloud monitoring and control. I’ll leave it there for the time being though…

[Chart: the four utilisation profiles plotted by hour of day]

Hour                        Very High    High    Mild    None
1                                   0       0      10      30
2                                   0       0      10      20
3                                   0       0      10      30
4                                   0       0      10      20
5                                   0       0      10      30
6                                   0       0      15      20
7                                   0      10      20      30
8                                   0     100      50      20
9                                   0     100      80      30
10                                  0     100      50      20
11                                  0      10      20      30
12                                  0       0      15      20
13                                  0       0      15      30
14                                 10       0      15      20
15                                100       0      15      30
16                                 10       0      15      20
17                                  0       0      15      30
18                                  0      10      20      20
19                                  0     100      50      30
20                                  0     100      80      20
21                                  0     100      50      30
22                                  0      10      20      20
23                                  0       0      15      30
24                                  0       0      10      20

Standard Deviation               20.0    42.5    21.2     5.0
Average                             5      27      26      25
Coefficient of Variation         4.00    1.59    0.82    0.20
Median                              0       0      15      25
Minimum                             0       0      10      20
Maximum                           100     100      80      30

 

@jamessaull


Issues and benefits conflated with spiky, bursty applications moving to the cloud

Fellow cloudcomments.net poster Simon Munro made some excellent follow-up comments on my recent posts about enterprise applications qualifying for the cloud. I’ve tried to mangle the offline conversation into a post:

Simon was remarking that there are some other qualities conflated with spikiness that are perhaps easily ignored:

Performance – you cannot assume that the required performance is fixed. During peak periods users may tolerate slower performance (and not even notice it). You would have to include something like Apdex to get a better sense of the spike and its impact.

I think this is a very good point and, ironically, I think it cuts both ways. Some apps just aren’t worth the extra resources needed to maintain peak performance. For other apps, however, rising demand might indicate you should make the performance even better, because this is the critical moment; the moment that really counts. For example: customer loyalty, serving really stressed users who only care about this system RIGHT NOW, or responding to a crisis.

Base Cost – particularly with commercial enterprise software you have base situations where the cost/load relationship is non-linear. Think of the cost of a base Oracle setup, with backup licences, installation costs and DBAs – the cost is fixed regardless of whether you are putting n or 3n transactions through.

Another excellent point, although hopefully this will change over time in response to cloud. Even now we have Oracle on AWS RDS / AWS DevPay, or SQL Server via SQL Azure.

Simon then introduces us to a new term:

Business Oriented Service Degradation. I did remark that this concept is sort of covered by some SLAs and SLOs, but this is way cooler ;) Simon’s point is that when an accounting system does its end-of-month run (the spike), the ability to process other transactions is irrelevant because the system is ‘closed’ for new transactions by the business anyway.

Sometimes I wonder if the constraints of the past are cast into the DNA of those operating models. This month is closed but plenty of people could be entering new transactions for the next quarter. Is it possible the resource constraints meant that this was historically a better solution?

The point remains, though: if the spike is huge and infrequent there is a mismatch between total resources deployed (cost) and the level of utilisation. That means waste.

Interestingly, if utilisation sits at 100% for extended periods there is also a case for giving the system access to more resources. Clearly, with more resources, it could complete the run far sooner. Would there be benefit in that? Enabling the month to be “closed” even later than usual, capturing more transactions? A better compliance rating by never being late, despite failed runs and reruns?

@jamessaull


Just because your application would never earn a place on highscalability.com doesn’t mean it won’t qualify for cloud.

Working with large enterprises with 1000s of applications, it is useful to assess hundreds of applications at a time to determine which ones will benefit from moving to the cloud. The first iteration, to reveal the “low hanging fruit”, requires a fairly swift “big handfuls” approach. One of the most commonly stated economic benefits of the cloud is elastically coping with “spiky” or “bursty” demand, so whilst we scan through the application portfolio, one of the things we have to spot is applications exhibiting this characteristic.

I first want to tackle one of the, ever so slightly, frustrating consequences of the original cloud use cases and case studies. You know the one – where some image-processing service goes bananas on Sunday night when everyone uploads their pictures, and an extra 4,000 nodes are scaled up for a few hours to cope with the queue. The consequence is that people now perceive only such skewed scenarios as cloud candidates along the “spiky” dimension. Hence my parting note in my recent post on Evernote, Zynga and Netflix – even normal business systems express spiky behaviour when you consider the usage cycle over a day, month or year. This is one of the reasons, despite all the virtualisation magic, that average utilisation figures in the data centre remain so low.

Some classic mitigations for this (excluding cloud) are to distribute the load more evenly across time and resources e.g.:

  • run maintenance tasks such as index rebuilds, backups, consistency checks during the quiet periods
  • build decision support data structures overnight and prepare the most common or resource-intensive reports before people turn up to work
  • make the application more asynchronous and use queues to defer processing to quieter periods or on to other resources
  • use virtualisation to consolidate VMs on to less physical hosts during quiet periods so you can power off hosts (those hosts are now idle assets returning no business value but at least they aren’t consuming power/cooling)

Despite all this, many applications continue to exhibit spiky behaviour – just not in the extreme headline sense that we see on highscalability.com.

In a cloud assessment you just want to identify this behaviour as one reason (among many) to put an application on your list of candidates for further study and business justification. In a cloud migration, some applications may be immune to the bulleted options above for a number of reasons. Anyway, those techniques will work nicely in the cloud too, of course.

@jamessaull


Cloud or not to cloud? Evernote, Zynga and Netflix

Recently I read a couple of articles from highscalability.com, one about Evernote and one about Zynga. I’d love to see their TCO models. Evernote have chosen not to use a cloud, instead opting to architect, build, operate and maintain everything themselves. Zynga have chosen to use the AWS cloud initially and then, once a game’s usage growth and pattern stabilises, move it onto their own cloud that mimics AWS. They don’t name Eucalyptus or anything.

Both these offerings are from relatively new companies and are pure web, high scale, green field applications. No legacy. So a couple of thoughts occurred to me about Evernote:

  • This is an advanced engineering team with a “not invented here” attitude. They want to piece all the excellent open source pieces together and control every last element. Perhaps they are “controlling costs” but not really thinking about Total Cost of Ownership.
  • Alternatively, they are totally dispassionate about technology but maniacally obsessed with cost. They crunched the numbers and found that they are significantly better off building, operating and maintaining all the hardware, software and people themselves, even after weighing the qualitative benefits of cloud.

With Zynga I thought:

  • Once you have a sustained high utilisation pattern for a game you shift it to on-premise fixed infrastructure that you can provision like crazy. Cool. What happens when the game goes completely out of favour at the whim of the web and declines at web-pace?

At the same time we hear plenty from Netflix about their commitment to the cloud.

I’d love to see more detail around the TCO models these companies have created around the cloud and how the variables tipped each company in favour of their different results.

As an aside, I often hear the statement “we don’t have spiky usage” as if this is the main economic driver for cloud adoption. What do people mean by that? Do they mean an extra 1000 VMs for a couple of hours? An extra Petabyte of storage for a week? It’d be useful if people started to qualify these statements more nowadays.

Even regular enterprise apps (perceived as having non-spiky usage) may only be in use for 8 hours per day, with a few intra-day surges. That means that for nearly 66% of each day systems are significantly underutilised – maybe not totally idle. Yet this is not really seen as “spiky” by most. However, if I can conservatively turn off half of my servers for half of the time, that could make a significant cost saving when amplified by the scale of a large enterprise.

@jamessaull


Don’t forget Spot Instances on AWS

For those with an attention span of 30 seconds: Spot Instances on AWS are not a curiosity. Seriously consider using Spot Instances on AWS wherever possible, even if you need the instance all the time. Yes. All the time. Carry on reading if you like or jump to the conclusions towards the end.

I previously demonstrated that if you can guarantee that you are going to use an AWS EC2 instance for more than 24% of the time (over a 3 year period) then you should absolutely be using reserved instances. This happens to be very close to 8 hours per day, 5 days per week, 52 weeks per year; highlighting that in fact On Demand instances are probably not your starting point.

This begs the question: when should you use a Spot Instance? Let’s have a go at answering it – if not conclusively, then at least by providing more information to help make that decision.

The first question to ask is: given a certain bid price, then what percentage of the time can I actually get an instance at that price? This will help answer questions like: how often is the spot price actually less than the On Demand and Reserved Instance Price? If the answer is “hardly ever” then we can probably forget about them. It seems the answer is better than that.

Pulling down the spot price history for the main instance types in the EU region and then calculating the percentage of time spent below a specific price reveals a common pattern:


Figure 1 Spot bid price and percentage yield (availability)
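The calculation behind this chart is worth spelling out: spot history is a step function, so the time spent at or below a bid is the duration-weighted share of the observations. A minimal pandas sketch, assuming a history DataFrame like the one built in the earlier post’s sketch (columns ts and price):

```python
# Sketch: given a spot price history, what fraction of the time was the
# price at or below a given bid? Weight each observed price by how long
# it remained in force.
import pandas as pd

def availability(history, bid):
    """history: DataFrame with tz-aware 'ts' and float 'price' columns."""
    h = history.sort_values("ts").reset_index(drop=True)
    # Duration each observed price stayed current (last row has no span).
    duration = h["ts"].shift(-1) - h["ts"]
    h = h.iloc[:-1].assign(duration=duration.iloc[:-1])
    below = h.loc[h["price"] <= bid, "duration"].sum()
    return below / h["duration"].sum()

# e.g. using the m1.large history pulled as in the earlier sketch:
# print(availability(df, bid=0.20))
```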

It would seem that there is a discontinuity in the relationship at about 66%, indicating, for example, that if your need for compute is less than 66% of the time then you can get a good price for it. The question is how this compares with Reserved and On Demand instances. Let’s zoom in on the m1.large instance in the EU:


Figure 2 The cost of EU m1.large instances at different levels of utilisation

Be clear how to read this chart – it answers questions such as: if you were to run an instance for 50% of the time, what would the Reserved Instance price be, and what would you need to bid for Spot Instances to guarantee that level of utilisation?

The above chart shows, again, how 3 Year Reserved Instances become more economical than On Demand at >24% utilisation and 1 yr Reserved Instances at >47%. It also shows the asymptotic line of Reserved Instance pricing as utilisation drops towards 0% – i.e. they rapidly become several times more expensive than On Demand instances below 10% utilisation. Not surprising, because at 0% utilisation you have paid the upfront fee for no compute time.
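Those break-even points are straightforward to reproduce. A sketch using the EU m1.large figures quoted in this post ($0.38 On Demand, $0.16 reserved hourly, $1,400 upfront for 3 years); the $910 one-year upfront fee is my assumption, not a figure from the post:

```python
# Where does a Reserved Instance overtake On Demand? Effective hourly
# price at utilisation u is upfront/(term_hours * u) + hourly.
ON_DEMAND = 0.38  # EU m1.large On Demand $/hour, as quoted

def breakeven_utilisation(upfront, hourly, term_years):
    # Solve upfront/(term_years * 8760 * u) + hourly == ON_DEMAND for u.
    return upfront / (term_years * 8760 * (ON_DEMAND - hourly))

print("3yr RI: %.0f%%" % (100 * breakeven_utilisation(1400, 0.16, 3)))  # ~24%
print("1yr RI: %.0f%%" % (100 * breakeven_utilisation(910, 0.16, 1)))   # ~47%
```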

Let’s zoom in and see where the lines intersect:


Figure 3 Cost of EU m1.large instances at different levels of utilization – zoomed in on intersections

Interestingly, if we were looking for 92% utilisation we could set the Spot bid as high as the 1yr Reserved Instance price. Importantly, remember that the bid price is the maximum you would pay; in reality, much of the time the price paid would be far lower. Clearly, setting the Spot bid a smidgen higher than the On Demand price gives 100% utilisation – with the reality that you pay a lot less almost all the time!

According to this data, if you set the Spot ceiling at $0.40 (cf. $0.38 for On Demand) you would be given an instance 100% of the time, yet:

  • 0.013% of the time you would pay $0.02 more than the On Demand price – just over one hour per year. To all intents and purposes the Spot Instance will always cost less than On Demand.
  • In fact, 66% of the time you would pay less than half the On Demand price and the remaining 34% of the time you would be paying about 70% of the On Demand price. This translates to approximately $0.20 overall – half price.
  • Magically, running a Spot Instance 100% of the time costs about the same price as a 3yr Reserved Instance running 100% of the time…

So the next question is: if I do set the Spot ceiling so high that it runs for 100% of the time, and taking into account all the price fluctuations, what is the effective price paid and how does that compare with running On Demand or Reserved for 100%?

image

Figure 4 Running all instance types for 100% of the time for each of the pricing options (taking into account Spot fluctuations)

The above chart shows that even running Spot Instances 100% of the time (by setting the bid price sufficiently high) actually works out cheaper than 3 yr Reserved Instances running at 100%.
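That blended figure is simple to derive from the same price history: weight each observed price by how long it was in force. A short sketch, again assuming the history DataFrame from the earlier sketches:

```python
# Sketch: effective hourly cost of running on Spot "100% of the time"
# with a bid ceiling just above On Demand. You pay the prevailing spot
# price, not the bid, so the blended rate is the time-weighted mean.
import pandas as pd

def blended_hourly_cost(history):
    h = history.sort_values("ts").reset_index(drop=True)
    seconds = (h["ts"].shift(-1) - h["ts"]).dt.total_seconds()
    h = h.iloc[:-1].assign(seconds=seconds.iloc[:-1])
    return (h["price"] * h["seconds"]).sum() / h["seconds"].sum()

# e.g. print(blended_hourly_cost(df))  # roughly $0.20 for EU m1.large
```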

Conclusions and generalisations:

If you can deploy your application to run within Spot Instances, you can set the Spot price to the max Spot Price and run even more cheaply than the effective hourly rate of a 3yr Reserved Instance – without the need to venture anything upfront.

If you are really trying to push the cost-of-execution right down, then setting the Spot bid price to half the On Demand price will return a 66% yield. This will be 25% cheaper than 3yr Reserved Instances at 66% utilisation.

The gradient of the Spot Instance pricing lines shows that there is actually little incentive to set the bid price below the 66% utilisation point: availability drops dramatically with little reward in price. In other words, you become significantly less likely to be given an instance, and much more likely to have it terminated abruptly, for very little gain in hourly price.

Remember, for EU m1.large the Reserved Instance price list states $0.16, but when you take into account the $1,400 up front, the effective hourly price never drops below $0.21 (at 100% utilisation). It is often stated that the Spot price never really drops below the Reserved Instance price – this is not true, as we can see that it is less 66% of the time.

Cautions:

This data is only based on 88 days of spot price history in the EU region.

@jamessaull

