James Saull

Homepage: http://jamessaull.wordpress.com

Simple Static Web Hosting: EC2 vs. S3

The question: should you use S3 to host a high-volume static web site, or should you configure and operate a load-balanced, auto-scaling, multi-AZ EC2 cluster?

Assuming:

  • Ignore the cost of storage. EC2 instances come with plenty bundled, and a couple of GB on S3 costs only a few cents. This does not hold for big media streaming, of course.
  • Bandwidth costs are the same for S3 and EC2 and can be excluded from the comparison.
  • Management costs are 1:1 with VM costs. This in turn assumes existing management infrastructure and people are in place and that this web site is an incremental requirement.
  • EC2 will require two instances running, ideally one in each Availability Zone, to achieve a vaguely similar availability target to S3.
  • A home page could require circa 100 GET requests (perhaps overdoing it a little).
  • A UK-only web site may only be truly busy for 12 hours per day.
  • The cost of a “Heavy Utilisation 1 Year Reserved Instance, Small, Linux, EU” is $45.65 per month. Two instances: $91.30 per month. Total managed cost: roughly $200 per month.

You would have to make 200,000,000 GET requests per month to reach $200. That is 2,000,000 page views per month. Spread over 12 busy hours per day, that is roughly 90 page views per minute – only about 1.5 per second, or around 150 GET requests per second. This is a small load shared between two web servers serving only static content. Surely within the reach of two small Linux instances – in fact, shouldn’t they serve 10x that volume for the same price?
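
As a sanity check, here is that arithmetic as a minimal Python sketch. The GET price is simply the one implied by the figures above ($200 for 200 million requests, i.e. $1 per million); the other numbers are the assumptions listed earlier.

```python
GET_PRICE_PER_MILLION = 1.00  # USD; implied by $200 for 200M GETs above
GETS_PER_PAGE = 100           # assumed GETs to render the home page
EC2_MONTHLY_COST = 200.00     # two reserved Small instances plus management

break_even_gets = EC2_MONTHLY_COST / GET_PRICE_PER_MILLION * 1e6
page_views = break_even_gets / GETS_PER_PAGE
busy_minutes = 30 * 12 * 60   # 12 busy hours per day over a 30-day month

print("%.0f GETs -> %.0f page views per month" % (break_even_gets, page_views))
print("%.1f page views per busy minute" % (page_views / busy_minutes))
```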

Conclusion:

Because S3 sites are so incredibly simple to set up, with high availability, scalability and performance baked in, you can’t possibly justify building EC2-based web servers at low page volumes. However, the primary cost of S3 is GET requests, and there are no price/volume breaks in GET pricing: the costs scale linearly with the volume of requests, and much faster than they would if you built your own EC2 fleet.

If you don’t already have a management function in place for EC2, then the level of scale needed to justify that expense would be considerably higher. The big benefit of S3 sites is the “static web site as a service” element – i.e. a highly scalable, available, high-performance, simple and managed environment.

The linear relationship between scale and cost, whilst a disadvantage in one way, could be seen as an advantage: an S3 site scales from zero to massive and back again instantly. It can remain dormant for days and then fend off massive media focus.

However, I was surprised to see that this wasn’t as cut and dried in favour of S3 as I’d assumed or hoped.

S3 sites still reward optimising pages to make as few GET requests as possible – caching in the browser, combining JavaScript files and so on. The more you can do this, the longer you can delay (perhaps indefinitely) the crossover point at which replacing S3 with an EC2 solution becomes cheaper.
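
To illustrate the “incredibly simple to set up” point, here is roughly what standing up an S3 site looks like, sketched with today’s boto3 (the equivalent calls lived in boto when this was written); the bucket and file names are hypothetical.

```python
import boto3

s3 = boto3.client('s3', region_name='eu-west-1')
bucket = 'www.example.com'  # hypothetical bucket named after the site

# Create the bucket, switch on static website hosting, upload the home page.
s3.create_bucket(Bucket=bucket,
                 CreateBucketConfiguration={'LocationConstraint': 'eu-west-1'})
s3.put_bucket_website(Bucket=bucket, WebsiteConfiguration={
    'IndexDocument': {'Suffix': 'index.html'},
    'ErrorDocument': {'Key': '404.html'},
})
s3.upload_file('index.html', bucket, 'index.html',
               ExtraArgs={'ContentType': 'text/html', 'ACL': 'public-read'})
```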

More Cloud Computing Economics

I have just come across a site dedicated to in-depth studies of the economics of cloud computing: http://cloud-computing-economics.com/

One of their authors touched on a pet subject of mine (Spot pricing) with a paper on how to deliver a robust SaaS whilst using IaaS and Spots to maximise revenue: http://cloud-computing-economics.com/business-benefits-applications/delivering-reliable-services-spot-instances/

@jamessaull

Doing BI to hunt for Spot price trends: only for low-margin, high-scale tasks

A couple of hours in the evening meddling with Python and Excel, using the AWS Spot price history as a dataset, has led me to conclude that a little data mining is probably only worth doing if you are going to consume at scale, where a few cents per hour could justify the cost of the analysis. Continuously updating the data and observing trends along all the dimensions – Region, Zone, day of week, hour of day, instance type – could yield reasonable savings; and whilst it might not matter much now, who is to say more significant fluctuations and trends won’t emerge in the future?

My analysis didn’t reveal any gobsmacking trends – maybe AWS suppress that sort of thing by the way they operate Spots. Sure, we can see that 2AM UTC on a Tuesday in the EU generally yields a poor price, but that by 10AM it is generally the best price. Overall, these variations are only worth tracking when amplified by scale.
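
For the curious, the slicing itself is straightforward. Below is a minimal sketch of the kind of aggregation I ran, assuming the price history has been exported to a CSV; the file name and the 'Timestamp'/'SpotPrice' column names are illustrative, not an AWS format.

```python
import csv
from collections import defaultdict
from datetime import datetime

# Average Spot price per (day of week, hour of day) bucket.
buckets = defaultdict(list)
with open('spot_history_m2xlarge_eu.csv') as f:
    for row in csv.DictReader(f):
        ts = datetime.strptime(row['Timestamp'], '%Y-%m-%dT%H:%M:%S')
        buckets[(ts.weekday(), ts.hour)].append(float(row['SpotPrice']))

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for (weekday, hour), prices in sorted(buckets.items()):
    print('%s %02d:00  avg $%.4f' % (days[weekday], hour,
                                     sum(prices) / len(prices)))
```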

The chart below shows Spot pricing for m2.xlarge in the EU by day of the week between 9th Feb 2011 and 9th May 2011. There is nothing to immediately suggest that AWS, for example, have huge amounts of excess capacity on Sundays (which would show as no plots in the top right), and nothing to suggest constrained capacity on Mondays (which would show as no plots in the bottom left). Most instance types show this pattern. The big price void in the middle is discussed in more detail in a previous post. In the grand scheme of things, it looks as if AWS Spot prices flip between a low price and a high price – but pretty much always remain better than the On Demand price.

[Chart: m2.xlarge EU Spot prices plotted by day of week, 9 Feb – 9 May 2011]

Looking at the Spot prices by hour of the day also does not immediately show any long-term pattern of abundance or scarcity during the day.

[Chart: m2.xlarge EU Spot prices plotted by hour of day]

Transforming the Spot data from Amazon and feeding it into a pivot table chart, we can overlay some trend lines. It would seem that Monday in general gets more expensive as the day progresses, while Tuesday declines in price. In general, then, the better prices would be found on Monday mornings and Tuesday afternoons. In reality this may only amount to a few cents per instance-hour, so it only matters for those consuming at scale – supplementing an EMR cluster with an extra 50 Spot instances for 8 hours per day over a year is 146,000 instance-hours. Whilst this would represent thousands of dollars of savings over On Demand instances, you’d need to be working at much larger scale to justify fussing over which part of which day typically gives the best prices.

The chart below shows the effective aggregated cost per hour per day of the week over the sample period – e.g. the aggregate cost of running an instance from 9AM – 10AM every Monday. The blue trend line is Monday, and the red trend line is Tuesday.

[Chart: aggregated cost per hour per day of week over the sample period, with Monday (blue) and Tuesday (red) trend lines]

@jamessaull

Google App Engine and instant-on for hibernating applications

At the #awssummit @Werner mentioned the micro instances and how they are useful for use cases such as simple monitoring roles, or hosting services that need to be always on, listening for occasional events. @simonmunro cynically suggested it was a way of squeezing life out of old infrastructure.

I would prefer applications to hibernate, release their resources and stop billing me, but come back instantly on demand.

Virtual-machine-centric compute clouds (whether PaaS or IaaS oriented) exhibit this annoying “irreducible minimum” issue, whereby you have to leave some lights on. The issue does not apply to storage, cloud databases or queues such as S3, SQS, SimpleDB, SES, SNS, Azure Table Storage and the like – they are always alive, will respond to a request irrespective of volume or frequency, and scale from 0 upwards. Not so for compute, which typically scales from 1 or 2 upwards (depending on your SLA view of availability).

This is one feature I really like about Google’s App Engine. The App Engine fabric brokers incoming requests, resolves them to the serving application, and launches instances of the application if none are already running. An application can be idle for weeks consuming no compute resources, then near-instantly burst into life, scale up fast and furiously, and return to 0 when everything goes quiet. This is how elasticity of compute should behave.

My own personal application on App Engine remains idle most of the time, yet I am unable to detect that my first request has forced App Engine to instantiate an instance of my application first. My application is simple, but a new instance of my Python application will get a memcache cache-miss for all objects, proceed to the datastore to query for my data, put that data into the cache, and then pass the view-model objects to Django for rendering. Brilliantly fast. I can picture a pool of Python EXEs idling, and suddenly the fabric picks one, hands it a pointer to my code base and an active request – bam – instant-on application.
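
That request path is the classic cache-aside pattern. A minimal sketch against the Python runtime of the day is below; the Post model, cache key and template name are hypothetical stand-ins for my actual app.

```python
from google.appengine.api import memcache
from google.appengine.ext import db, webapp
from google.appengine.ext.webapp import template, util

class Post(db.Model):
    title = db.StringProperty()
    published = db.DateTimeProperty()

class HomePage(webapp.RequestHandler):
    def get(self):
        posts = memcache.get('home_posts')
        if posts is None:
            # Cold instance: cache miss, so query the datastore and
            # prime the cache for the next request.
            posts = Post.all().order('-published').fetch(10)
            memcache.set('home_posts', posts, time=600)
        # webapp's template module renders via Django templates.
        self.response.out.write(template.render('home.html', {'posts': posts}))

application = webapp.WSGIApplication([('/', HomePage)])

if __name__ == '__main__':
    util.run_wsgi_app(application)
```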

For those applications that cannot give good performance from a cold start, App Engine supports the notion of “Always On” by forcing instances to stay alive with caches all loaded and ready for action: http://code.google.com/appengine/docs/adminconsole/instances.html

The screenshots below show my App Engine dashboard before my first request, how it spins up a single instance to cope with demand, and the termination of that instance after ten minutes of idleness.

Stage 1: View of the dashboard – no instances running – no activity in the past 24 hours

[Screenshot: NoActivity]

Stage 2: A request has created a single instance of the application and the average latency is 147ms. The page appeared PDQ in my browser.

[Screenshot: SomeActivity]

Stage 3: 17 requests later, and the average latency has dropped. One instance is clearly sufficient to support one user poking around.

[Screenshot: OnlyOneInstance]

Stage 4: I left the application alone for nearly ten minutes. My instance is still alive, but nothing happening.

[Screenshot: IdleButStillInstances]

Stage 5: After about ten minutes of being idle my application instance vanishes. App Engine has reclaimed the resources.

[Screenshot: IdleFor10minsAndNoInstances]

@jamessaull

Passing the private cloud duck test

Clouds need Self Service Portals (SSPs). I often wonder to whom “self” refers, and I think it would help a lot if people clarified that when describing their products. Is it a systems administrator, a software developer, a sales manager?

I have just read the Forrester report “Market Overview: Private Cloud Solutions, Q2 2011” by James Staten and Lauren E Nelson, which is actually pretty good. It covers IaaS private cloud solutions from the likes of Eucalyptus Systems, Dell, HP, IBM, Microsoft, VMware etc. What I particularly liked is the way they interviewed the vendors and asked them to demonstrate their clouds from a user-centred perspective: “as a cloud administrator, do…”, “as an engineering user, perform…”, “logged in as a marketing user, show…”. This moves the conversation away from rhetoric and techy details about hypervisors to the benefits realised by the consumers.

If it doesn’t look like a duck or quack like a duck, it probably isn’t a duck.

Forrester have also tried to be quite strict in narrowing down the vendors included in this report because, frankly, some things weren’t passing the duck test. They also asked vendors to supply solid enterprise customer references where their solution was being used as a private cloud, and they found: “Sadly, even some of those that stepped up to our requests failed in this last category”.

Good. Let’s get tough on cloud-washing.

@jamessaull

Qualifying and quantifying “spiky” and “bursty” workloads as candidates for Cloud

Enterprises are looking to migrate applications to the cloud, and enterprises with thousands of applications require a fast, consistent and repeatable process to identify which of them stand to benefit. One of the benefits of cloud is that the on-demand elasticity of seemingly infinite resources suits “spiky” or “bursty” workloads. But as I mentioned in a recent post, people may have very different views on what constitutes “spiky”, and this could yield very unpredictable results when trying to identify those cloud candidates.

It would be useful to have a consistent and repeatable way to determine whether an application was spiky.

We normally translate spiky and bursty to “high variability”, which is a useful term because it points to the statistical methods by which we can assess whether our utilisation patterns match the description and hence benefit from the cloud.

Consider the utilisation graphs below (the numbers are displayed beneath). The figures could equally be transactions per second – just don’t mix and match.

  1. “Very High” shows a single, short-lived spike.
  2. “High” shows two slightly longer-lived spikes.
  3. “Mild” shows the same two spikes, less exaggerated and decaying to a non-zero utilisation.
  4. “None” shows small fluctuations but no real spikes – essentially constant utilisation.

Lines 2, 3 and 4 actually have the same average utilisation of c. 25%. This means they consume the same number of compute cycles during the day, irrespective of their pattern; however, to cater for “High” we need roughly 3x the capacity of “None” to service the peak – leaving resources underutilised most of the time. The curse of the data centre.

Clearly the average utilisation isn’t the whole picture. We need to look at the standard deviation to see into the distribution of utilisation:

  • High = 42.5
  • Mild = 21.2
  • None = 5.0

Excellent – so a high standard deviation is starting to show which loads are variable and which are not. But what is the standard deviation of the spikiest load of all, “Very High”? Only 20.0 – much the same as the “Mild” line! The final step is to look at the coefficient of variation, which is the ratio of the standard deviation to the mean. The coefficients of variation are:

  • Very High = 4.00
  • High = 1.59
  • Mild = 0.82
  • None = 0.20

@grapesfrog asked me to describe something like AWS Elastic MapReduce in these terms. Think of EMR as a utilisation series of {0,0,0,0,5000,0,0,0,0,0,0,0,0,0,0,0…} where 5000% utilisation is 50 machines at 100% utilisation each. So if you used 50 machines at 100% for one hour every day, your CV would be 4.8. If you used 50 machines for one hour every month, your CV would rise to 27.2.
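
For those who want to reproduce the figures, here is a small sketch. It assumes the population standard deviation (which is what matches the numbers above) and takes the “Very High” series from the table at the end of this post.

```python
import statistics

def cv(series):
    """Coefficient of variation: population std dev divided by the mean."""
    return statistics.pstdev(series) / statistics.mean(series)

# "Very High" series from the table below: a single spike at hours 14-16.
very_high = [0] * 13 + [10, 100, 10] + [0] * 8
print(round(cv(very_high), 2))   # -> 4.0

# EMR-style burst: 50 machines at 100% (i.e. 5000%) for one hour each day...
emr_daily = [5000] + [0] * 23
print(round(cv(emr_daily), 2))   # -> 4.8

# ...and for one hour each month (31 days): 27.26, the ~27.2 quoted above.
emr_monthly = [5000] + [0] * (31 * 24 - 1)
print(round(cv(emr_monthly), 2))
```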

Conclusions:

Comparing the mean utilisation allows us to compare the relative amount of resource used over a period of time. It showed that nearly all of the workloads consumed the same amount of resources, with the very noticeable exception of the spikiest of them all, which actually consumed very little resource in total.

Comparing the coefficient of variation reveals the spikiest workload. In this example, the spikiest would require roughly 3x the resources of the least spiky to service its peak BUT would only actually consume 20% of the resources consumed by the least spiky! Sometimes this is the point: the spikiest loads require the largest amount of resources to be deployed yet can actually consume the least.

Further work:

Our assessment could state that any workload showing a CV > 0.5 is a candidate for cloud – revealing applications with spiky intra-day behaviour as well as the month-end classics.

Workloads that oscillate between extremes at high frequency may also show a CV > 0.5, but there we begin to trespass on topics such as dead-beat control within control theory, and to challenge the time resolution of cloud monitoring and control. I’ll leave it there for the time being though…
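
As a sketch of what that screening pass might look like (the workload series here are made up):

```python
import statistics

def cv(series):
    return statistics.pstdev(series) / statistics.mean(series)

# Hypothetical hourly utilisation series for two applications.
workloads = {
    'month-end-batch': [0] * 23 + [90],  # one large spike per day
    'intranet':        [30, 20] * 12,    # steady background load
}

# Flag the spiky candidates using the CV > 0.5 rule suggested above.
candidates = [name for name, series in workloads.items() if cv(series) > 0.5]
print(candidates)  # -> ['month-end-batch']
```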

[Chart: hourly utilisation profiles for the Very High, High, Mild and None workloads – data in the table below]

Hour                      Very High   High   Mild   None
1                         0           0      10     30
2                         0           0      10     20
3                         0           0      10     30
4                         0           0      10     20
5                         0           0      10     30
6                         0           0      15     20
7                         0           10     20     30
8                         0           100    50     20
9                         0           100    80     30
10                        0           100    50     20
11                        0           10     20     30
12                        0           0      15     20
13                        0           0      15     30
14                        10          0      15     20
15                        100         0      15     30
16                        10          0      15     20
17                        0           0      15     30
18                        0           10     20     20
19                        0           100    50     30
20                        0           100    80     20
21                        0           100    50     30
22                        0           10     20     20
23                        0           0      15     30
24                        0           0      10     20
Standard Deviation        20.0        42.5   21.2   5.0
Average                   5           27     26     25
Coefficient of Variation  4.00        1.59   0.82   0.20
Median                    0           0      15     25
Minimum                   0           0      10     20
Maximum                   100         100    80     30

@jamessaull

Issues and benefits conflated with spiky, bursty applications moving to the cloud

Fellow cloudcomments.net poster Simon Munro made some excellent follow-up comments to my recent posts about enterprise applications qualifying for the cloud. I’ve tried to mangle the offline conversation into a post:

Simon remarked that there are some other qualities conflated with spikiness that are perhaps too easily ignored:

Performance – you cannot assume that the required performance is fixed. During peak periods users may tolerate slower performance (and not even notice it). You would have to include something like Apdex in order to get a better sense of the impact of the spike.

I think this is a very good point and, ironically, I think it cuts both ways. Some apps just aren’t worth the extra resources needed to maintain peak performance. For other apps, however, a rise in demand might indicate that you should make the performance even better, because this is the critical moment; the moment that really counts. For example: customer loyalty, serving really stressed users who only care about this system RIGHT NOW, or responding to a crisis.

Base cost – particularly with commercial enterprise software, there are base situations where the cost and load relationship is non-linear. Think of the cost of a base Oracle setup, with backup licences, installation costs and DBAs – where the cost is fixed regardless of whether you are putting n or 3n transactions through.

Another excellent point, although hopefully this will change over time in response to the cloud. Even now we have Oracle on AWS RDS / AWS DevPay, or SQL Server via SQL Azure.

Simon then introduces us to a new term:

Business-Oriented Service Degradation. I did remark that this concept is sort of covered by some SLAs and SLOs, but this is way cooler ;) Simon’s point is that when an accounting system does its end-of-month run (the spike), the ability to process other transactions is irrelevant because the system is ‘closed’ for new transactions by the business anyway.

Sometimes I wonder if the constraints of the past are cast into the DNA of those operating models. This month may be closed, but plenty of people could be entering new transactions for the next quarter. Is it possible that resource constraints made this the better solution historically?

The point remains, though: if the spike is huge and infrequent, there is a mismatch between the total resources deployed (cost) and the level of utilisation. That means waste.

Interestingly, if utilisation sits at 100% for extended periods, there is also a case for giving the system access to more resources. With more resources, it could complete the run far sooner. Would there be benefit in that? Enabling the business to ‘close’ the month even later than usual and capture more transactions? A better compliance rating, by never being late despite failed runs and reruns?

@jamessaull
