Amazon And Availability | Force of Good

Today Amazon reported earnings and they were pretty outstanding (I do not own any AMZN and have not since I unloaded what I purchased as "a friend" on their IPO). Profits doubled. Sales up 41%. Not a bad quarter.

Except it reminded me of what happened this past Sunday.

Amazon’s S3 online storage service experienced significant downtime. It was unavailable for about 6 hours (it was also down for about 3 hours back in February). That is a lot, as I will explain shortly. For those who do not know, S3 is a distributed storage cloud that many applications, such as Twitter and SmugMug, use. Amazon charges $.15 per GB per month for storage, $.10 per GB of data transferred in, and a sliding scale starting at $.17 per GB of data transferred out (outbound costs more because Amazon has less slack capacity in that direction, since the traffic www.amazon.com generates is mostly outbound).

To provide a little background here, it is a little-known fact that for a period of time I ran MindSpring’s product development team. The developers all reported up to me. The short story is that we were growing so fast we started having issues pushing products out the door. We went away for a few days to something that became known as the "technology summit", and at the end of that summit the president of the company gave me the dev team because he wanted one person responsible. We fixed the issues. It was during this time that I became intimately knowledgeable about things such as reliability and availability.

Availability is simply the percentage of time a service is available (scheduled maintenance downtime excluded). It is typically measured by the number of "nines" delivered. For example, the plain old telephone service that you get from AT&T has an availability of five nines, or 99.999% (this is sometimes referred to as telco grade). Back in my ISP days we did not shoot for five nines for core services such as mail and web; it would have cost too much to do so. Our objective was four nines, or 99.99%. Removing one little nine may not seem like much, but it makes a big difference in the amount of downtime a service has, as you can see in the table below.

Uptime (%)	Downtime/Year
99%	87.6 hours (3.65 days)
99.9%	8.76 hours
99.99%	52.56 minutes
99.999%	5.256 minutes
99.9999%	31.536 seconds
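
For anyone who wants to check the table, the arithmetic is just the unavailable fraction multiplied by the hours in a year. Here is a minimal sketch in Python, assuming a 365-day year (the function name is mine, purely for illustration):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def downtime_hours_per_year(availability_pct: float) -> float:
    """Return the hours of downtime per year implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
    hours = downtime_hours_per_year(pct)
    # Show the same figure in hours, minutes, and seconds for easy comparison.
    print(f"{pct}%: {hours:.4f} hours = {hours * 60:.3f} minutes = {hours * 3600:.3f} seconds")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the jump from four nines to five gets expensive so quickly.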

99.99% still seems an acceptable level for a mature Internet service. A startup can get away with two nines, an emerging company three. So how is Amazon doing? For S3 it is pretty straightforward. S3 year to date is less than 99.9% available, which to me is not acceptable for a paid service. This got me wondering how Amazon’s own site fared, so I did a little investigating. To do so I used information from a neat little company called Pingdom that issues reports on such things from time to time.

The only report on amazon.com was from last April, but through that point of the year the site had been down only 21 minutes, which on an annual basis equates to availability of just under 99.99%. I find the disparity very interesting.
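
The two comparisons above can be roughed out as follows. This is only a sketch: the downtime figures come from this post, but the measurement windows are my assumptions (a full 365-day year as the most generous case for S3, and roughly 120 days, January through April, for amazon.com), and the function name is made up for illustration.

```python
def availability_pct(downtime_minutes: float, period_days: float) -> float:
    """Return availability (%) given total downtime over a measurement period."""
    period_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / period_minutes)


# S3: about 6 hours on Sunday plus about 3 hours in February.
s3_downtime_minutes = (6 + 3) * 60
# ASSUMPTION: spread over a full 365-day year, the most generous case for S3.
s3_best_case = availability_pct(s3_downtime_minutes, period_days=365)

# amazon.com: 21 minutes of downtime through April.
# ASSUMPTION: the period is January 1 to the end of April, roughly 120 days.
amazon_site = availability_pct(21, period_days=120)

print(f"S3, best case over a full year: {s3_best_case:.3f}%")
print(f"amazon.com, through April:      {amazon_site:.3f}%")
```

Even if S3 has no further outages this year, about 9 hours of downtime already exceeds the 8.76 hours that three nines allows.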

Now granted, most newer web apps are not going to deliver four nines, and this Pingdom report shows that. The 17 social networking apps that they reported on had an average availability of 99.7%.

Here’s the point. Back in those days of providing Internet services we had "14 Deadly Sins". One of them was "Rely on outside vendors who let us down". All the cuteness of the Twitter fail whale meme aside (which I believe is making some companies and individuals behave as if excessive downtime is acceptable), if you are going to build a healthy web or SaaS application business, the service needs to be available. If you rely on S3 alone to deliver that availability in a production environment, you will fail. You need an effective failover plan for when S3 goes down.
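
To make "failover plan" a bit more concrete, here is a minimal sketch of the idea in Python: try the primary store first and fall back to a second copy kept somewhere else. Everything in it is hypothetical. `primary_store` and `backup_store` are stand-ins I made up, not real S3 calls, and a real plan also has to cover writes, keeping the copies in sync, and failing back.

```python
from typing import Callable, Sequence


class AllStoresFailed(Exception):
    """Raised when no configured store could serve the request."""


def fetch_with_failover(key: str, stores: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each store in order and return the first successful result."""
    errors = []
    for fetch in stores:
        try:
            return fetch(key)
        except Exception as exc:  # real code should catch narrower error types
            errors.append(exc)
    raise AllStoresFailed(f"no store could serve {key!r}: {errors}")


# Hypothetical stand-ins for illustration only (not a real storage API).
def primary_store(key: str) -> bytes:
    raise ConnectionError("primary storage unavailable")


def backup_store(key: str) -> bytes:
    return b"contents of " + key.encode()


data = fetch_with_failover("photo.jpg", [primary_store, backup_store])
print(data)
```

The ordering of the list is the whole design here: the application keeps asking the outside vendor first, but an outage on their side turns into a slower request rather than a down service.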