
Why S3 failed

February 16, 2008

Late last night, Amazon issued a statement explaining the cause of the problem that hobbled its S3 storage system yesterday morning. It was not a hardware failure. Rather, the service's authentication system, which verifies the identity of a user, became overloaded with user requests. As one person explained to me, it amounted to a kind of accidental DDoS (distributed denial of service) attack, and Amazon didn't have enough capacity in place in one of its data centers to handle the surge.

Here's Amazon's explanation:

Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types.

Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST. By 6:48am PST, we had moved enough capacity online to resolve the issue.
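The gap Amazon describes, where total request volume looks normal while the mix shifts toward costlier authenticated calls, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the relative cost figures, the thresholds, and the names `Interval` and `check` are hypothetical and say nothing about how Amazon actually monitors S3.

```python
from dataclasses import dataclass

# Hypothetical relative costs. Amazon says only that authenticated
# requests consume more resources per call, not how much more.
COST_ANONYMOUS = 1.0
COST_AUTHENTICATED = 5.0


@dataclass
class Interval:
    """Request counts observed in one monitoring interval."""
    total_requests: int
    authenticated_requests: int

    @property
    def authenticated_share(self) -> float:
        # Proportion of requests that were authenticated.
        if self.total_requests == 0:
            return 0.0
        return self.authenticated_requests / self.total_requests

    @property
    def weighted_load(self) -> float:
        # Load estimate that charges each request type its assumed cost.
        anonymous = self.total_requests - self.authenticated_requests
        return (anonymous * COST_ANONYMOUS
                + self.authenticated_requests * COST_AUTHENTICATED)


def check(interval, max_total, max_share, capacity):
    """Return the alerts raised for one interval (illustrative thresholds)."""
    alerts = []
    if interval.total_requests > max_total:
        alerts.append("total volume above normal range")
    if interval.authenticated_share > max_share:
        alerts.append("authenticated share above threshold")
    if interval.weighted_load > capacity:
        alerts.append("weighted load exceeds capacity")
    return alerts


# Same total volume in both intervals; only the request mix changes.
normal = Interval(total_requests=1000, authenticated_requests=100)
surge = Interval(total_requests=1000, authenticated_requests=600)

for name, interval in (("normal", normal), ("surge", surge)):
    print(name, check(interval, max_total=1200, max_share=0.3, capacity=3000))
```

With these made-up numbers, a monitor that watches only total volume raises nothing in either interval, while the share and weighted-load checks both fire during the surge.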

Amazon promises quick action to ensure the problem doesn't happen again and that users are supplied with better information on system status:

As we said earlier today, though we're proud of our uptime track record over the past two years with this service, any amount of downtime is unacceptable. As part of the post mortem for this event, we have identified a set of short-term actions as well as longer term improvements. We are taking immediate action on the following: (a) improving our monitoring of the proportion of authenticated requests; (b) further increasing our authentication service capacity; and (c) adding additional defensive measures around the authenticated calls. Additionally, we’ve begun work on a service health dashboard, and expect to release that shortly.

All in all, I think Amazon has handled this outage well. The problem revealed some flaws in Amazon's otherwise highly reliable system, including shortcomings in its communications with users, and the company will make important improvements as a result, to the benefit of all its customers. These kinds of small but embarrassing failures - the kind that get you asking WTF? - can be blessings in disguise.

UPDATE: This post originally contained an excerpt from an internal Amazon email, with the subject line "WTF," which traced the source of the outage to a barrage of authentication requests generated by "a single EC2 customer." (EC2 is Amazon's online computing service.) I decided to delete the email because an Amazon spokesman subsequently informed me that it appeared to have been written before a full analysis was done on the root cause of the outage and hence did not accurately portray that cause. I apologize for supplying what appears to have been misleading or at least incomplete information.


Comments

Could this have been caused by a competitor eager to discredit Amazon? EMC, maybe? Oh, those darn conspiracies...

Posted by: asenski at February 16, 2008 10:15 AM

Isn't it ironic that Amazon sells its S3/EC2 service as something you add as you need it, growing dynamically, yet its own authentication service doesn't eat its own dog food?

But as long as they learn lessons from this...

Posted by: pforret at February 16, 2008 10:23 AM

Maybe it was Google?

Posted by: IsaacGarcia at February 16, 2008 11:25 AM

Nick, storage customers tend to be far more conservative about risk than your average web services customers. Paranoid skepticism about storage creates different market dynamics. In other words, it ain't cost-driven. Yesterday I wrote about it on my blog. (click my name below to read)

Asenski - I hope that was tongue in cheek?

Posted by: MarcFarley at February 16, 2008 11:55 AM

My tour of a data center my client was going to use, if I blessed it:

"...and this is our SAN, 100% uptime, full-time engineers in the plant 24/7. Microsoft SAN engineers on call, 30 minutes response time... blah, blah."

The result: a four-day outage due to the SAN, plus a restore mistake made by a Microsoft engineer.

We had 200 mobile work order accounts with the messaging vendor, and fell to 24 accounts. Another promising business wiped off the map by hosting snafus.

Posted by: abm at February 16, 2008 01:45 PM

I would be interested to know how many sites were affected and what site overloaded it. What adverse effects did this have on the sites? I agree though that this will make S3 better in the end although it hurts their reputation some.

Posted by: Devin at February 16, 2008 06:06 PM

Hello!

What do you think about (MS) "Home Server" concept vs. "Cloud" concept ...

http://www.roughtype.com/archives/2007/10/google_apple_an.php [Filip's comments]

Regards, Roman

Posted by: doknir at February 17, 2008 07:20 AM

The explanation from Amazon is an explanation of the immediate or visible cause, at best, maybe the proximate cause.
The root cause has not been mentioned, see http://thinkingproblemmanagement.blogspot.com/2008/02/reporting-on-major-incidents.html

The litmus test would be to assume the outage was aircraft related: would the FAA have accepted the explanation? I think not, and here I am suspicious.
1. Peak load times are usually at 11:30 am for any web site. This did not happen at peak load times.
2. The time of the incident is suspiciously close to change management windows for ecommerce sites, i.e. early morning.
Methinks there was a change management process failure or other type of human error.
Until the root cause is identified, no appropriate counter-measures are possible!

Posted by: redpineapple at February 18, 2008 06:06 AM

Nick,

The Cloud concept has had a lot of coverage since the New Year, your book playing no small part in getting the debate going.

In order to discuss some of the issues surrounding The Cloud, I think it is important to place it in historical context, looking at the Cloud's forerunners and the problems they encountered before being adopted.

On my blog, I've tried to do that in my Cloud Computing post.

One of the current barriers in the way of The Cloud is economics. I argue that,

"Telecom prices have fallen and bandwidth has increased, but more slowly than processing power, leaving the economics worse than in 2003".

And “I'm sure that advances will appear over the coming years to bring us closer, but at the moment there are too many issues and costs with network traffic and data movements to allow it to happen for all but select processor intensive applications, such as image rendering and finite modelling.”

Any thoughts?

Regards

PJW

Posted by: Paul Wallis at February 20, 2008 11:52 AM

Rough Type is:

Written and published by
Nicholas Carr
