What to do about cloud outages
Last week, Amazon experienced a major disruption to its Simple Storage Service (S3). Here is their official (detailed) statement on the cause of the outage. As you probably heard, a lot of major online retailers and other companies that depend on this service were impacted. But what can you do about something like this if your business depends on these cloud services?
Nothing.
I’ll say it again, because it bears repeating: Nothing. There is nothing that you can do. And that is both the blessing and the curse of placing your data into a third-party data center. There is nothing you can do, and thankfully, there is nothing you need to do, either.
Murphy is Unavoidable, Even in the Cloud
This article makes the argument that you cannot trust cloud providers to be looking out for you, the customer. I don’t know if I agree with that, but it is at least a little disappointing to learn that a simple fat-fingered administrative operation could take down such a large section of critical, customer-facing services. Shouldn’t there just be zero single points of failure in any cloud design? Isn’t that like the first commandment of cloud building?
Well, if it is any consolation, it seems that the technical reason this happened is that too many redundant nodes were brought down simultaneously, by accident. In other words, there are no single points of failure, unless you bring the extra ones offline. So the steps they have taken to prevent this type of thing in the future appear (according to their statement) to revolve around preventing the system from removing more capacity than it can spare at any given point in time.
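To make that idea concrete, here is a minimal sketch of what a "don't remove more than you can spare" guard might look like. This is not Amazon's tooling; the function names, the node counts, and the notion of a single `min_required` threshold are all assumptions made for illustration.

```python
def safe_to_remove(active_nodes: int, requested_removal: int, min_required: int) -> bool:
    """Return True only if the removal leaves at least min_required nodes active."""
    if requested_removal < 0:
        raise ValueError("requested_removal must be non-negative")
    return active_nodes - requested_removal >= min_required


def remove_capacity(active_nodes: int, requested_removal: int, min_required: int) -> int:
    """Apply the removal and return the new node count, or refuse outright."""
    if not safe_to_remove(active_nodes, requested_removal, min_required):
        # Hypothetical policy: fail loudly instead of trusting the operator's input.
        raise RuntimeError(
            f"refusing to remove {requested_removal} of {active_nodes} nodes: "
            f"would drop below the minimum of {min_required}"
        )
    return active_nodes - requested_removal
```

With these made-up numbers, `remove_capacity(10, 2, 6)` succeeds and leaves 8 nodes, while `remove_capacity(10, 5, 6)` raises an error rather than letting a typo take out the redundancy.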
Yeah, that’s not much better (wouldn’t that have already been assumed to be part of a good continuously available architecture?). Well, here’s the thing. These giant robots called clouds are exquisitely complex systems. It is probably safe to say that not every type of failure can even be predicted or prevented at this scale, even by very smart engineers. I’d wager that Amazon does not employ a single person who knows and can explain every dependency in their S3 system–that knowledge is most likely spread out across a team of people, or even several teams.
So it may be unreasonable to expect that a system will never experience outages. For example, Azure Virtual Machines offer an SLA of 99.95% uptime, and they can safely offer that availability because they require you to deploy at least two of any given virtual machine into an “availability set”–meaning that the virtual machines inside that set will not be placed on the same hardware / fault domain within their data centers. Likewise, Office 365 has had a pretty damn good track record over the past few years (they post quarterly updates at the Trust Center), but that’s not to say there couldn’t be a major outage in some Microsoft datacenter region tomorrow–it could totally happen.
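It helps to translate a percentage like 99.95% into actual minutes. The sketch below does that arithmetic; the 30-day month is my assumption, and how a provider actually measures downtime is defined in its own SLA terms.

```python
def downtime_budget_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a period while still meeting the SLA."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - sla_percent) / 100.0


# Assumed 30-day month; prints the budget rounded to one decimal place.
print(round(downtime_budget_minutes(99.95), 1))
```

So a 99.95% SLA still leaves room for roughly 21.6 minutes of downtime in a 30-day month without the provider missing its target.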
Downtime and Reputation
Obviously, this week wasn’t great for Amazon–now they have egg on their face. But you know who doesn’t have (as much) egg on theirs? All of their customers who were affected by the outage. That’s right–one of the interesting side-effects of choosing to go with a cloud provider is reputation insulation. If a business were down because of a snafu like the one Amazon just had, relationships with customers could be damaged or lost. But now, these companies are able to say, “Yeah, it is unfortunate, but like a lot of companies that were impacted by the recent Amazon outage, we too felt the pain…”
Don’t get me wrong, I’m not advocating that you jump into the cloud just to abdicate responsibility like that, but nevertheless, it is an important and, I think, overlooked aspect of choosing to outsource your critical services. You give up a lot of control, true. Remember: there is nothing you can do. But you give up a lot of responsibility too. There is nothing you need to do. As a corollary, customers are going to understand this to some degree, and that will probably earn you at least some level of sympathy–so damage control is pretty straightforward in many cases.
I think this is reflected in most Service Level Agreements as well. For example, that Azure SLA we mentioned earlier–if they fail to meet 99.95% uptime, the service credit offered to you is only 10%. Not 100%, not even 50%. Just 10%. Think about that: if you were to lose customers or business because of an outage, you would expect to be compensated pretty well for those lost dollars, right? But a measly 10% service credit could hardly make up for these types of losses. On the other hand, if your reputation is in fact somewhat insulated during the interruption, your losses may not be as great as you initially feared, anyway.
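To put rough numbers on that gap, here is a small sketch. The 10% credit comes from the SLA discussed above; the monthly bill and the lost revenue are invented figures, there only to show the shape of the arithmetic.

```python
def net_loss_after_credit(monthly_bill: float, credit_percent: float, lost_revenue: float) -> float:
    """Lost revenue minus the service credit, which is a percentage of the bill."""
    credit = monthly_bill * credit_percent / 100
    return lost_revenue - credit


# Hypothetical: a $2,000 monthly bill and $15,000 in business lost to the outage.
print(net_loss_after_credit(2000, 10, 15000))
```

In this made-up case the credit is worth $200, so the business is still out $14,800. The credit scales with what you pay the provider, not with what the outage cost you.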
Conclusions
I work with some people who are on the other side of the fence from me–and I am sure they would latch onto this event as evidence in support of their prejudices: “Stay wary of cloud providers, and only trust yourself–or your consultants ;)–with the task of architecting solutions that are right for you.”
All I know is, the cloud does make a lot of things easier in my line of work. Sure, it complicates things too, but a lot of those complexities (and the associated headaches) are abstracted from the customer. Unfortunately, we can never be 100% certain that our data is “safe” up there, but then again, every event like this gives providers like Amazon more feedback–more data with which to adjust and make corrections–for a more stable system moving forward. This same learning curve and continuous improvement happens in almost every IT department, big and small, all across the world.
It should not be all that surprising, then, when things like this happen. Remember: there is no cloud, it’s just someone else’s computer.