I just realized that the comments from the O’Reilly story on Amazon are also really good. Here are some of the stand-outs.
I thought the failure was that multiple availability zones died simultaneously, something that by design and per Amazon’s docs should never happen short of a hurricane in Virginia. Note that it is exponentially harder to distribute your app across not only AZs but geographical areas as well: high speed links connect AZs within a geo, but going from one geo to another is extremely slow and not realtime.
Of course you design for failure, it happens every day on AWS. But can you design around multiple datacenters (availability zones) dying simultaneously? When AWS told you not to worry about that eventuality? Probably not without downtime and some serious compromises.
The problem is that once EVERYONE falls back to a service in another availability zone, that zone suddenly has to handle twice the load (probably a lot more when Virginia goes down, because it’s generally believed to have the most instances). We saw pretty heavy slowdown across zones even with only a handful of people following this approach. You need to either bring another provider into the mix, or just have faith that AWS keeps piles and piles of spare capacity.
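The arithmetic behind that worry is easy to sketch. The Python below is a rough illustration only, assuming traffic is spread evenly across zones and that every zone has the same capacity (neither is claimed in the comment above); the function names are made up for this example.

```python
def load_per_surviving_zone(total_load: float, zones: int, failed: int) -> float:
    """Load each surviving zone carries if traffic is spread evenly (assumption)."""
    surviving = zones - failed
    if surviving <= 0:
        raise ValueError("no surviving zones to absorb the load")
    return total_load / surviving


def required_headroom(zones: int, failed: int) -> float:
    """Spare capacity each zone needs, as a fraction of its normal load."""
    normal = load_per_surviving_zone(1.0, zones, 0)
    degraded = load_per_surviving_zone(1.0, zones, failed)
    return degraded / normal - 1.0


if __name__ == "__main__":
    for zones in (2, 3, 4):
        print(f"{zones} zones, 1 failed: {required_headroom(zones, 1):.0%} headroom needed")
```

With two zones, losing one means the survivor needs 100% headroom, which is the "twice the load" case; with four zones the figure drops to about 33%.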
AWS previously assured us that multiple Availability Zones wouldn’t realistically fail at the same time. Now that this has proved to be untrue, you choose to say “Ah - you shouldn’t have believed AWS, you should have been using multiple regions.” Presumably when the next outage hits both US regions you’ll say “Ah - of course you should have used the EU and Asia regions as well.”
We should recognize AWS as a single point of failure and look at hosting across multiple providers. Fool me once, shame on you; fool me twice, shame on me.
This does require sophisticated management tools like enStratus, but you should use those tools to avoid putting all your eggs into the AWS basket.
I’m not sure that the rest of the technology stack has necessarily caught up to this model though - in particular NoSQL databases aren’t the panacea you appear to believe them to be. Hopefully all the pieces of the technology stack will evolve.
AWS has never in any conversation I have ever had said that multiple availability zones would not realistically fail at the same time. If they felt that way, don’t you think they’d have an SLA better than 99.9%?
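For a sense of scale, the 99.9% figure mentioned above can be turned into hours of permitted downtime. This is a back-of-the-envelope sketch, assuming a 365-day year and that the percentage applies to total uptime over the year; the real SLA terms may measure things differently.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years (assumption)


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be down while still meeting the target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target}%: {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

A 99.9% target leaves roughly 8.76 hours of downtime a year, while 99.999% leaves a little over five minutes.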
Of course, if you want to survive the failure of multiple availability zones, you should spread yourself across regions. I don’t understand why this is so hard for people to understand.
Similarly, yes, you should have some ability to migrate your systems into another cloud. I don’t think actual technical loss of all AWS regions (or even multiple regions) can happen absent nuclear war or asteroid strike, but companies do go out of business/get sued/etc.
"The strength of cloud computing is that it puts control over application availability in the hands of the application developer and not in the hands of your IT staff, data center limitations, or a managed services provider"
Sadly, no. That the developers are in charge of this stuff is why so many sites were down completely. They’re terrible at it, don’t value it, and even when they try to roll it out they do it poorly.
An IT guy would have spent 15 minutes on day 1 thinking about disaster recovery. A developer always wants to do it tomorrow and tomorrow, as we all know, never comes.
Certainly, you can design for failure. And for those where failure is literally not an option, like things that deal with life-safety or where thousands of dollars are lost every second, sure.
But one simple thing you don’t address is that designing for failure is a lot more expensive during the development cycle. Yes, any bridge across troubled water can be over-built to ensure that it never, ever fails, but doing so is often so cost-prohibitive as to be unrealistic.
And lastly, your advocacy would have sounded more credible if you had stated up-front that you are CTO of a company that purports to help people solve this problem in exchange for mucho dinero rather than bury that fact in the middle of the article.
"In short, if your systems failed in the Amazon cloud this week, it wasn’t Amazon’s fault. You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon’s cloud computing model."
Oh yes, a company would provide you a cloud service to host your data and when that very service fails and renders your own operation useless it is your fault. So why pay for their service in the first place?
This sounded like an argument made by either a total irrational fanatic, or another network guy who is clueless about creating software, or both. Seems like outages these days are always the fault of the software developer(s) and never that of the one maintaining the network resource.
Anybody with common sense should stop reading right after that quoted sentence.
Well it’s the fault of the software developers for thinking they don’t need “clueless network guys”
You realize “in th3 cloud!!! ZOMG we’re in cloud!!!” means nothing more than you’re running a virtual machine in a data center somewhere. That’s all amazon does. It’s not magic - you’re not safe because you’re “in the cloud.”
It’s just a datacenter. That most software engineers don’t know that is why they need “clueless network guys” to point it out to them.
This article has a point; any company running on AWS could have designed its system to survive this outage. But this missed two key points:
1) How could they test this survivability, end-to-end, ahead of time? The rule is: if you didn’t test it, it probably won’t work. Companies that survived unscathed were prepared *and* lucky.
2) What about recovery? The statement “No humans” is wrong. A company may be able to design for the initial outage, but designing to automatically handle a period of days where Amazon are fiddling with flaky infrastructure is practically impossible. Everyone will have a lot of overtime afterwards, making sure their systems are working perfectly and data is consistent.
And the elephant in the room is that startups can only tackle a few problems at once. If they pile resources into 99.999% reliability, the opportunity cost is that the rest of their development goes slower and they fall behind the competition.
Crafting the test scenarios for cloud computing can definitely be challenging, but it is doable.
My best advice is to assume that anywhere you have just one of a concept (e.g. one availability zone), you have some kind of single point of failure, and to shut down access to that single point of failure.
Then automate your tests!
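As a toy illustration of that advice, the sketch below "shuts down" each zone in turn and reports which components would have no replica left. It is a minimal in-memory model, not a real test harness: the `layout` mapping, the component names, and the zone labels are all hypothetical, and a real test would block traffic to live infrastructure rather than edit a dictionary.

```python
def single_points_of_failure(replicas: dict) -> dict:
    """Remove each zone in turn and list the components that go dark.

    `replicas` maps a component name to the set of zones hosting it
    (a hypothetical model of a deployment, assumed for this example).
    """
    all_zones = set().union(*replicas.values())
    findings = {}
    for zone in sorted(all_zones):
        lost = sorted(
            name for name, zones in replicas.items()
            if not (zones - {zone})  # no replica survives outside this zone
        )
        if lost:
            findings[zone] = lost
    return findings


if __name__ == "__main__":
    layout = {
        "web": {"us-east-1a", "us-east-1b"},
        "database": {"us-east-1a"},  # only one zone: a single point of failure
    }
    print(single_points_of_failure(layout))
```

Here the web tier survives the loss of either zone, but the database lives only in one, so the check flags that zone. Running something like this on every build is one way to read "automate your tests".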
I wondered when the Cloud Snow Job would begin, surprised it’s on O’Reilly frankly.
What this article doesn’t “get” is that Amazon fundamentally did not deliver what it said on its tin.
Also, the solutions espoused here are pretty standard traditional datacentre operating procedures, which cost real money - the whole point of the “Cloud” was to avoid these costs, else why bother?
As to “Applications built with “design for failure” in mind don’t need SLAs.” - run a mile from anyone who suggests that.
I think the vitriol behind some of these comments shows exactly why the business is moving to the cloud.
IT: The Department of No
Do you guys think people are really duped into moving to the cloud because of pots of gold and promises of no worries?
No. They are moving there because you, the IT leader, make it impossible to do their jobs. Procuring a server or even a VM in most organizations takes 3-6 weeks (or, in some cases, 3-6 months) and the business has real work to do that actually generates revenue for your company.
But you are saying no to the business and going on forums yapping about why the cloud sucks.
In the meantime, they are sticking systems without any controls or risk analysis or redundancy in the cloud.
Stop bashing the cloud and do your job. Help them move into the cloud appropriately.
The reason it can take some time to get a new system up (3-6 weeks is probably long for a small company but believable for a big one, and is certainly long for any company which has virtualized internally) is because that system needs to have controls and there needs to be risk analysis and redundancy planning exactly as you suggest. The cloud doesn’t mitigate any of that. It just shifts the responsibility for it from the “Department of No” to the “Developer who doesn’t know.” It still takes time and effort and skill to do it right. Which I think was the original point of your post.
But the bleeding edge developers who want to do something cool and not have to worry about those pesky details have to either wait for somebody to do that part for them, or take it to the cloud and accept that their amazing app is going to be subject to the whims of Amazon’s IT staff who don’t give a darn about them and their puny app, instead of the whims of their own IT guy that they could be taking out for a beer every now and then. IT can be a good friend. But you can’t treat them badly (or, for example, call them names) and expect them to still cater to your every whim when you have it. There are 50 other developers clamoring for the same thing. They’re probably off helping the ones they like more.
I think that calling this week the cloud’s “shining moment” is stretching things. It would be more accurate to say that, since it’s the cloud, recovery is easier. With native Amazon tools in some cases, and strong devops tools and practices in others, DR can in fact be radically cheaper and easier in the cloud than in traditional, physical infrastructures.
Cheaper and easier, though, is only part of the point of the cloud. Another important point is that it lets IT focus more on the business and less on infrastructure complexities. At the moment, major cloud providers remove physical complexities but not really software ones. It’s still up to us to design, build, deploy, and manage those great devops practices. Vendors like enStratus take the next step and bite off that layer. Either poetically or ironically, depending on your viewpoint, they use the cloud to recover from failures in the cloud. (If nothing else, we’re proving there is no such thing as THE cloud. Maybe we should call it “the sky” instead. The sky definitely was falling this week! :-).
In any case, it would be interesting to learn how much load the various cloud automation services handled this week. Can they scale if thousands of AWS customers use their service to migrate tens of thousands of servers and terabytes/petabytes of data all at once, or does the meltdown cascade from one level of the cloud to the next?
Data, on the other hand, is a whole different kettle of fish. If we all waited to migrate to the cloud until we’d implemented true design-for-fail architectures, cloud adoption would be at least an order of magnitude slower than it is. The bottom line is that we’re still only partway through the journey to the holy grail of the cloud. So far we’ve pushed the complexity, difficulty, and cost up the stack from hardware to software and systems architecture. This week the tradeoffs that were made at the architecture level are revealing themselves for everyone to see.
No, that’s completely wrong. Partition tolerance is the ability to continue to meet your service guarantees in the presence of communication failures that isolate portions of your system.
Also, non-relational is not the same thing as eventually consistent (just ask the HBase folks). You can have strong consistency requirements without using a relational storage model.
When you say, “The knee-jerk reaction is to look for an SLA from your cloud provider,” you are ignoring the point made above by Abol. Amazon claimed that zones were independent, but an EBS failure affected multiple zones. They fell into the same trap you are warning about: they didn’t design for failure.
This was a moderately interesting but also intensely frustrating posting.
"This should never have happened if you designed your services right, to never trust (that one) cloud!"
Yes, but, the number of people who have all of the CS and IT architecture and IT operations backgrounds to understand how to not trust any single point of failure and actually design really robust systems around that sort of thing is not that large.
What is being suggested is that the entire industry must suddenly develop a higher level of technical competence than it now has, by a large factor.
Would this be a good thing? Of course. Is it practically going to reach the priority level that real operational organizations can make it happen? Unlikely.
Eventually, attempting to wring the last 9’s out of a service, one runs into externalities such as partitioned and failing backbone ISPs, major DNS outages, physical damage to infrastructure, and other hard to solve problems. One can design right up to that ragged external unavoidable outage edge, with arbitrary amounts of time and money and expertise. I and a few others are happy to do that for clients. But knowing what is economical and sensible, and what is polishing the shine on areas when there are larger inherent risks accepted as costs of doing business, is important.
Actually, almost all you need to know to achieve AZ-redundancy is to simply follow programming best practices you were supposed to be following in the first place.
Getting x-region and x-cloud is moderately more difficult, but not as difficult as doing it for a traditional data center.
No, not even close. You need to follow system design and integration best practices you were supposed to be following in the first place, which includes programming and architecture and all the other subcomponents.
The number of organizations that actually meet system design and integration best practices, in the real world, is very small. Hence my frustration. Actual high availability and dependability is a much harder problem than people tend to think it is. Saying it’s just a programming problem is obfuscating.
All the things that need to be done are described in literature and operational reports and so forth. None of them are secret or particularly obscure. But the rigorous study of systems architecture needed to understand the scope of the problem well enough to conceive of a solution and implement it is rare.
So a cloud provider is just like any other datacenter, except you have to spend a lot more money in development to work around their unreliability. Awesome!