Review: Amazon, the mother of all clouds | Cloud Computing - InfoWorld
A good run-down of the Amazon cloud’s technical features and services.
Selling servers by the hour was a bold idea when the Amazon cloud business launched a few years ago, but it seems quaint compared to all the options for sale today. There are currently 21 products available on Amazon Web Services, and only one of them is the classic EC2 machine, an abbreviation of the full name, the Elastic Compute Cloud. The original S3 (Simple Storage Service) now has cousins like the Simple Workflow Service and SimpleDB, a nonrelational data store. Then there are odder innovations like Amazon Glacier, a very cheap storage solution that takes hours to retrieve the data. Yes, hours. Not milliseconds, not seconds, not minutes — but hours.
It’s impossible to summarize it all in a paragraph or even an article. Amazon Web Services would require a book, but that tome would be out of date by the time it was printed because the service changes quickly. The best news is that Amazon is constantly looking at costs and generally lowering prices as it finds a way to deliver the product for less. Some prices have gone up occasionally over the years, an effort to make the prices reflect reality.
Amazon has also found plenty of supporters. A number of big companies such as Netflix are proud of using Amazon’s servers, and plenty of startups are glad they didn’t need to set up their own data centers to reach for the gold ring of IPO riches. Some customers brag about spending $1 million or more a month, an amount that would be more than enough for most companies to justify setting up an in-house facility and team. Clearly, Amazon is delivering a whole lot of value.
'Linux for cloud' floats anti-Amazon cloud taster • The Register
OpenStack has opened up a temporary cloud, called TryStack, to try and entice developers away from Amazon.
How does a cloud project solve a problem like Amazon? The once genteel etailer of toasters, books and CDs has now become a byword for “cloud”.
People just can’t stop stuffing more of their data into Amazon’s EC2 service. The number of objects held in EC2’s S3 storage service grew by nearly 200 per cent in 2011 to 762 billion, compared to 156 per cent growth in the year before.
More will be mopped up in 2012, as Amazon has made it easier and even cheaper to use S3.
Bezos’s company cut its entry-level prices by 10 per cent from 1 Feb, to $0.125 per gigabyte for the first terabyte. In January it uncloaked the AWS Storage Gateway beta to act as a cloud-based data back-up service for companies, thereby hoovering up even more.
Developers are the problem. Amazon has made it easy to test and develop apps at a very low price using open or at least familiar programming languages, tools and databases. The service couldn’t be more ubiquitous: fire up a browser and it’s easy to find and start EC2.
If you’re the OpenStack Project hoping to challenge Amazon for the business customers putting their apps in the cloud, then one approach might be to tempt developers with a suck-it-and-see scheme for this “Linux for the cloud” effort.
OpenStack is about to open up FreeCloud, which lets you test the idea of OpenStack while running it hosted in a sandboxed environment. FreeCloud’s also getting the re-branding iron run over its hide: it’s now called TryStack. The idea is TryStack lets you test OpenStack without either having to download and install the code on your own servers or – if that sounds like too much hassle – having to sign a contract with a service provider running OpenStack in their cloud before you know you’ll like or need it. It’s meant to tie into the DevStack OpenStack development toolkit.
With Amazon, there’s no code or modules to download – the service is ready to go, reducing the barriers to experimentation and producing potential future customers.
Jonathan Bryce, OpenStack project chairman and chief technology officer and founder of The Rackspace Cloud, told The Reg: “We heard from several people who are interested in using OpenStack, but don’t have the sysadmins to build the KVM or networking.
“Because it’s a set of services and servers you have to set up and run yourself, and if you are a developer working on an application you are more used to having access to a running environment and putting code in rather than having to build that environment.
“One of the first targets that brought this need to our attention was app developers who write tools for managing clouds - companies like Cyberduck and EC2 Firefox extension. They need a functioning version of an OpenStack cloud that they can program against. To get that they’d have to install OpenStack somewhere and to do that they’d need the hardware.”
TryStack has been in use by about 12 customers for the last five months as FreeCloud. It was built by Rackspace and NTT using Dell servers and hosting and bandwidth from Equinix - all OpenStack Project members. TryStack runs the last release of OpenStack, called Diablo, coming with compute, object storage and Glance for doing snapshots, with dashboard and authentication.
You’d think TryStack would use stealth to attract and retain developers, but it doesn’t and TryStack is purely a developer service that’s not been built for long-term hosting. Not yet, at least.
TryStack users must buy into an invented currency, called Stack Dollars. Everybody gets 100,000 Stack Dollars, which gives them “a couple of gigabytes for a couple of days”. Stack Dollars don’t have any actual monetary value, and they can be topped up. The smaller your compute instances the longer the dollars last, the larger the quicker they burn. Once you’re finished, your compute resources return to the pool. TryStack doesn’t come with any SLAs, and the service might get taken offline for upgrades, Bryce said.
The OpenStackers behind TryStack really don’t want you getting any ideas about hanging around. For OpenStack, however, this might be the best way to grow against Amazon. The problem is commercial operations like Dell, Rackspace, Citrix and Hewlett-Packard are either delivering or building clouds using OpenStack and so wouldn’t appreciate the free competition.
Rackspace, Bryce’s employer, was instrumental in creating OpenStack and has led construction of significant portions of the code – compute and storage – that it’s now deploying on its hosting servers. Rackspace just notched up its first billion-dollar year after 12 years in the biz.
HP’s OpenStack cloud has been in beta since September – it’s currently an Infrastructure-as-a-Service service providing compute and storage. With Citrix’s CloudStack, you download either uncompiled or compiled binaries. How Citrix charges isn’t clear and it still seems to be relying on the Cloud.com start-up it bought last year for users to configure the code. Dell’s not getting its hands dirty, and is now selling an OpenStack “solution” – blueprints and advice on setting up an OpenStack cloud running on Dell servers.
The thinking behind TryStack is that it will pique developers’ interest in a cloud architecture for everybody who’s not Amazon – PC and server manufacturers, service providers, software makers and others. “It’s about making it easy for people to try it out … hopefully that leads to business for all the OpenStack companies afterwards,” Bryce said.
But translating that into a paying business for those with a stake in the project will be another thing altogether. With the “deploy” portion of the development cycle not available and with so many OpenStack clouds in different states of public availability, it’s likely Amazon will continue to grow fat, and fast, before OpenStack starts racking up.
A CTO’s take on cloud — Cloud Computing News
Great article on how the CTO of Capgemini is looking at recent developments in enterprise adoption of the cloud. Good stuff.
As Capgemini’s CTO for North America, Joe Coyle hears an awful lot about cloud computing. He hears it from customers that want to evaluate cloud solutions and from vendors that want to win that business. Capgemini, a $12 billion global systems integrator, has relationships with all the major vendors and many enterprise customers, so it’s interesting to hear what Coyle has to say about the current state of the market.
Here are my main takeaways from a recent conversation with him.
1: IBM is cloudier than you think.
Big Blue has a pretty potent set of cloud options but it’s going about its business very cleverly. Given its big-iron heritage, IBM rarely talks about the hardware component of its cloud portfolio, Coyle said.
“They’re attacking this from a software perspective. They’ve taken Tivoli and are building this software umbrella so that you can take whatever you’re running in your data center now and put all or part of it in a public or private cloud,” he noted. IBM’s 2010 acquisition of Cast Iron also gives it a slick appliance that lets customers integrate in-house apps with SaaS applications running outside.
He doesn’t see IBM cloud penetrating a ton of new smaller businesses, but for many existing IBM shops — and there are a ton of them — IBM cloud is a no brainer.
2: Microsoft Azure has a tough row to hoe
Coyle is of two minds on Windows Azure, the platform-as-a-service (PaaS) underlying Microsoft’s cloud strategy.
“Azure’s been a bit of a disappointment,” he said. “When Microsoft briefed us on it years ago, all the national [systems integrators] were chomping at the bit. But then it stumbled.”
“Then the message was the software would only run on Azure. That’s fine, but by that point, the world had moved on, companies were already using Amazon,” he said. The usual argument that Azure is a PaaS while Amazon Web Services (AWS) is Infrastructure-as-a-Service (IaaS) simply doesn’t matter to most customers. The big AWS draw is they know they can deploy their applications on AWS now and move them to another hosted or in-house data center later.
On the plus side, the Azure technology is solid and, unlike previous Microsoft development technologies, forces developers to follow the rules — they can’t design software services that misbehave. “Azure is extremely powerful and if [Microsoft] can get its act together people will try it,” Coyle said.
But overshadowing all that technical mastery is the perception of Azure as a closed platform — despite its multi-language support. Microsoft’s single biggest problem is customer suspicion that it will use Azure to lock them into the next wave of Microsoft technologies, essentially replacing the Windows/Office upgrade cycle.
“I’m not saying it’s true, but it’s what people think,” Coyle said.
3: Amazon is Amazon
Amazon Web Services are what they are: extremely flexible and leading the league in public cloud. AWS suffered a couple black eyes in 2011 with an embarrassing four-day outage in April and then a widespread reboot glitch later in the year.
Coyle is pretty forgiving of these miscues. The April outage, he said, was largely due to people implementing their work incorrectly, something that AWS tried to fix manually. There are things you can do now in AWS to prevent this stuff, to build in more reliability and redundancy, although users will have to pay for it, he said.
The bottom line? Glitches and all, Amazon is the incumbent public cloud power and will stay that way, he said.
4: OpenStack as big-time cloud disruptor
Coyle is also bullish on the OpenStack movement, which is building a standard cloud foundation out of open-source tools. Initiated by Rackspace and NASA, it’s achieved critical mass with nearly every IT provider — from Dell, to HP, to Cisco, to Citrix — aboard and Rackspace offloading management to a more neutral OpenStack Foundation.
“OpenStack will change the world of cloud computing. As a lot of smaller companies look to build their own clouds, this will be a natural choice,” Coyle said.
Who stands to lose if that’s the case? Ironically, the Dells and HPs of the world — all of which are building their own clouds. “Why do you think they joined?” His feeling is these hardware companies — many of which were building their own more vendor-specific clouds — are hedging their bets.
Will OpenStack affect Amazon? “No. Amazon is Amazon,” he said.
5: CIOs are getting over cloud phobia
It’s taken time, but the economics of cloud computing are too good for CIOs to ignore, Coyle said. Any doubts they had about moving at least some corporate data to an outside cloud storage provider, for instance, have evaporated in recent months.
And they’re getting emboldened to do more than storage. The advent of Hadoop and NoSQL technologies means that companies could actually get some use out of all that old stuff sitting on tape or in platters, he said. Uploading that information, and massaging it with the latest analytics means that historical data can be used to test assumptions and new models, for example, seeing what a price change means to sales over time.
Wringing real value out of old data is a pretty good proposition for most CIOs.
Rip Rowan - Google - Stevey's Google Platforms Rant I was at Amazon for about…
This “rant” from a Google employee has been getting a lot of attention on the interwebs over the past couple of days. And for good reason. It’s a very good, IT manager-/developer-eye view of how to architect services in huge, web-centric organizations like Google and Amazon. Very interesting.
So one day Jeff Bezos issued a mandate. He’s doing that all the time, of course, and people scramble like ants being pounded with a rubber mallet whenever it happens. But on one occasion — back around 2002 I think, plus or minus a year — he issued a mandate that was so out there, so huge and eye-bulgingly ponderous, that it made all of his other mandates look like unsolicited peer bonuses.
His Big Mandate went something along these lines:
1) All teams will henceforth expose their data and functionality through service interfaces.
2) Teams must communicate with each other through these interfaces.
3) There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
4) It doesn’t matter what technology they use. HTTP, Corba, Pubsub, custom protocols — doesn’t matter. Bezos doesn’t care.
5) All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
6) Anyone who doesn’t do this will be fired.
7) Thank you; have a nice day!
Ha, ha! You 150-odd ex-Amazon folks here will of course realize immediately that #7 was a little joke I threw in, because Bezos most definitely does not give a shit about your day.
#6, however, was quite real, so people went to work. Bezos assigned a couple of Chief Bulldogs to oversee the effort and ensure forward progress, headed up by Uber-Chief Bear Bulldog Rick Dalzell. Rick is an ex-Army Ranger, West Point Academy graduate, ex-boxer, ex-Chief Torturer slash CIO at Wal*Mart, and is a big genial scary man who used the word “hardened interface” a lot. Rick was a walking, talking hardened interface himself, so needless to say, everyone made LOTS of forward progress and made sure Rick knew about it.
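To make rules 1 through 3 of the mandate concrete, here is a minimal sketch in Python. Every name in it (`InventoryAPI`, `InventoryService`, `OrderTeam`, `get_stock`) is hypothetical and invented for illustration, not anything Amazon actually shipped, and the in-process method call merely stands in for a call over the network. The point it tries to show is that the consuming team sees only the published interface and never touches the owning team's data store.

```python
from typing import Protocol


class InventoryAPI(Protocol):
    """The published service interface: the only thing other teams may use."""

    def get_stock(self, sku: str) -> int: ...


class InventoryService:
    """The owning team's implementation. Its data store stays private."""

    def __init__(self) -> None:
        # Private store (hypothetical data); no other team reads this directly.
        self._stock = {"book-123": 7}

    def get_stock(self, sku: str) -> int:
        # Unknown SKUs report zero stock instead of leaking storage errors.
        return self._stock.get(sku, 0)


class OrderTeam:
    """A consuming team. It depends on the interface, not the implementation."""

    def __init__(self, inventory: InventoryAPI) -> None:
        self._inventory = inventory

    def can_ship(self, sku: str, quantity: int) -> bool:
        return self._inventory.get_stock(sku) >= quantity


orders = OrderTeam(InventoryService())
print(orders.can_ship("book-123", 5))
print(orders.can_ship("book-123", 10))
```

Because `OrderTeam` is written against `InventoryAPI` alone, the same interface could in principle be exposed to outside developers later, which is roughly what rule 5 asks teams to plan for.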
Over the next couple of years, Amazon transformed internally into a service-oriented architecture. They learned a tremendous amount while effecting this transformation. There was lots of existing documentation and lore about SOAs, but at Amazon’s vast scale it was about as useful as telling Indiana Jones to look both ways before crossing the street. Amazon’s dev staff made a lot of discoveries along the way. A teeny tiny sampling of these discoveries included:
- pager escalation gets way harder, because a ticket might bounce through 20 service calls before the real owner is identified. If each bounce goes through a team with a 15-minute response time, it can be hours before the right team finally finds out, unless you build a lot of scaffolding and metrics and reporting.
- every single one of your peer teams suddenly becomes a potential DOS attacker. Nobody can make any real forward progress until very serious quotas and throttling are put in place in every single service.
- monitoring and QA are the same thing. You’d never think so until you try doing a big SOA. But when your service says “oh yes, I’m fine”, it may well be the case that the only thing still functioning in the server is the little component that knows how to say “I’m fine, roger roger, over and out” in a cheery droid voice. In order to tell whether the service is actually responding, you have to make individual calls. The problem continues recursively until your monitoring is doing comprehensive semantics checking of your entire range of services and data, at which point it’s indistinguishable from automated QA. So they’re a continuum.
- if you have hundreds of services, and your code MUST communicate with other groups’ code via these services, then you won’t be able to find any of them without a service-discovery mechanism. And you can’t have that without a service registration mechanism, which itself is another service. So Amazon has a universal service registry where you can find out reflectively (programmatically) about every service, what its APIs are, and also whether it is currently up, and where.
- debugging problems with someone else’s code gets a LOT harder, and is basically impossible unless there is a universal standard way to run every service in a debuggable sandbox.
That’s just a very small sample. There are dozens, maybe hundreds of individual learnings like these that Amazon had to discover organically. There were a lot of wacky ones around externalizing services, but not as many as you might think. Organizing into services taught teams not to trust each other in most of the same ways they’re not supposed to trust external developers.
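Two of the learnings above, per-caller throttling and monitoring that actually exercises the service, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TokenBucket`, `PriceService`, the price data and the quota numbers are all hypothetical, and a token bucket is just one common way to implement a quota, not necessarily what Amazon used.

```python
import time


class TokenBucket:
    """Per-caller quota: each request spends one token; tokens refill over time."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PriceService:
    """Hypothetical service that throttles every caller, peers included."""

    def __init__(self) -> None:
        self._prices = {"book-123": 999}  # price in cents (made-up data)
        self._buckets = {}  # caller name -> TokenBucket

    def get_price(self, caller: str, sku: str) -> int:
        bucket = self._buckets.setdefault(caller, TokenBucket(rate_per_sec=1, capacity=2))
        if not bucket.allow():
            raise RuntimeError(f"quota exceeded for {caller}")
        return self._prices[sku]

    def shallow_health(self) -> str:
        # Only proves the process can still answer "I'm fine".
        return "ok"

    def deep_health(self) -> bool:
        # Makes a real call and checks that the answer is semantically sane.
        return self.get_price("monitor", "book-123") > 0


service = PriceService()
for attempt in range(3):
    try:
        print(attempt, service.get_price("team-a", "book-123"))
    except RuntimeError as err:
        print(attempt, err)
print(service.shallow_health(), service.deep_health())
```

The shallow check can keep returning "ok" while the real code path is broken; the deep check fails whenever `get_price` does, which is where monitoring starts to look like automated QA. The escalation bullet, for its part, is plain arithmetic: 20 bounces at 15 minutes each is 300 minutes, or five hours.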
This effort was still underway when I left to join Google in mid-2005, but it was pretty far advanced. From the time Bezos issued his edict through the time I left, Amazon had transformed culturally into a company that thinks about everything in a services-first fashion. It is now fundamental to how they approach all designs, including internal designs for stuff that might never see the light of day externally.
At this point they don’t even do it out of fear of being fired. I mean, they’re still afraid of that; it’s pretty much part of daily life there, working for the Dread Pirate Bezos and all. But they do services because they’ve come to understand that it’s the Right Thing. There are without question pros and cons to the SOA approach, and some of the cons are pretty long. But overall it’s the right thing because SOA-driven design enables Platforms.
That’s what Bezos was up to with his edict, of course. He didn’t (and doesn’t) care even a tiny bit about the well-being of the teams, nor about what technologies they use, nor in fact any detail whatsoever about how they go about their business unless they happen to be screwing up. But Bezos realized long before the vast majority of Amazonians that Amazon needs to be a platform.
You wouldn’t really think that an online bookstore needs to be an extensible, programmable platform. Would you?
Well, the first big thing Bezos realized is that the infrastructure they’d built for selling and shipping books and sundry could be transformed into an excellent repurposable computing platform. So now they have the Amazon Elastic Compute Cloud, and the Amazon Elastic MapReduce, and the Amazon Relational Database Service, and a whole passel o’ other services browsable at aws.amazon.com. These services host the backends for some pretty successful companies, reddit being my personal favorite of the bunch.
The other big realization he had was that he can’t always build the right thing. I think Larry Tesler might have struck some kind of chord in Bezos when he said his mom couldn’t use the goddamn website. It’s not even super clear whose mom he was talking about, and doesn’t really matter, because nobody’s mom can use the goddamn website. In fact I myself find the website disturbingly daunting, and I worked there for over half a decade. I’ve just learned to kinda defocus my eyes and concentrate on the million or so pixels near the center of the page above the fold.
I’m not really sure how Bezos came to this realization — the insight that he can’t build one product and have it be right for everyone. But it doesn’t matter, because he gets it. There’s actually a formal name for this phenomenon. It’s called Accessibility, and it’s the most important thing in the computing world.
The. Most. Important. Thing.
If you’re sorta thinking, “huh? You mean like, blind and deaf people Accessibility?” then you’re not alone, because I’ve come to understand that there are lots and LOTS of people just like you: people for whom this idea does not have the right Accessibility, so it hasn’t been able to get through to you yet. It’s not your fault for not understanding, any more than it would be your fault for being blind or deaf or motion-restricted or living with any other disability. When software — or idea-ware for that matter — fails to be accessible to anyone for any reason, it is the fault of the software or of the messaging of the idea. It is an Accessibility failure.
Like anything else big and important in life, Accessibility has an evil twin who, jilted by the unbalanced affection displayed by their parents in their youth, has grown into an equally powerful Arch-Nemesis (yes, there’s more than one nemesis to accessibility) named Security. And boy howdy are the two ever at odds.
But I’ll argue that Accessibility is actually more important than Security because dialing Accessibility to zero means you have no product at all, whereas dialing Security to zero can still get you a reasonably successful product such as the Playstation Network.
So yeah. In case you hadn’t noticed, I could actually write a book on this topic. A fat one, filled with amusing anecdotes about ants and rubber mallets at companies I’ve worked at. But I will never get this little rant published, and you’ll never get it read, unless I start to wrap up.
That one last thing that Google doesn’t do well is Platforms. We don’t understand platforms. We don’t “get” platforms. Some of you do, but you are the minority. This has become painfully clear to me over the past six years. I was kind of hoping that competitive pressure from Microsoft and Amazon and more recently Facebook would make us wake up collectively and start doing universal services. Not in some sort of ad-hoc, half-assed way, but in more or less the same way Amazon did it: all at once, for real, no cheating, and treating it as our top priority from now on.
But no. No, it’s like our tenth or eleventh priority. Or fifteenth, I don’t know. It’s pretty low. There are a few teams who treat the idea very seriously, but most teams either don’t think about it at all, ever, or only a small percentage of them think about it in a very small way.
It’s a big stretch even to get most teams to offer a stubby service to get programmatic access to their data and computations. Most of them think they’re building products. And a stubby service is a pretty pathetic service. Go back and look at that partial list of learnings from Amazon, and tell me which ones Stubby gives you out of the box. As far as I’m concerned, it’s none of them. Stubby’s great, but it’s like parts when you need a car.
A product is useless without a platform, or more precisely and accurately, a platform-less product will always be replaced by an equivalent platform-ized product.
Google+ is a prime example of our complete failure to understand platforms from the very highest levels of executive leadership (hi Larry, Sergey, Eric, Vic, howdy howdy) down to the very lowest leaf workers (hey yo). We all don’t get it. The Golden Rule of platforms is that you Eat Your Own Dogfood. The Google+ platform is a pathetic afterthought. We had no API at all at launch, and last I checked, we had one measly API call. One of the team members marched in and told me about it when they launched, and I asked: “So is it the Stalker API?” She got all glum and said “Yeah.” I mean, I was joking, but no… the only API call we offer is to get someone’s stream. So I guess the joke was on me.
Microsoft has known about the Dogfood rule for at least twenty years. It’s been part of their culture for a whole generation now. You don’t eat People Food and give your developers Dog Food. Doing that is simply robbing your long-term platform value for short-term successes. Platforms are all about long-term thinking.
Google+ is a knee-jerk reaction, a study in short-term thinking, predicated on the incorrect notion that Facebook is successful because they built a great product. But that’s not why they are successful. Facebook is successful because they built an entire constellation of products by allowing other people to do the work. So Facebook is different for everyone. Some people spend all their time on Mafia Wars. Some spend all their time on Farmville. There are hundreds or maybe thousands of different high-quality time sinks available, so there’s something there for everyone.
Our Google+ team took a look at the aftermarket and said: “Gosh, it looks like we need some games. Let’s go contract someone to, um, write some games for us.” Do you begin to see how incredibly wrong that thinking is now? The problem is that we are trying to predict what people want and deliver it for them.
You can’t do that. Not really. Not reliably. There have been precious few people in the world, over the entire history of computing, who have been able to do it reliably. Steve Jobs was one of them. We don’t have a Steve Jobs here. I’m sorry, but we don’t.
Larry Tesler may have convinced Bezos that he was no Steve Jobs, but Bezos realized that he didn’t need to be a Steve Jobs in order to provide everyone with the right products: interfaces and workflows that they liked and felt at ease with. He just needed to enable third-party developers to do it, and it would happen automatically.
I apologize to those (many) of you for whom all this stuff I’m saying is incredibly obvious, because yeah. It’s incredibly frigging obvious. Except we’re not doing it. We don’t get Platforms, and we don’t get Accessibility. The two are basically the same thing, because platforms solve accessibility. A platform is accessibility.
So yeah, Microsoft gets it. And you know as well as I do how surprising that is, because they don’t “get” much of anything, really. But they understand platforms as a purely accidental outgrowth of having started life in the business of providing platforms. So they have thirty-plus years of learning in this space. And if you go to msdn.com, and spend some time browsing, and you’ve never seen it before, prepare to be amazed. Because it’s staggeringly huge. They have thousands, and thousands, and THOUSANDS of API calls. They have a HUGE platform. Too big in fact, because they can’t design for squat, but at least they’re doing it.
Amazon gets it. Amazon’s AWS (aws.amazon.com) is incredible. Just go look at it. Click around. It’s embarrassing. We don’t have any of that stuff.
Apple gets it, obviously. They’ve made some fundamentally non-open choices, particularly around their mobile platform. But they understand accessibility and they understand the power of third-party development and they eat their dogfood. And you know what? They make pretty good dogfood. Their APIs are a hell of a lot cleaner than Microsoft’s, and have been since time immemorial.
Facebook gets it. That’s what really worries me. That’s what got me off my lazy butt to write this thing. I hate blogging. I hate… plussing, or whatever it’s called when you do a massive rant in Google+ even though it’s a terrible venue for it but you do it anyway because in the end you really do want Google to be successful. And I do! I mean, Facebook wants me there, and it’d be pretty easy to just go. But Google is home, so I’m insisting that we have this little family intervention, uncomfortable as it might be.
After you’ve marveled at the platform offerings of Microsoft and Amazon, and Facebook I guess (I didn’t look because I didn’t want to get too depressed), head over to developers.google.com and browse a little. Pretty big difference, eh? It’s like what your fifth-grade nephew might mock up if he were doing an assignment to demonstrate what a big powerful platform company might be building if all they had, resource-wise, was one fifth grader.
Please don’t get me wrong here — I know for a fact that the dev-rel team has had to FIGHT to get even this much available externally. They’re kicking ass as far as I’m concerned, because they DO get platforms, and they are struggling heroically to try to create one in an environment that is at best platform-apathetic, and at worst often openly hostile to the idea.
I’m just frankly describing what developers.google.com looks like to an outsider. It looks childish. Where’s the Maps APIs in there for Christ’s sake? Some of the things in there are labs projects. And the APIs for everything I clicked were… they were paltry. They were obviously dog food. Not even good organic stuff. Compared to our internal APIs it’s all snouts and horse hooves.
And also don’t get me wrong about Google+. They’re far from the only offenders. This is a cultural thing. What we have going on internally is basically a war, with the underdog minority Platformers fighting a more or less losing battle against the Mighty Funded Confident Producters.
Any teams that have successfully internalized the notion that they should be externally programmable platforms from the ground up are underdogs — Maps and Docs come to mind, and I know GMail is making overtures in that direction. But it’s hard for them to get funding for it because it’s not part of our culture. Maestro’s funding is a feeble thing compared to the gargantuan Microsoft Office programming platform: it’s a fluffy rabbit versus a T-Rex. The Docs team knows they’ll never be competitive with Office until they can match its scripting facilities, but they’re not getting any resource love. I mean, I assume they’re not, given that Apps Script only works in Spreadsheet right now, and it doesn’t even have keyboard shortcuts as part of its API. That team looks pretty unloved to me.
Ironically enough, Wave was a great platform, may they rest in peace. But making something a platform is not going to make you an instant success. A platform needs a killer app. Facebook — that is, the stock service they offer with walls and friends and such — is the killer app for the Facebook Platform. And it is a very serious mistake to conclude that the Facebook App could have been anywhere near as successful without the Facebook Platform.
You know how people are always saying Google is arrogant? I’m a Googler, so I get as irritated as you do when people say that. We’re not arrogant, by and large. We’re, like, 99% Arrogance-Free. I did start this post — if you’ll reach back into distant memory — by describing Google as “doing everything right”. We do mean well, and for the most part when people say we’re arrogant it’s because we didn’t hire them, or they’re unhappy with our policies, or something along those lines. They’re inferring arrogance because it makes them feel better.
But when we take the stance that we know how to design the perfect product for everyone, and believe you me, I hear that a lot, then we’re being fools. You can attribute it to arrogance, or naivete, or whatever — it doesn’t matter in the end, because it’s foolishness. There IS no perfect product for everyone.
And so we wind up with a browser that doesn’t let you set the default font size. Talk about an affront to Accessibility. I mean, as I get older I’m actually going blind. For real. I’ve been nearsighted all my life, and once you hit 40 years old you stop being able to see things up close. So font selection becomes this life-or-death thing: it can lock you out of the product completely. But the Chrome team is flat-out arrogant here: they want to build a zero-configuration product, and they’re quite brazen about it, and Fuck You if you’re blind or deaf or whatever. Hit Ctrl-+ on every single page visit for the rest of your life.
It’s not just them. It’s everyone. The problem is that we’re a Product Company through and through. We built a successful product with broad appeal — our search, that is — and that wild success has biased us.
Amazon was a product company too, so it took an out-of-band force to make Bezos understand the need for a platform. That force was their evaporating margins; he was cornered and had to think of a way out. But all he had was a bunch of engineers and all these computers… if only they could be monetized somehow… you can see how he arrived at AWS, in hindsight.
Microsoft started out as a platform, so they’ve just had lots of practice at it.
Facebook, though: they worry me. I’m no expert, but I’m pretty sure they started off as a Product and they rode that success pretty far. So I’m not sure exactly how they made the transition to a platform. It was a relatively long time ago, since they had to be a platform before (now very old) things like Mafia Wars could come along.
Maybe they just looked at us and asked: “How can we beat Google? What are they missing?”
The problem we face is pretty huge, because it will take a dramatic cultural change in order for us to start catching up. We don’t do internal service-oriented platforms, and we just as equally don’t do external ones. This means that the “not getting it” is endemic across the company: the PMs don’t get it, the engineers don’t get it, the product teams don’t get it, nobody gets it. Even if individuals do, even if YOU do, it doesn’t matter one bit unless we’re treating it as an all-hands-on-deck emergency. We can’t keep launching products and pretending we’ll turn them into magical beautiful extensible platforms later. We’ve tried that and it’s not working.
The Golden Rule of Platforms, “Eat Your Own Dogfood”, can be rephrased as “Start with a Platform, and Then Use it for Everything.” You can’t just bolt it on later. Certainly not easily at any rate — ask anyone who worked on platformizing MS Office. Or anyone who worked on platformizing Amazon. If you delay it, it’ll be ten times as much work as just doing it correctly up front. You can’t cheat. You can’t have secret back doors for internal apps to get special priority access, not for ANY reason. You need to solve the hard problems up front.
I’m not saying it’s too late for us, but the longer we wait, the closer we get to being Too Late.
I honestly don’t know how to wrap this up. I’ve said pretty much everything I came here to say today. This post has been six years in the making. I’m sorry if I wasn’t gentle enough, or if I misrepresented some product or team or person, or if we’re actually doing LOTS of platform stuff and it just so happens that I and everyone I ever talk to has just never heard about it. I’m sorry.
But we’ve gotta start doing this right.
Amazon deploys true-blue US gov cloud for secret arms data • The Register
Amazon is making some headway with purpose-built clouds for government.
Amazon is upping the pressure on both Microsoft and Google in the battle to scoop up cash-strapped government customers into the cloud.
The online bookseller has moved to win over more government customers with the launch of AWS GovCloud, its cloud for US gov departments that crunch super-secret defence data.
AWS GovCloud is a region, Amazon’s sixth, which is designed to meet a tough set of US government rules known as the International Traffic in Arms Regulations (ITAR).
ITAR says controlled data must be stored in an environment where logical and physical access is limited to US persons or permanent residents. That data covers the import and export of a long list of goods and services, ranging from warheads and missile-control systems to aircraft carriers and tanks.
Without ITAR compliance, US departments could not legally have uploaded this data to Amazon’s existing service, given that compute and storage happen in three non-US regions.
AWS GovCloud is based on the US West Coast, Amazon said, while the service already meets the same US regulatory controls as the rest of its regions – including Federal Information Security Management Act (FISMA), PCI DSS Level 1, ISO 27001, and SAS 70.
Now the etailer says it is ready to talk to other governments about adopting GovCloud, potentially adopting similar regulatory restrictions.
Werner Vogels, Amazon chief technology officer, blogged: “We do not envision that over time GovCloud will address only the needs of the US government and contractors. We are certainly interested in understanding whether there are opportunities in other governments with respect to their specific regulatory requirements that could be solved by a specialized region.”
Government is a hot area for cloud providers. Amazon claims more than 100 federal, state and local government agencies already onboard its existing services.
Microsoft and Google, meanwhile, are running PR campaigns against each other, claiming government customers’ scalps to prove their cloud is winning against the others.
National and local government has proved a fertile hunting ground as the US’s economy has stalled and austerity becomes the watchword for the public sector.
Microsoft and Google are pushing hard to sign up agencies to their hosted email and docs. For the public-sector IT people involved, this has meant they get the prospect of brand-new systems without either the up-front purchase cost or the long-term maintenance costs.
Much of the Microsoft and Google fight has been over email and docs; at times, though, it has been point-scoring over precisely the kind of regulatory approval Amazon has on GovCloud. The idea has been to sow uncertainty about their rivals in customers’ minds or to reverse customers’ decisions.
In April, Microsoft accused Google of misleading the market about the FISMA certification of Google Apps for Government. Google shot back saying Microsoft’s cloud services for government weren’t FISMA-certified. Google is now suing the US Department of the Interior for its selection of Microsoft’s Business Productivity Online Suite-Federal – Microsoft’s rival to Google Apps for Government – saying the department had violated government procurement policies.
While Google and Microsoft tussle over details, both are behind on what they’d really like to be: the leading platform for hosting devs’ apps and data. That crown belongs to Amazon, which is now beginning to serve as a platform for other cloud businesses such as Ruby host Heroku.
Microsoft’s alternative to Amazon is Azure: the company claims a list of customers using the Azure compute fabric and storage layer to deliver apps, but it has yet to find Amazon-levels of traction.
Microsoft this week re-jigged the pricing and packaging on its smallest level of compute instance for the third time since last autumn’s launch.
Let the Cloud Developer Wars begin • The Register
A good overview of the difference between IaaS, PaaS, and SaaS offerings and how the big boys are moving into the cloud game.
Microsoft is all for the cloud, says chief executive Steve Ballmer. IBM has its new Smart Business Cloud. Oracle has its Exalogic cloud in a box. Amazon’s cloud services are growing apace. Salesforce.com and Google have always been cloud.
The economic arguments are unassailable. Economies of scale make cloud computing more cost effective than running their own servers for all but the largest organisations. Cloud computing is also a perfect fit for the smart mobile devices that are eating into the PC and laptop market.
Cloud computing is a fuzzy concept, though, and analysts distinguish between several models.
The first is infrastructure as a service (IaaS), where the customer replaces physical servers with virtual ones accessed over the internet. IaaS scales up or down on demand and replaces capital expenditure with operating expenditure, but customers still have to maintain the operating system, select and install applications, and solve dependency problems.
Fantastic elastic
Amazon dominates the IaaS market. Its Elastic Compute Cloud (EC2) became a production service in October 2008, and offers virtual machine instances of various sizes managed by a web services API. Most run some variety of Linux but EC2 also offers Windows Server.
Platform as a service (PaaS) abstracts much of the infrastructure away. You deploy an application onto a pre-existing platform which provides services such as data management, transactions, identity and authentication.
Google’s App Engine is an example of PaaS. You write your application in Java or Python and upload it. The application can use services such as a transactional datastore, task queue, user management, email, caching, and more. You have to trust Google to do these services right, but the benefit is zero maintenance. Also, if Google improves its implementation your application runs better automatically.
Like it or lump it
The third model, software as a service (SaaS), is the most abstracted form of cloud computing. Even the code that runs the application is managed by the provider and the customer simply signs up as a user. Salesforce.com and its customer relationship management application, Microsoft’s Business Productivity Online Suite, soon to be revised as Office 365, and Google Apps for email and document collaboration are all examples of SaaS.
SaaS enables the cloud provider to exploit another aspect of the cloud: multi-tenancy, where multiple customers run the same application. As features are added, all customers benefit, and the cloud provider can use its hardware at a high level of efficiency by tuning the workload across its servers…
Blurred vision
IaaS, PaaS and SaaS seem neat and tidy divisions but the distinctions between them are blurring as cloud vendors extend their offerings.
Take Microsoft Azure, for example. Microsoft positions Azure as PaaS, since its focus is on hosting applications built by its customers.
You can open up the Visual Studio development tool, select an Azure project and start building an application based on one or more roles: a web role for a web front end or for publishing a web services API, a worker role for background processing and a virtual machine role. This is where the PaaS/IaaS distinction begins to blur.
Azure is a platform play aimed squarely at enterprise customers. In contrast, Google has targeted smaller organisations first with App Engine, and is only now aiming at enterprise customers with its App Engine for Business, currently in preview.
App Engine for Business introduces a 99.9 per cent service-level agreement, an Enterprise Administration Console, and hosted SQL, all features which already have an equivalent in Azure.
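To put a 99.9 per cent figure in perspective, the arithmetic can be sketched as below. The function name and the 30-day month are assumptions made for illustration; real SLA terms define their own measurement periods and exclusions.

```python
def allowed_downtime_hours(availability_pct, period_hours):
    """Hours of downtime a given availability percentage permits in a period."""
    return (100.0 - availability_pct) / 100.0 * period_hours


HOURS_PER_YEAR = 365 * 24   # 8760, ignoring leap years
HOURS_PER_MONTH = 30 * 24   # 720, assuming a 30-day month

# 99.9 per cent leaves roughly 8.76 hours a year, or about 43 minutes a month.
yearly = allowed_downtime_hours(99.9, HOURS_PER_YEAR)
monthly = allowed_downtime_hours(99.9, HOURS_PER_MONTH)
```

Each extra "nine" divides the allowance by ten, which is why the gap between 99.9 and 99.99 per cent matters more than it looks.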
At the same time, there are elements of IaaS in Microsoft’s offering. Azure computing is purchased in instances of varying capacity, and each instance is in fact a Windows server virtual machine. You can even remote desktop into an instance, giving full access to Windows. Microsoft has also introduced a role which lets you build a virtual machine on your own system and upload it to run on Azure.
Meanwhile, Salesforce.com has taken its SaaS offering and extended it to look more like PaaS. The product is called force.com and was extended last year to provide generic data services, called database.com.
There’s also a partnership with VMware, called VMForce, which lets you run Java applications on a managed application server using the Spring development framework; and Salesforce.com has acquired Heroku which offers a platform for cloud-hosted Ruby applications.
Coming from the other end, Amazon is building PaaS on its IaaS foundation. In January Amazon announced its Elastic Beanstalk product, a Java application server where all you provide is the application; Amazon provisions a load balancer and deploys your app to one or more instances of Apache Tomcat. Elastic Beanstalk features automatic scaling on demand.
IaaS remains important but PaaS has more potential for shifting the burden of IT administration from customers to cloud providers and makes it easier for them to scale their IT resources according to needs.
Place your bets
PaaS is the sweet spot in cases where IaaS is too demanding and SaaS too inflexible. The distinctions between cloud platforms are breaking down but PaaS is increasingly prominent in many of them.
That leaves the question: who will win the cloud wars? Amazon is in an enviable position, with no legacy business to cannibalise and a strong foothold in the market. Google is also legacy-free, though App Engine has had a mixed reception and it is late in its enterprise play.
Microsoft by contrast is all legacy, and while Azure is technically a strong offering, the challenge lies in shifting its massive partner network and enterprise business from selling servers to cloud computing. Furthermore, the start-up community tends to look towards Linux and open-source platforms rather than Windows.
Summary of the Amazon EC2 and Amazon RDS Service Disruption
I will probably never read this all the way through. It’s dry… dry… dry! Consider this a link for posterity’s sake - Amazon’s official post mortem of their first major system disruption.
Now that we have fully restored functionality to all affected services, we would like to share more details with our customers about the events that occurred with the Amazon Elastic Compute Cloud (“EC2”) last week, our efforts to restore the services, and what we are doing to prevent this sort of issue from happening again. We are very aware that many of our customers were significantly impacted by this event, and as with any significant service issue, our intention is to share the details of what happened and how we will improve the service for our customers.
The issues affecting EC2 customers last week primarily involved a subset of the Amazon Elastic Block Store (“EBS”) volumes in a single Availability Zone within the US East Region that became unable to service read and write operations. In this document, we will refer to these as “stuck” volumes. This caused instances trying to use these affected volumes to also get “stuck” when they attempted to read or write to them. In order to restore these volumes and stabilize the EBS cluster in that Availability Zone, we disabled all control APIs (e.g. Create Volume, Attach Volume, Detach Volume, and Create Snapshot) for EBS in the affected Availability Zone for much of the duration of the event. For two periods during the first day of the issue, the degraded EBS cluster affected the EBS APIs and caused high error rates and latencies for EBS calls to these APIs across the entire US East Region. As with any complicated operational issue, this one was caused by several root causes interacting with one another and therefore gives us many opportunities to protect the service against any similar event reoccurring.
Preventing the Event
The trigger for this event was a network configuration change. We will audit our change process and increase the automation to prevent this mistake from happening in the future. However, we focus on building software and services to survive failures. Much of the work that will come out of this event will be to further protect the EBS service in the face of a similar failure in the future.
We will be making a number of changes to prevent a cluster from getting into a re-mirroring storm in the future. With additional excess capacity, the degraded EBS cluster would have more quickly absorbed the large number of re-mirroring requests and avoided the re-mirroring storm. We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures. We have already increased our capacity buffer significantly, and expect to have the requisite new capacity in place in a few weeks. We will also modify our retry logic in the EBS server nodes to prevent a cluster from getting into a re-mirroring storm. When a large interruption occurs, our retry logic will back off more aggressively and focus on re-establishing connectivity with previous replicas rather than futilely searching for new nodes with which to re-mirror. We have begun working through these changes and are confident we can address the root cause of the re-mirroring storm by modifying this logic. Finally, we have identified the source of the race condition that led to EBS node failure. We have a fix and will be testing it and deploying it to our clusters in the next couple of weeks. These changes provide us with three separate protections against having a repeat of this event.
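The retry change described above (back off harder, and keep trying previous replicas instead of hunting for new nodes) can be illustrated with a generic sketch. This is not Amazon's EBS code: the function names, the `try_connect` and `sleep` callbacks, and the base and cap values are all assumptions chosen for the example.

```python
import random


def backoff_delays(base=0.5, cap=60.0, attempts=6, rng=None):
    """Yield capped exponential delays with full jitter (values are assumptions)."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        # Jitter spreads retries out so nodes do not all retry in lockstep.
        yield rng.uniform(0, ceiling)


def reconnect(previous_replicas, try_connect, sleep, attempts=6, rng=None):
    """Retry only the replicas already known, backing off between rounds.

    try_connect(replica) -> bool and sleep(seconds) are hypothetical callbacks
    supplied by the caller. Returns the first replica that answers, or None.
    """
    for delay in backoff_delays(attempts=attempts, rng=rng):
        for replica in previous_replicas:
            if try_connect(replica):
                return replica
        sleep(delay)
    return None
```

In real use `sleep` would be `time.sleep`; passing it in keeps the sketch easy to test. The point of the design is that a node which has lost its peers waits progressively longer rather than flooding the cluster with new requests.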
Impact to Multiple Availability Zones
EC2 provides two very important availability building blocks: Regions and Availability Zones. By design, Regions are completely separate deployments of our infrastructure. Regions are completely isolated from each other and provide the highest degree of independence. Many users utilize multiple EC2 Regions to achieve extremely-high levels of fault tolerance. However, if you want to move data between Regions, you need to do it via your applications as we don’t replicate any data between Regions on our users’ behalf. You also need to use a separate set of APIs to manage each Region. Regions provide users with a powerful availability building block, but it requires effort on the part of application builders to take advantage of this isolation. Within Regions, we provide Availability Zones to help users build fault-tolerant applications easily. Availability Zones are physically and logically separate infrastructure that are built to be highly independent while still providing users with high speed, low latency network connectivity, easy ways to replicate data, and a consistent set of management APIs. For example, when running inside a Region, users have the ability to take EBS snapshots which can be restored in any Availability Zone and can programmatically manipulate EC2 and EBS resources with the same APIs. We provide this loose coupling because it allows users to easily build highly-fault-tolerant applications…
Even though we provide a degree of loose coupling for our customers, our design goal is to make Availability Zones indistinguishable from completely independent. Our EBS control plane is designed to allow users to access resources in multiple Availability Zones while still being tolerant to failures in individual zones. This event has taught us that we must make further investments to realize this design goal. There are three things we will do to prevent a single Availability Zone from impacting the EBS control plane across multiple Availability Zones. The first is that we will immediately improve our timeout logic to prevent thread exhaustion when a single Availability Zone cluster is taking too long to process requests. This would have prevented the API impact from 12:50 AM PDT to 2:40 AM PDT on April 21st. To address the cause of the second API impact, we will also add the ability for our EBS control plane to be more Availability Zone aware and shed load intelligently when it is over capacity. This is similar to other throttles that we already have in our systems. Additionally, we also see an opportunity to push more of our EBS control plane into per-EBS cluster services. By moving more functionality out of the EBS control plane and creating per-EBS cluster deployments of these services (which run in the same Availability Zone as the EBS cluster they are supporting), we can provide even better Availability Zone isolation for the EBS control plane.
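The idea of keeping one slow Availability Zone from exhausting a shared pool of threads can be sketched as a per-zone cap on in-flight requests, with anything over the cap rejected immediately. This is an illustration of the general technique only; the class name, the limit, and the zone labels are hypothetical and say nothing about how the EBS control plane is actually built.

```python
import threading


class ZoneLimiter:
    """Caps in-flight requests per Availability Zone and sheds the excess."""

    def __init__(self, max_in_flight_per_zone):
        self._limit = max_in_flight_per_zone
        self._in_flight = {}
        self._lock = threading.Lock()

    def try_acquire(self, zone):
        """Return True if the request may proceed, False if it should be shed."""
        with self._lock:
            current = self._in_flight.get(zone, 0)
            if current >= self._limit:
                return False  # fail fast instead of tying up another thread
            self._in_flight[zone] = current + 1
            return True

    def release(self, zone):
        """Mark one request for the zone as finished."""
        with self._lock:
            self._in_flight[zone] -= 1
```

A caller would wrap each request in `try_acquire` and `release`. When one zone stalls, it fills only its own quota, so requests for healthy zones still find free capacity.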
Making it Easier to Take Advantage of Multiple Availability Zones
We also intend to make it easier for customers to take advantage of multiple Availability Zones. First, we will offer multiple Availability Zones for all of our services, including Amazon Virtual Private Cloud (“VPC”). Today, VPC customers only have access to a single Availability Zone. We will be adjusting our roadmap to give VPC customers access to multiple Availability Zones as soon as possible. This will allow VPC customers to build highly-available applications using multiple Availability Zones just as EC2 customers not using a VPC do today.
A related finding from this event is we need to do a better job of making highly-reliable multi-AZ deployments easy to design and operate. Some customers’ applications (or critical components of the application like the database) are deployed in only a single Availability Zone, while others have instances spread across Availability Zones but still have critical, single points of failure in a single Availability Zone. In cases like these, operational issues can negatively impact application availability when a robust multi-Availability Zone deployment would allow the application to continue without impact. We will look to provide customers with better tools to create multi-AZ applications that can support the loss of an entire Availability Zone without impacting application availability. We know we need to help customers design their application logic using common design patterns. In this event, some customers were seriously impacted, and yet others had resources that were impacted but saw nearly no impact on their applications.
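The single-point-of-failure problem described above is easy to check for once you have an inventory of what runs where. The sketch below assumes a hypothetical inventory format, a list of (component, zone) pairs, and simply reports components that live in only one Availability Zone.

```python
from collections import defaultdict


def single_zone_components(instances):
    """Return the components whose instances all sit in one Availability Zone.

    instances is an iterable of (component, zone) pairs; this inventory
    format is an assumption made for the example.
    """
    zones_by_component = defaultdict(set)
    for component, zone in instances:
        zones_by_component[component].add(zone)
    return sorted(c for c, zones in zones_by_component.items() if len(zones) < 2)


inventory = [
    ("web", "us-east-1a"),
    ("web", "us-east-1b"),
    ("database", "us-east-1a"),
    ("database", "us-east-1a"),
]
at_risk = single_zone_components(inventory)
```

Here the web tier spans two zones, but both database instances share one, so the database is the component that would take the application down with its zone.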
Improving Communication and Service Health Tools During Operational Issues
In addition to the technical insights and improvements that will result from this event, we also identified improvements that need to be made in our customer communications. We would like our communications to be more frequent and contain more information. We understand that during an outage, customers want to know as many details as possible about what’s going on, how long it will take to fix, and what we are doing so that it doesn’t happen again. Most of the AWS team, including the entire senior leadership team, was directly involved in helping to coordinate, troubleshoot and resolve the event. Initially, our primary focus was on thinking through how to solve the operational problems for customers rather than on identifying root causes. We felt that focusing our efforts on a solution and not the problem was the right thing to do for our customers, and that it helped us to return the services and our customers back to health more quickly. We updated customers when we had new information that we felt confident was accurate and refrained from speculating, knowing that once we had returned the services back to health that we would quickly transition to the data collection and analysis stage that would drive this post mortem.
That said, we think we can improve in this area. We switched to more regular updates part of the way through this event and plan to continue with similar frequency of updates in the future. In addition, we are already working on how we can staff our developer support team more expansively in an event such as this, and organize to provide early and meaningful information, while still avoiding speculation.
Why I, Jeff Bezos, Keep Spending Billions On Amazon R&D (AMZN)
Bezos doesn’t simply talk about why he spends money on R&D, he talks about the philosophy behind how Amazon builds out its services and infrastructure.
Look inside a current textbook on software architecture, and you’ll find few patterns that we don’t apply at Amazon. We use high-performance transactions systems, complex rendering and object caching, workflow and queuing systems, business intelligence and data analytics, machine learning and pattern recognition, neural networks and probabilistic decision making, and a wide variety of other techniques. And while many of our systems are based on the latest in computer science research, this often hasn’t been sufficient: our architects and engineers have had to advance research in directions that no academic had yet taken. Many of the problems we face have no textbook solutions, and so we — happily — invent new approaches.
Our technologies are almost exclusively implemented as services: bits of logic that encapsulate the data they operate on and provide hardened interfaces as the only way to access their functionality. This approach reduces side effects and allows services to evolve at their own pace without impacting the other components of the overall system. Service-oriented architecture — or SOA — is the fundamental building abstraction for Amazon technologies. Thanks to a thoughtful and far-sighted team of engineers and architects, this approach was applied at Amazon long before SOA became a buzzword in the industry. Our e-commerce platform is composed of a federation of hundreds of software services that work in concert to deliver functionality ranging from recommendations to order fulfillment to inventory tracking. For example, to construct a product detail page for a customer visiting Amazon.com, our software calls on between 200 and 300 services to present a highly personalized experience for that customer.
State management is the heart of any system that needs to grow to very large size. Many years ago, Amazon’s requirements reached a point where many of our systems could no longer be served by any commercial solution: our key data services store many petabytes of data and handle millions of requests per second. To meet these demanding and unusual requirements, we’ve developed several alternative, purpose-built persistence solutions, including our own key-value store and single table store. To do so, we’ve leaned heavily on the core principles from the distributed systems and database research communities and invented from there. The storage systems we’ve pioneered demonstrate extreme scalability while maintaining tight control over performance, availability, and cost. To achieve their ultra-scale properties these systems take a novel approach to data update management: by relaxing the synchronization requirements of updates that need to be disseminated to large numbers of replicas, these systems are able to survive under the harshest performance and availability conditions. These implementations are based on the concept of eventual consistency. The advances in data management developed by Amazon engineers have been the starting point for the architectures underneath the cloud storage and data management services offered by Amazon Web Services (AWS). For example, our Simple Storage Service, Elastic Block Store, and SimpleDB all derive their basic architecture from unique Amazon technologies.
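Eventual consistency is easier to picture with a toy model. The sketch below is not Amazon's design: it uses a simple last-writer-wins rule with caller-supplied version tuples, which is an assumption made to keep the example short. Each replica accepts writes on its own, and replicas converge when they later exchange state.

```python
class Replica:
    """A toy key-value replica that resolves conflicts by highest version."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def put(self, key, value, version):
        """Accept a write locally; keep it only if it is newer than what we hold."""
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def get(self, key):
        """Return the locally visible value, which may be stale."""
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        """Pull the other replica's entries; newer versions overwrite older ones."""
        for key, (version, value) in other.data.items():
            self.put(key, value, version)


a, b = Replica("a"), Replica("b")
a.put("cart", ["book"], (1, "a"))        # version is (counter, writer name)
b.put("cart", ["book", "cd"], (2, "b"))
stale = a.get("cart")                    # replica a has not seen b's write yet
a.sync_from(b)
b.sync_from(a)
```

Between the write and the sync, a reader of replica `a` sees the old cart. That window is the price paid for never having to block a write while waiting on every replica.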
Other areas of Amazon’s business face similarly complex data processing and decision problems, such as product data ingestion and categorization, demand forecasting, inventory allocation, and fraud detection. Rule-based systems can be used successfully, but they can be hard to maintain and can become brittle over time. In many cases, advanced machine learning techniques provide more accurate classification and can self-heal to adapt to changing conditions. For example, our search engine employs data mining and machine learning algorithms that run in the background to build topic models, and we apply information extraction algorithms to identify attributes and extract entities from unstructured descriptions, allowing customers to narrow their searches and quickly find the desired product. We consider a large number of factors in search relevance to predict the probability of a customer’s interest and optimize the ranking of results. The diversity of products demands that we employ modern regression techniques like trained random forests of decision trees to flexibly incorporate thousands of product attributes at rank time. The end result of all this behind-the-scenes software? Fast, accurate search results that help you find what you want.
All the effort we put into technology might not matter that much if we kept technology off to the side in some sort of R&D department, but we don’t take that approach. Technology infuses all of our teams, all of our processes, our decision-making, and our approach to innovation in each of our businesses. It is deeply integrated into everything we do.
We live in an era of extraordinary increases in available bandwidth, disk space, and processing power, all of which continue to get cheap fast. We have on our team some of the most sophisticated technologists in the world – helping to solve challenges that are right on the edge of what’s possible today.
As I’ve discussed many times before, we have unshakeable conviction that the long-term interests of shareowners are perfectly aligned with the interests of customers.
And we like it that way. Invention is in our DNA and technology is the fundamental tool we wield to evolve and improve every aspect of the experience we provide our customers. We still have a lot to learn, and I expect and hope we’ll continue to have so much fun learning it. I take great pride in being part of this team.
The AWS Outage: The Cloud's Shining Moment - O'Reilly Broadcast
I just realized that the comments from the O’Reilly story on Amazon are also really good. Here are some of the stand-outs.
I thought the failure was that multiple availability zones died simultaneously, something that by design and per Amazon’s docs should never happen short of a hurricane in Virginia. Note that it is exponentially harder to distribute your app across not only AZs but geographical areas as well: high speed links connect AZs within a geo, but going from one geo to another is extremely slow and not realtime.
Of course you design for failure, it happens every day on AWS. But can you design around multiple datacenters (availability zones) dying simultaneously? When AWS told you not to worry about that eventuality? Probably not without downtime and some serious compromises.
The problem is that once EVERYONE falls back to a service in another availability zone, that zone suddenly has to handle twice the load (probably a lot more when Virginia goes down, because it’s generally believed to have the most instances). We saw pretty heavy slowdown across zones even with only a handful of people following this approach. You need to either bring another provider into the mix, or just have faith that AWS keeps piles and piles of spare capacity.
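The "twice the load" point is simple arithmetic. Assuming, purely for illustration, that load is spread evenly across zones before the failure and redistributes evenly afterwards, the extra burden on each surviving zone can be worked out like this (the function names are made up for the example):

```python
def surviving_zone_load_factor(zones, failed=1):
    """How much each surviving zone's load multiplies after a failure.

    Assumes load was evenly spread and redistributes evenly, which is a
    simplification for illustration.
    """
    if failed < 0 or failed >= zones:
        raise ValueError("need at least one surviving zone")
    return zones / (zones - failed)


def max_safe_utilization(zones, failed=1):
    """Highest steady-state utilization that still absorbs the failure."""
    return 1.0 / surviving_zone_load_factor(zones, failed)


two_zone_factor = surviving_zone_load_factor(2)    # each survivor doubles
four_zone_factor = surviving_zone_load_factor(4)   # each survivor takes on a third more
two_zone_headroom = max_safe_utilization(2)        # must run at or below half capacity
```

With two zones the survivor doubles its load, so each zone has to idle at 50 per cent or less to cope. Spreading across more zones shrinks the spare capacity needed, but somebody still has to pay for it.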
AWS previously assured us that multiple Availability Zones wouldn’t realistically fail at the same time. Now that has proved to be untrue, you choose to say “Ah - you shouldn’t have believed AWS, you should have been using multiple regions”. Presumably when the next outage hits both US regions you’ll say “Ah - of course you should have used the EU and Asia regions as well”.
We should recognize AWS as a single point of failure and look at hosting across multiple providers. Fool me once, shame on you; fool me twice, shame on me.
This does require sophisticated management tools like enStratus, but you should use those tools to avoid putting all your eggs into the AWS basket.
I’m not sure that the rest of the technology stack has necessarily caught up to this model though - in particular NoSQL databases aren’t the panacea you appear to believe them to be. Hopefully all the pieces of the technology stack will evolve.
AWS has never in any conversation I have ever had said that multiple availability zones would not realistically fail at the same time. If they felt that way, don’t you think they’d have an SLA better than 99.9%?
Of course, if you want to survive the failure of multiple availability zones, you should spread yourself across regions. I don’t understand why this is so hard for people to understand.
Similarly, yes, you should have some ability to migrate your systems into another cloud. I don’t think actual technical loss of all AWS regions (or even multiple regions) can happen absent of nuclear war or asteroid strike, but companies do go out of business/get sued/etc.
“The strength of cloud computing is that it puts control over application availability in the hands of the application developer and not in the hands of your IT staff, data center limitations, or a managed services provider”
Sadly, no. That the developers are in charge of this stuff is why so many sites were down completely. They’re terrible at it, don’t value it, and even when they try to roll it out they do it poorly.
An IT guy would have spent 15 minutes on day 1 thinking about disaster recovery. A developer always wants to do it tomorrow and tomorrow, as we all know, never comes.
Certainly, you can design for failure. And for those where failure is literally not an option, like things that deal with life-safety or where thousands of dollars are lost every second, sure.
But one simple thing you don’t address is that designing for failure is a lot more expensive during the development cycle. Yes, any bridge across troubled water can be over-built to ensure that it never, ever fails, but doing so is often so cost-prohibitive as to be unrealistic.
And lastly, your advocacy would have sounded more credible if you had stated up-front that you are CTO of a company that purports to help people solve this problem in exchange for mucho dinero rather than bury that fact in the middle of the article.
“In short, if your systems failed in the Amazon cloud this week, it wasn’t Amazon’s fault. You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon’s cloud computing model.”
Oh yes, a company would provide you a cloud service to host your data and when that very service fails and renders your own operation useless it is your fault. So why pay for their service in the first place?
This sounded like an argument made by either a total irrational fanatic, or another network guy who is clueless about creating software, or both. Seems like outages these days are always the fault of the software developer(s) and never that of the one maintaining the network resource.
Anybody with common sense should stop reading right after that quoted sentence.
Well it’s the fault of the software developers for thinking they don’t need “clueless network guys”
You realize “in th3 cloud!!! ZOMG we’re in cloud!!!” means nothing more than you’re running a virtual machine in a data center somewhere. That’s all Amazon does. It’s not magic - you’re not safe because you’re “in the cloud.”
It’s just a datacenter. That most software engineers don’t know that is why they need “clueless network guys” to point it out to them.
This article has a point; any company running on AWS could have designed its system to survive this outage. But this missed two key points:
1) How could they test this survivability, end-to-end, ahead of time? The rule is, if you didn’t test it, it probably won’t work. Companies that survived unscathed were prepared *and* lucky.
2) What about recovery? The statement “No humans” is wrong. A company may be able to design for the initial outage, but designing to automatically handle a period of days where Amazon are fiddling with flaky infrastructure is practically impossible. Everyone will have a lot of overtime afterwards, making sure their systems are working perfectly and data is consistent.
And the elephant in the room is that startups can only tackle a few problems at once. If they pile resources into 99.999% reliability, the opportunity cost is that the rest of their development goes slower and they fall behind the competition.
Crafting the test scenarios for cloud computing can definitely be challenging, but it is doable.
My best advice is to assume that anywhere you have just one of something (e.g. one availability zone), you have some kind of single point of failure, and to shut down access to that single point of failure in your tests.
Then automate your tests!
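As a rough illustration of that advice, here is a toy failure-injection check. `ToyService` is a hypothetical in-memory stand-in for a real deployment, and the zone names are only examples; a real test would cut access to actual infrastructure.

```python
from typing import Dict, List


class ToyService:
    """Hypothetical stand-in for a deployment spread over several zones."""

    def __init__(self, zones: List[str]) -> None:
        self.healthy: Dict[str, bool] = {zone: True for zone in zones}

    def disable(self, zone: str) -> None:
        self.healthy[zone] = False

    def restore(self, zone: str) -> None:
        self.healthy[zone] = True

    def handle(self, request: str) -> str:
        # Serve from the first zone that is still reachable.
        for zone, ok in self.healthy.items():
            if ok:
                return f"{request} served from {zone}"
        raise RuntimeError("no healthy zone available")


def survives_single_failures(service: ToyService) -> Dict[str, bool]:
    """Disable each zone in turn and record whether requests still succeed."""
    results: Dict[str, bool] = {}
    for zone in list(service.healthy):
        service.disable(zone)
        try:
            service.handle("ping")
            results[zone] = True
        except RuntimeError:
            results[zone] = False
        finally:
            service.restore(zone)
    return results


print(survives_single_failures(ToyService(["us-east-1a", "us-east-1b"])))
print(survives_single_failures(ToyService(["us-east-1a"])))
```

A deployment with only one zone fails the check immediately, which is exactly the single point of failure the test is meant to surface.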
I wondered when the Cloud Snow Job would begin, surprised it’s on O’Reilly frankly.
What this article doesn’t “get” is that Amazon fundamentally did not deliver what it said on its tin.
Also, the solutions espoused here are pretty standard traditional datacentre operating procedures, which cost real money - the whole point of the “Cloud” was to avoid these costs, else why bother?
As to “Applications built with “design for failure” in mind don’t need SLAs.” - run a mile from anyone who suggests that.
I think the vitriol behind some of these comments shows exactly why the business is moving to the cloud.
IT: The Department of No
Do you guys think people are really duped into moving to the cloud because of pots of gold and promises of no worries?
No. They are moving there because you, the IT leader, make it impossible to do their jobs. Procuring a server or even a VM in most organizations takes 3-6 weeks (or, in some cases, 3-6 months) and the business has real work to do that actually generates revenue for your company.
But you are saying no to the business and going on forums yapping about why the cloud sucks.
In the meantime, they are sticking systems without any controls or risk analysis or redundancy in the cloud.
Stop bashing the cloud and do your job. Help them move into the cloud appropriately.
The reason it can take some time to get a new system up (3-6 weeks is probably long for a small company but believable for a big one, and is certainly long for any company which has virtualized internally) is because that system needs to have controls and there needs to be risk analysis and redundancy planning exactly as you suggest. The cloud doesn’t mitigate any of that. It just shifts the responsibility for it from the “Department of No” to the “Developer who doesn’t know.” It still takes time and effort and skill to do it right. Which I think was the original point of your post.
But the bleeding edge developers who want to do something cool and not have to worry about those pesky details have to either wait for somebody to do that part for them, or take it to the cloud and accept that their amazing app is going to be subject to the whims of Amazon’s IT staff who don’t give a darn about them and their puny app, instead of the whims of their own IT guy that they could be taking out for a beer every now and then. IT can be a good friend. But you can’t treat them badly (or, for example, call them names) and expect them to still cater to your every whim when you have it. There are 50 other developers clamoring for the same thing. They’re probably off helping the ones they like more.
I think that calling this week the cloud’s “shining moment” is stretching things. It would be more accurate to say that, since it’s the cloud, recovery is easier. With native Amazon tools in some cases, and strong devops tools and practices in others, DR can in fact be radically cheaper and easier in the cloud than in traditional, physical infrastructures.
Cheaper and easier, though, is only part of the point of the cloud. Another important point is that it lets IT focus more on the business and less on infrastructure complexities. At the moment, major cloud providers remove physical complexities but not really software ones. It’s still up to us to design, build, deploy, and manage those great devops practices. Vendors like Enstratus take the next step and bite off that layer. Either poetically or ironically, depending on your viewpoint, they use the cloud to recover from failures in the cloud. (If nothing else, we’re proving there is no such thing as THE cloud. Maybe we should call it “the sky” instead. The sky definitely was falling this week! :-).
In any case, it would be interesting to learn how much load the various cloud automation services handled this week. Can they scale if thousands of AWS customers use their service to migrate tens of thousands of servers and terabytes/petabytes of data all at once, or does the meltdown cascade from one level of the cloud to the next?
Data, on the other hand, is a whole different kettle of fish. If we all waited to migrate to the cloud until we’d implemented true design-for-fail architectures, cloud adoption would be at least an order of magnitude slower than it is. The bottom line is that we’re still only partway through the journey to the holy grail of the cloud. So far we’ve pushed the complexity, difficulty, and cost up the stack from hardware to software and systems architecture. This week the tradeoffs that were made at the architecture level are revealing themselves for everyone to see.
No, that’s completely wrong. Partition tolerance is the ability to continue to meet your service guarantees in the presence of communication failures that isolate portions of your system.
Also, non-relational is not the same thing as eventually consistent (just ask the HBase folks). You can have strong consistency requirements without using a relational storage model.
When you say, “The knee-jerk reaction is to look for an SLA from your cloud provider,” you are ignoring the point made above by Abol. Amazon claimed that zones were independent, but an EBS failure affected multiple zones. They fell into the same trap you are warning about: they didn’t design for failure.
This was a moderately interesting but also intensely frustrating posting.
“This should never have happened if you designed your services right, to never trust (that one) cloud!”
Yes, but, the number of people who have all of the CS and IT architecture and IT operations backgrounds to understand how to not trust any single point of failure and actually design really robust systems around that sort of thing is not that large.
What is being suggested is that the entire industry must suddenly develop a higher level of technical competence than it now has, by a large factor.
Would this be a good thing? Of course. Is it practically going to reach the priority level that real operational organizations can make it happen? Unlikely.
Eventually, attempting to wring the last 9’s out of a service, one runs into externalities such as partitioned and failing backbone ISPs, major DNS outages, physical damage to infrastructure, and other hard to solve problems. One can design right up to that ragged external unavoidable outage edge, with arbitrary amounts of time and money and expertise. I and a few others are happy to do that for clients. But knowing what is economical and sensible, and what is polishing the shine on areas when there are larger inherent risks accepted as costs of doing business, is important.
Actually, almost all you need to know to achieve AZ-redundancy is to simply follow programming best practices you were supposed to be following in the first place.
Getting x-region and x-cloud is moderately more difficult, but not as difficult as doing it for a traditional data center.
No, not even close. You need to follow system design and integration best practices you were supposed to be following in the first place, which includes programming and architecture and all the other subcomponents.
The number of organizations that actually meet system design and integration best practices, in the real world, is very small. Hence my frustration. Actual high availability and dependability is a much harder problem than people tend to think it is. Saying it’s just a programming problem is obfuscating.
All the things that need to be done are described in literature and operational reports and so forth. None of them are secret or particularly obscure. But the rigorous study of systems architecture needed to understand the scope of it well enough to conceive of it and implement it is rare.
So a cloud provider is just like any other datacenter, except you have to spend a lot more money in development to work around their unreliability. Awesome!
The AWS Outage: The Cloud's Shining Moment - O'Reilly Broadcast
This is, by far, the most interesting article I’ve seen on Amazon’s recent outage - a typically contrarian view from O’Reilly Radar - because it highlights what’s different about cloud computing vs. traditional computing. According to George Reese, the outage actually exposed the strength of cloud computing. The AWS cloud works best when the applications above it are “designed for failure,” shifting the burden of uptime from the infrastructure layer to the application layer. Read on…
So many cloud pundits are piling on to the misfortunes of Amazon Web Services this week as a response to the massive failures in the AWS Virginia region. If you think this week exposed weakness in the cloud, you don’t get it: it was the cloud’s shining moment, exposing the strength of cloud computing.
In short, if your systems failed in the Amazon cloud this week, it wasn’t Amazon’s fault. You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon’s cloud computing model. The strength of cloud computing is that it puts control over application availability in the hands of the application developer and not in the hands of your IT staff, data center limitations, or a managed services provider.
The AWS outage highlighted the fact that, in the cloud, you control your SLA, not AWS.
The Dueling Models of Cloud Computing
Until this past week, there’s been a mostly silent war raging out there between two dueling architectural models of cloud computing applications: “design for failure” and traditional. This battle is about how we ultimately handle availability in the context of cloud computing.
The Amazon model is the “design for failure” model. Under the “design for failure” model, combinations of your software and management tools take responsibility for application availability. The actual infrastructure availability is entirely irrelevant to your application availability. 100% uptime should be achievable even when your cloud provider has a massive, data-center-wide outage.
Most cloud providers follow some variant of the “design for failure” model. A handful of providers, however, follow the traditional model in which the underlying infrastructure takes ultimate responsibility for availability. It doesn’t matter how dumb your application is, the infrastructure will provide the redundancy necessary to keep it running in the face of failure. The clouds that tend to follow this model are vCloud-based clouds that leverage the capabilities of VMware to provide this level of infrastructural support.
The advantage of the traditional model is that any application can be deployed into it and assigned the level of redundancy appropriate to its function. The downside is that the traditional model is heavily constrained by geography. It would not have helped you survive this level of cloud provider (public or private) outage.
The advantage of the “design for failure” model is that the application developer has total control of their availability with only their data model and volume imposing geographical limitations. The downside of the “design for failure” model is that you must “design for failure” up front.
Applied “Design for Failure”
In presentations, I refer to the “design for failure” model as the AWS model. AWS doesn’t have any particular monopoly on this model, but their lack of persistent virtual machines pushes this model to its extreme. Actually, best practices for building greenfield applications in most clouds fit under this model.
The fundamental principle of “design for failure” is that the application is responsible for its own availability, regardless of the reliability of the underlying cloud infrastructure. In other words, you should be able to deploy a “design for failure” application and achieve 99.9999% uptime (really, 100%) leveraging any cloud infrastructure. It doesn’t matter if the underlying infrastructural components have only a 90% uptime rating. It doesn’t matter if the cloud has a complete data center meltdown that takes it entirely off the Internet.
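The arithmetic behind those two figures can be sketched as follows. This is a back-of-the-envelope model, not a guarantee: it assumes each of n redundant components has the same availability a and that components fail independently, which a correlated outage across zones would violate. The symbols a, n, and A_total are introduced here for illustration.

```latex
% The application is down only when all n redundant components are down at once.
A_{\text{total}} = 1 - (1 - a)^{n}

% With components at 90% uptime (a = 0.9) and six independent replicas (n = 6):
A_{\text{total}} = 1 - (0.1)^{6} = 1 - 10^{-6} = 0.999999 \quad (99.9999\%)
```

Under this model, every additional independent replica at 90% uptime adds one more nine. The independence assumption is what does the heavy lifting.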
There are several requirements for “design for failure”:
- Each application component must be deployed across redundant cloud components, ideally with minimal or no common points of failure
- Each application component must make no assumptions about the underlying infrastructure—it must be able to adapt to changes in the infrastructure without downtime
- Each application component should be partition tolerant—in other words, it should be able to survive network latency (or loss of communication) among the nodes that support that component
- Automation tools must be in place to orchestrate application responses to failures or other changes in the infrastructure (full disclosure, I am CTO of a company that sells such automation tools, enStratus)
Applications built with “design for failure” in mind don’t need SLAs. They don’t care about the lack of control associated with deploying in someone else’s infrastructure. By their very nature, they will achieve uptimes you can’t dream of with other architectures and survive extreme failures in the cloud infrastructure.
Let’s look at a design for failure model that would have come through the AWS outage with flying colors:
- Dynamic DNS pointing to elastic load balancers in Virginia and California
- Load balancers routing to web applications in at least two zones in each region
- A NoSQL data store with the ring spread across all web application availability zones in both Virginia and California
- A cloud management tool (running outside the cloud!) monitoring this infrastructure for failures and handling reconfiguration
Upon failure, your California systems and the management tool take over. The management tool reconfigures DNS to remove the Virginia load balancer from the mix. All traffic is now going to California. The web applications in California are stupid and don’t care about Virginia under any circumstance, and your NoSQL system is able to deal with the lost Virginia systems. Your cloud management tool attempts to kill off all Virginia resources and bring up resources in California to replace the load.
Voila, no humans, no 2am calls, and no outage! Extra bonus points for “bursting” into Singapore, Japan, Ireland, or another cloud! When Virginia comes back up, the system may or may not attempt to rebalance back into Virginia.
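The failover sequence described above can be sketched as a small reconciliation loop. This is a toy model under stated assumptions, not how enStratus or any real management tool works: `Region`, `FailoverController`, and the server counts are hypothetical, and real DNS and provisioning calls are replaced by in-memory state.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Region:
    name: str
    servers: int          # running web servers (illustrative unit)
    healthy: bool = True


@dataclass
class FailoverController:
    """Hypothetical management tool, assumed to run outside the cloud."""
    regions: Dict[str, Region]
    dns_pool: Set[str] = field(default_factory=set)

    def reconcile(self) -> Set[str]:
        healthy = [r for r in self.regions.values() if r.healthy]
        if not healthy:
            raise RuntimeError("no healthy region left to fail over to")
        failed = [r for r in self.regions.values() if not r.healthy]

        # "Kill off" resources in failed regions and count the lost capacity.
        lost = sum(r.servers for r in failed)
        for region in failed:
            region.servers = 0

        # Bring up replacement capacity, spread evenly over healthy regions.
        share, remainder = divmod(lost, len(healthy))
        for index, region in enumerate(healthy):
            region.servers += share + (1 if index < remainder else 0)

        # Reconfigure DNS so only healthy regions receive traffic.
        self.dns_pool = {r.name for r in healthy}
        return self.dns_pool


regions = {
    "virginia": Region("virginia", servers=4),
    "california": Region("california", servers=4),
}
controller = FailoverController(regions)
regions["virginia"].healthy = False          # simulate the regional outage
print(controller.reconcile(), regions["california"].servers)
```

Whether to rebalance when the failed region returns is left out here, matching the article's "may or may not".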
Control, SLAs, Cloud Models, and You
When you make the move into the cloud, you are doing so exactly because you want to give up control over the infrastructure level. The knee-jerk reaction is to look for an SLA from your cloud provider to cover this lack of control. The better reaction is to deploy applications in the cloud designed to make your lack of control irrelevant. It’s not simply an availability issue; it also extends to other aspects of cloud computing like security and governance. You don’t need no stinking SLA.
As I stated earlier, this outage highlights the power of cloud computing. What about Netflix, an AWS customer that kept on going because they had proper “design for failure”? Try doing that in your private IT infrastructure with the complete loss of a data center. What about another AWS/enStratus startup customer who did not design for failure, but took advantage of the cloud DR capabilities to rapidly move their systems to California? What startup would ever have been able to relocate their entire application across country within a few hours of the loss of their entire data center without already paying through the nose for it?
These kinds of failures don’t expose the weaknesses of the cloud—they expose why the cloud is so important.