Dell's bespoke server unit pushes over $1bn of tin • The Register
An interesting article on Dell’s bespoke server unit which targets the world’s 20 largest hyperscale data center operators. Facebook was an early success story before they founded the Open Compute project and started designing their own servers, like Google before them. But now Dell is looking to commercialize some of the designs they originally developed for this space.
It has been five years since Forrest Norrod and his colleagues at Dell drew up the first custom server design on a napkin at a bar at the Driskill Hotel in Austin, Texas, getting the server maker into the tailoring business. Dell now custom fits servers for very precise workloads and can cater to the tight data center power and cooling requirements found at hyperscale web operators.
The tailoring unit, called Data Center Solutions (DCS), has now grown to over $1bn in sales, says Norrod. Norrod was vice president and general manager of the unit when Dell first started publicly talking about its operations back in October 2008. He is now general manager of server platforms at Dell, and tells El Reg that the DCS business is now a “greater than $1bn a year business,” adding: “We beat that a while ago.”
DCS was founded originally to chase the world’s top 20 hyperscale data center operators, and creates stripped-down, super-dense, and energy-efficient machines that can mean the difference between a profit and a loss for those data center operators. These DCS machines were not aimed at general purpose server users, whose workloads generally run on one machine and who need RAID disk controllers, service processors, and other high availability features because all of their eggs are in one basket (if not literally, then there is one server per workload – so it amounts to the same if you are looking at it at the app level). The DCS custom designs were built for companies running parallel workloads that have redundancy, data replication, and failover built into the software stack – so a regular PowerEdge server would not just be overkill, it would be plain stupid.
Facebook was the poster child for the DCS business before the social media giant decided to launch the Open Compute project last April, open-sourcing its own server and data center designs and going straight to original design manufacturers (ODMs) to build its gear.
Dell struck oil with this custom server thing, but Google builds its own gear, and now so does Facebook. Even fellow Texan Rackspace Hosting is using whitebox servers to get more server for the dollar.
But a year before this all happened, Dell saw the writing on the bespoke server wall and did a smart thing: it partially commercialized some of the DCS designs and took them to market as the PowerEdge-C machines. But you can’t just log into the Dell site and buy a PowerEdge-C machine, you have to engage in a formal sales process so Dell can make sure you get the right iron.
“Our aim with the PowerEdge-Cs was for the next 1,000 customers who needed DCS-style machines, and we have blown way beyond that,” brags Steve Cumings, executive director of marketing for the DCS unit. And while Cumings won’t talk specific numbers, he adds “it is not 1,001, either”.
In the Dell lingo, the custom machines are known as the DCS “classic” boxes while the other cloudy boxes are called “PECs” after the abbreviation of their formal name. The PowerEdge-C machines include not only traditional multi-node bare-bones servers in 1U and 2U chassis but also a 3U chassis that can cram in up to a dozen single-socket microservers. Dell has even built mini-servers based on VIA Technologies’ X86 processors for hosting customers looking for cheap, dedicated nodes. In all cases, these servers have shared power and fans and precious little else but CPUs, memory, and disks.
The DCS unit has another part of the business, which is the modular data centers built from the shipping container or from other modular components. Microsoft is a big Dell customer for modular data centers. Cumings says these containerized data centers are only available to the top-end hyperscale customers because there is so much demand that Dell can’t meet it. Of course, demand is a relative thing. Cumings estimates that worldwide, there are on the order of several hundred containers being used as data centers among the hyperscale crowd. “Our impression is that we are comfortably number one in that market,” says Cumings.
Whether containerized data centers can go quasi-volume, as the PowerEdge-C servers did with several of the DCS custom server designs, remains to be seen. A lot depends on how radically customers are prepared to change their data center operations, how outdated their glass houses are, and how quickly they need to upgrade.
Dell has not broken out shipments and revenues for the DCS unit in the past, but it has hinted that this stealthy part of its business would have been among the top five server shippers in 2008 – and that in some quarters would have ranked as high as number three. This ranking would also depend on the shipments for Dell’s other stealth server business, its OEM Solutions Group, which OEMs Dell’s traditional PowerEdge servers for appliances, kiosks, and other interested parties. Three years ago, the OEM Server Group was twice as large, in terms of revenues, as the DCS unit. The gap has probably closed significantly since then.
Back in October 2008, the DCS unit had 200 employees, mostly engineers and sales people who helped craft machines and the houses they run in for hyperscale customers. Three-and-a-half years later, DCS now has 400 employees. Both then and now, DCS relied on other parts of Dell for back office functions, parts acquisitions, and manufacturing. Norrod says that while DCS uses the same manufacturing facilities as the general-purpose PowerEdge machines, they tend to have dedicated lines because the DCS run is usually 10,000 or more of the same thing, rather than a handful of similar machines built on demand for one set of customer orders that day, followed by a complete new set of machines with different configurations once they are built. You have to tool and run a DCS line differently from a PowerEdge line, he says – and in some ways, it is easier.
If the employee count is any guide, then it looks like the DCS biz has at least doubled in that time and is probably about the same size as the Dell OEM Solutions Group. Dell had about $8.2bn in servers and networking revenues in the trailing four quarters, and it is not unreasonable that these two stealthy server units could account for around a third of Dell’s server revenues.
In the latest server numbers out of IDC, which covered the fourth quarter of 2011, density-optimized servers like those sold by the DCS unit accounted for 132,876 units (up 51.5 per cent) and generated $458m (up 33.8 per cent). Dell had a 39 per cent share of shipments of these boxes (that’s 51,821 machines) and a 45.2 per cent share of sales (or $195m). IDC says that this was double the revenue and shipments of the nearest competitor in this category. Clearly, IDC’s definition of a density-optimized machine and DCS’ sales SKUs don’t overlap completely, given that DCS is pushing more than $1bn in sales per year. But Dell wanted to brag about the IDC numbers just the same.
ARMed … and possibly Tilera’d and FPGA’d
Dell Enterprise Products Group, which makes the PowerEdge line, has been monkeying around with ARM-based servers for years now, and Cumings says that the DCS engineers took another stab at it with the latest rev of ARM chips from unnamed suppliers in the past year – just to keep up to speed on what is possible with these ARM chips in terms of performance and thermals.
“If we see a market develop, we are ready to go,” says Cumings. “We have done a good job in seeing potential markets coming. But we are not shipping a product now and I can’t tell you when we might.”
The DCS engineers have done some research on many-core processors from Tilera, but Cumings concedes that Dell has “spent less time looking at this”.
That said, when it comes to the DCS classic designs, Cumings says that Dell will “look at any technology that will solve a customer’s problem”.
That could mean field programmable gate arrays, GPU coprocessors, and all kinds of weird stuff someday. (That’s El Reg speculating, not Dell DCS talking.)
For now, the DCS unit is focused on five key markets: hyperscale web, big data, cloud, hosting, and high performance computing. The latest projections from IDC show CPU shipments in the cloud segment to be growing at a compound annual growth rate of 15 percent between 2012 and 2015, with HPC growing 7.3 per cent compared to shrinkage of 1.2 per cent in the traditional, general purpose server space. Cumings says that Dell’s DCS business in the five key growth areas is larger than the sub-markets at large, and won’t be specific about how much larger because this is a competitive advantage for Dell.
“The growth is significant enough that DCS was created with its own resources to chase the opportunity,” says Cumings. “Dell believes in this business and it continues to grow.”
This means we won’t hear about any defections from Facebook or any other name-brand hyperscale customers who don’t talk about where they get their servers.
Exascale by 2018: Crazy ...or possible? • The Register
A short history of supercomputing and the race to exascale computing.
I recently saw some estimates that show we should hit exascale supercomputer performance by around 2018. That seems a bit ambitious – if not stunningly optimistic – and the search to get some perspective led me on an hours-long meander through supercomputing history, plus what I like to call “Fun With Spreadsheets.”
Right now the fastest super is Fujitsu’s K system, which pegs the Flop-O-Meter at a whopping 10.51 petaflops. Looking at my watch, I notice that we’re barely into 2012; this gives the industry another six years or so to attain 990 more petaflops worth of performance and bring us to the exascale promised land.
This implies an increase in performance of around 115% per year over the next six years. Is this possible? Let’s take a trip in the way-back machine…
Just getting to megaflop performance took from the beginning of recorded history until 1964. If we start the clock with the Xia Dynasty at 2,000 BC, this means it took us 3,964 years to get from nothing to megaflops. This is a pretty meager rate of increase, probably somewhere around 0.17 per cent a year, but you have to factor in that everyone was busy fighting, exploring, coming up with new kinds of hats, and inventing the Morris Dance.
The first megaflop system, the Seymour Cray-designed Control Data CDC 6600, was delivered in 1964. It was a breakthrough in a number of ways: the first system to use newly-invented silicon-based processors, the first RISC-based CPU, and the first to use additional (but simpler) assist processors, called ‘peripheral processors,’ to handle I/O and feed tasks to the CPU. This was game-changing technology.
The transition from megaflop to gigaflop performance took only another 21 years with the introduction of the Cray-2, which hit the market in 1985. Seymour Cray broke away from Control Data in 1972 to start his own shop, Cray Research Inc. The Cray-2 delivered 1.9 gflops peak performance by extensively using integrated circuits (early use of modular building blocks), multiple processors (four units), and innovative full-immersion liquid cooling to handle the massive heat load. In its time, it was also game-changing technology. The Cray-2 was also highly stylish, with a futuristic design complemented by blue, red, or yellow panels. Here’s a PDF of a brochure covering the Cray-2.
Fast-forward another 11 years and we see the first system to sustain teraflop performance, the Intel-based ASCI Red system, which was also a big break from past supercomputer designs. Installed at Sandia National Lab in 1996, it’s an example of what we’ve come to expect from modern supercomputers with 9,298 Intel Pentium processors, a terabyte of RAM, and air cooling.
The compound annual performance growth rate (CAGR) for this move from gflop to tflop (another thousand-fold increase) is roughly 87.5 per cent per year, which won’t get us to exascale until midway through 2019 (just in time for the June Top500 list, I’d expect). Not too far off of the 2018 prediction, however.
Twelve years later, in 2008, the first petaflop (the IBM Roadrunner) system debuted. Achieving another 1000-fold performance increase in 12 years is equivalent to a 78 per cent compound annual growth rate. This is way faster than Moore’s Law, which has an implied CAGR of around 60%, but a little slower than the previous move from giga to teraflops. At this growth rate, we’ll reach exascale in 2020 – probably late in the year, but it might make the November 2020 Top500 list.
A mere three years after that, the K computer hit 10.51 pflops performance. The performance growth rate from Roadrunner to K? 116 per cent CAGR, which is almost exactly the roughly 115 per cent a year needed to deliver exascale by 2018.
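For anyone who wants to redo the “Fun With Spreadsheets” part, here is a minimal Python sketch of the growth-rate arithmetic. It assumes clean 1,000-fold steps between the milestone years named in the text (1985, 1996, 2008) rather than the exact peak figures of each machine, so the results will only roughly match the percentages quoted; the function names are my own.

```python
import math

def cagr(start, end, years):
    """Compound annual growth rate as a fraction (1.0 means 100% a year)."""
    return (end / start) ** (1.0 / years) - 1.0

def years_to_reach(start, target, rate):
    """Years needed to grow from start to target at a constant annual rate."""
    return math.log(target / start) / math.log(1.0 + rate)

# Assumption: each transition is treated as an exact 1,000x step in flops.
giga_to_tera = cagr(1e9, 1e12, 1996 - 1985)   # Cray-2 era to ASCI Red
tera_to_peta = cagr(1e12, 1e15, 2008 - 1996)  # ASCI Red to Roadrunner

# Rate needed to go from the K system (10.51 petaflops) to exascale in six years.
needed_for_2018 = cagr(10.51e15, 1e18, 6)

# How long exascale takes from the K system at the slower historical rate.
years_at_peta_rate = years_to_reach(10.51e15, 1e18, tera_to_peta)

for label, rate in [
    ("giga -> tera", giga_to_tera),
    ("tera -> peta", tera_to_peta),
    ("needed for exascale in six years", needed_for_2018),
]:
    print(f"{label}: {rate * 100:.1f}% per year")
print(f"years from K to exascale at the tera -> peta rate: {years_at_peta_rate:.1f}")
```

The same two functions cover every estimate in this section: feed in a different start year or a different historical rate to see how sensitive the exascale date is to the assumption.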
Does this mean that we’ll see exascale systems in 2018 or even 2020? No, it doesn’t; it’s merely another data point in handicapping the race. This analysis simply looks at timelines; it ignores the problems inherent in housing, powering, and cooling a system that’s 1,000x faster than the current top performer, which sports more than 80,000 compute nodes, 700,000 processing cores, and uses enough power to run 12,000 households before they all get electric cars.
The technology challenges are mind-boggling, and it’s clear that simply applying ‘smaller but faster’ versions of today’s technology won’t get us over the exascale hump. It’s going to take some technology breakthroughs and new approaches. Even with these hurdles, I’m betting that we’ll see exascale performance before the end of 2020, putting us right in line with previous transitions.
But all bets are off if the Mayan prediction of global destruction in December of 2012 turns out to be true. In that case, I reserve the right to change my bet to the year 5976 – which is 2012 AD plus the 3,964 years it took us to get to megaflops. Seems like a safe enough hedge to me.
10 Virtualization and Cloud Predictions for 2012 | Andi Mann – Übergeek
Some predictions on cloud computing - with a shout out to the mainframe.
1. Brands May Come and Go – But No Technology Will Die
Not only are we not living in a ‘post-PC’ world, we are not even living in a ‘post-mainframe’ world! Cloud will not kill data centers, virtual will not kill physical, tablets will not kill PCs, Mac will not kill Windows, Android will not kill iOS, streaming will not kill DVDs. The technology pie is growing, our choices are expanding, and almost every slice is getting bigger. So be prepared to manage an ever-increasing selection of technologies across public and private boundaries.
2. Hybrid IT Will Be ‘The Next Big Thing’
‘Hybrid cloud’ was soooo 2011! In this new world of choices, business will expect hybrid IT: a combination of on-site and off-site; cloud and legacy; private and public; physical and virtual; social and secure; enterprise and consumer; desktop and server; mobile and static. Business will also expect IT to make them work together, whether IT owns the service or not. IT must act as a trusted advisor, as a service broker, and as quality assurance for this brave new world of complex Hybrid IT.
3. Service Quality Will Be IT’s Responsibility Again
As hybrid IT proliferates, business owners will (again) realize they do not want to manage technology; they just want it to work. In 2012, end users will increasingly expect IT to take responsibility for service quality, regardless of who is buying, selling, or delivering that service. IT will need to eliminate the blind spots in hybrid IT, actively support an explosion of devices, deal with complex cross-boundary services, and find a way to deliver a 360-degree service assurance across all facets of end-user experience.
4. Public Cloud Adoption Will Slow
Given the results of this year’s Longhaus research from Australia – an early adopter market and a bellwether for business technology – I suspect the rest of the world is in for a slowdown of public cloud adoption. Issues (perceived or real) with security, compliance, service quality, skills, staffing, complexity, and good old politics will all put the brakes on. Whether ‘cloud stall’ will be as pronounced as ‘virtual stall’ is unsure, but 2012 will see a marked slowdown in public cloud adoption.
5. Public Cloud ‘Gets’ Security
Sad but true – many (most?) enterprise decision-makers still do not trust public cloud. In 2012, IT must do a better job of deploying and explaining cloud security – and I believe we will! In 2012, CIOs will see security as less of a barrier to cloud adoption as organizations adopt more and better cloud-oriented security solutions – including solutions designed for complex hybrid cloud services, as well as solutions that are delivered through the cloud with easily-consumed Security SaaS options.
6. Big Iron is Back – Part I
No, mainframe is still not dead. On the contrary, 2012 will see the rise of the mainframe as a *gasp* cloud platform. Massively scalable, hosting critical (and underutilized) ‘big data’, capable of running complex cloud workloads on a variety of architectures (z/OS, Linux, UNIX, Windows), mainframe is really an obvious cloud platform. It will not replace commodity clouds, but large enterprises and governments especially will leverage their investments and bring big iron into their cloud mix.
7. Cloud Gets Heterogeneous
Not only will mainframe become part of the cloud landscape, but public cloud providers will also start to offer UNIX and maybe even other non-x86 platforms. I have recently seen this in action (CA did it internally years ago), and most large enterprises are heavily dependent on heterogeneous systems for their mission-critical applications. Despite the common myth that cloud == commodity servers, heterogeneous servers will start to become more available for large enterprise deployments.
8. Big Iron is Back – Part II
Big iron concepts of integrated compute, network, and storage are resurgent – but this is not your grandpa’s mainframe. Deployment of integrated fabrics like Cisco UCS and VCE Vblock will accelerate rapidly in 2012 as IT changes the way it thinks about integrated infrastructure for virtualization and cloud – and realizes how amazing these integrated boxes are for diverse, dynamic, high-volume workloads like desktop virtualization, pop-up data centers, and cloudbursting.
9. ‘Grown-up’ Cloud Service Management Comes To The Forefront
In 2011, the NIST Cloud Reference Architecture devoted a whole section to ‘Cloud Service Management’, and IT started to talk about ‘grown-up’ disciplines – planning, budgeting, performance, asset, inventory, service levels, audit, etc. In 2012, even ‘commodity’ cloud vendors will finally take cloud management seriously, as enterprises and governments demand these disciplines – and smaller providers differentiate on service and security, not just price.
10. Virtualization Management Becomes Irrelevant
In January 2009 I predicted, “in 3-5 years … niche [Virtual System Management] vendors will no longer survive, as virtualization becomes a core part of the enterprise compute fabric.” Three years later this trend has definitely started, and will accelerate in 2012 as IT turns instead to hybrid IT management, recognizing that silos of standalone virtualization management are a costly and inefficient burden. Maybe 2012 is not the end of Virtualization Management, but it is going to be the start of the demise.
High Scalability - Tumblr Architecture - 15 Billion Page Views a Month and Harder to Scale than Twitter
A technical look at the infrastructure that supports Tumblr and how it is designed to scale - very interesting as a real world case study.
With over 15 billion page views a month Tumblr has become an insanely popular blogging platform. Users may like Tumblr for its simplicity, its beauty, its strong focus on user experience, or its friendly and engaged community, but like it they do.
Growing at over 30% a month has not been without challenges. Some reliability problems among them. It helps to realize that Tumblr operates at surprisingly huge scales: 500 million page views a day, a peak rate of ~40k requests per second, ~3TB of new data to store a day, all running on 1000+ servers.
One of the common patterns across successful startups is the perilous chasm crossing from startup to wildly successful startup. Finding people, evolving infrastructures, servicing old infrastructures, while handling huge month over month increases in traffic, all with only four engineers, means you have to make difficult choices about what to work on. This was Tumblr’s situation. Now with twenty engineers there’s enough energy to work on issues and develop some very interesting solutions.
Tumblr started as a fairly typical large LAMP application. The direction they are moving in now is towards a distributed services model built around Scala, HBase, Redis, Kafka, Finagle, and an intriguing cell based architecture for powering their Dashboard. Effort is now going into fixing short term problems in their PHP application, pulling things out, and doing it right using services.
The theme at Tumblr is transition at massive scale. Transition from a LAMP stack to a somewhat bleeding edge stack. Transition from a small startup team to a fully armed and ready development team churning out new features and infrastructure. To help us understand how Tumblr is living this theme is startup veteran Blake Matheny, Distributed Systems Engineer at Tumblr. Here’s what Blake has to say about the House of Tumblr:
Site: http://www.tumblr.com/
Stats
- 500 million page views a day
- 15B+ page views a month
- ~20 engineers
- Peak rate of ~40k requests per second
- 1+ TB/day into Hadoop cluster
- Many TB/day into MySQL/HBase/Redis/Memcache
- Growing at 30% a month
- ~1000 hardware nodes in production
- Billions of page visits per month per engineer
- Posts are about 50GB a day. Follower list updates are about 2.7TB a day.
- Dashboard runs at a million writes a second, 50K reads a second, and it is growing.
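A quick back-of-the-envelope check on these numbers, as a minimal Python sketch. It uses only the figures listed above; treating growth as a steady 30% compounded monthly is my assumption, and note that page views and requests are not the same unit, so the peak-to-average ratio is only indicative.

```python
import math

PAGE_VIEWS_PER_DAY = 500_000_000  # "500 million page views a day"
PEAK_REQUESTS_PER_SEC = 40_000    # "Peak rate of ~40k requests per second"
MONTHLY_GROWTH = 0.30             # "Growing at 30% a month"

# Average page views per second over a 24-hour day.
avg_views_per_sec = PAGE_VIEWS_PER_DAY / 86_400

# Rough ratio of peak request rate to average page-view rate.
peak_to_avg = PEAK_REQUESTS_PER_SEC / avg_views_per_sec

# Assumption: growth compounds at a constant monthly rate.
doubling_months = math.log(2) / math.log(1 + MONTHLY_GROWTH)
yearly_multiplier = (1 + MONTHLY_GROWTH) ** 12

print(f"average page views per second: {avg_views_per_sec:,.0f}")
print(f"peak requests vs average page views: {peak_to_avg:.1f}x")
print(f"traffic doubles roughly every {doubling_months:.1f} months")
print(f"a year of 30% monthly growth multiplies traffic by {yearly_multiplier:.1f}")
```

Seen this way, the growth figure is the scary one: capacity planned for today's load is half of what is needed within a single quarter.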
Software
- OS X for development, Linux (CentOS, Scientific) in production
- Apache
- PHP, Scala, Ruby
- Redis, HBase, MySQL
- Varnish, HA-Proxy, nginx
- Memcache, Gearman, Kafka, Kestrel, Finagle
- Thrift, HTTP
- Func - a secure, scriptable remote control framework and API
- Git, Capistrano, Puppet, Jenkins
Hardware
- 500 web servers
- 200 database servers (many of these are part of a spare pool we pulled from for failures)
- 47 pools
- 30 shards
- 30 memcache servers
- 22 redis servers
- 15 varnish servers
- 25 haproxy nodes
- 8 nginx
- 14 job queue servers (kestrel + gearman)
Architecture
- Tumblr has a different usage pattern than other social networks.
- With 50+ million posts a day, an average post goes to many hundreds of people. It’s not just one or two users that have millions of followers. The graph for Tumblr users has hundreds of followers. This is different than any other social network and is what makes Tumblr so challenging to scale.
- #2 social network in terms of time spent by users. The content is engaging. It’s images and videos. The posts aren’t byte sized. They aren’t all long form, but they have the ability. People write in-depth content that’s worth reading so people stay for hours.
- Users form a connection with other users so they will go hundreds of pages back into the dashboard to read content. Other social networks are just a stream that you sample.
- Implication is that given the number of users, the average reach of the users, and the high posting activity of the users, there is a huge amount of updates to handle.
- Tumblr runs in one colocation site. Designs are keeping geographical distribution in mind for the future.
- Two components to Tumblr as a platform: public Tumblelogs and Dashboard
- Public Tumblelog is what the public deals with in terms of a blog. Easy to cache as it’s not that dynamic.
- Dashboard is similar to the Twitter timeline. Users follow real-time updates from all the users they follow.
- Very different scaling characteristics than the blogs. Caching isn’t as useful because every request is different, especially with active followers.
- Needs to be real-time and consistent. Should not show stale data. And it’s a lot of data to deal with. Posts are only about 50GB a day. Follower list updates are 2.7TB a day. Media is all stored on S3.
- Most users leverage Tumblr as a tool for consuming content. Of the 500+ million page views a day, 70% of that is for the Dashboard.
- Dashboard availability has been quite good. Tumblelog hasn’t been as good because they have a legacy infrastructure that has been hard to migrate away from. With a small team they had to pick and choose what they addressed for scaling issues.
Old Tumblr
- When the company started on Rackspace it gave each custom domain blog an A record. When they outgrew Rackspace there were too many users to migrate. This is 2007. They still have custom domains on Rackspace. They route through Rackspace back to their colo space using HAProxy and Varnish. Lots of legacy issues like this.
- A traditional LAMP progression.
- Historically developed with PHP. Nearly every engineer programs in PHP.
- Started with a web server, database server and a PHP application and started growing from there.
- To scale they started using memcache, then put in front-end caching, then HAProxy in front of the caches, then MySQL sharding. MySQL sharding has been hugely helpful.
- Use a squeeze-everything-out-of-a-single-server approach. In the past year they’ve developed a couple of backend services in C: an ID generator and Staircar, using Redis to power Dashboard notifications.
- The Dashboard uses a scatter-gather approach. Events are displayed when a user accesses their Dashboard. Events for the users you follow are pulled and displayed. This will scale for another 6 months. Since the data is time ordered, sharding schemes don’t work particularly well.
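To make the scatter-gather idea concrete, here is a minimal Python sketch. It is not Tumblr's code: the `EVENTS` dictionary is a hypothetical in-memory stand-in for per-user storage, and `fetch_events` and `dashboard` are names invented for illustration. The point is only the shape of the read path: pull each followed user's time-ordered events, then merge them at request time.

```python
import heapq
from itertools import islice

# Hypothetical stand-in for storage: each user's events, newest first,
# stored as (timestamp, user, post_type).
EVENTS = {
    "alice": [(105, "alice", "photo"), (100, "alice", "quote")],
    "bob": [(103, "bob", "video")],
    "carol": [(104, "carol", "link"), (101, "carol", "text")],
}

def fetch_events(user):
    """Scatter step: read one followed user's events (already newest first)."""
    return EVENTS.get(user, [])

def dashboard(following, limit=3):
    """Gather step: merge the per-user streams into one newest-first page."""
    streams = [fetch_events(user) for user in following]
    # heapq.merge lazily merges inputs that are already sorted; with
    # reverse=True it expects each input in descending order.
    merged = heapq.merge(*streams, key=lambda event: event[0], reverse=True)
    return list(islice(merged, limit))

for event in dashboard(["alice", "bob", "carol"]):
    print(event)
```

The cost of this design is visible in the sketch: every Dashboard view touches one stream per followed user, so read work grows with the size of the follow list.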
New Tumblr
- Changed to a JVM centric approach for hiring and speed of development reasons.
- Goal is to move everything out of the PHP app into services and make the app a thin layer over services that does request authentication, presentation, etc.
- Scala and Finagle Selection
- Internally they had a lot of people with Ruby and PHP experience, so Scala was appealing.
- Finagle was a compelling factor in choosing Scala. It is a library from Twitter. It handles most of the distributed issues like distributed tracing, service discovery, and service registration. You don’t have to implement all this stuff. It just comes for free.
- Once on the JVM Finagle provided all the primitives they needed (Thrift, ZooKeeper, etc).
- Finagle is being used by Foursquare and Twitter. Scala is also being used by Meetup.
- Liked the Thrift application interface. It has really good performance.
- Liked Netty, but wanted out of Java, so Scala was a good choice.
- Picked Finagle because it was cool, knew some of the guys, it worked without a lot of networking code and did all the work needed in a distributed system.
- Node.js wasn’t selected because it is easier to scale the team with a JVM base. Node.js isn’t developed enough to have standards, best practices, and a large volume of well tested code, and there’s not a lot of knowledge of how to use it in a scalable way. They target 5ms response times, 4 9s HA, 40K requests per second and some at 400K requests per second. With Scala you can use all the Java code, and there’s a lot in the Java ecosystem they can leverage.
- Internal services are being shifted from being C/libevent based to being Scala/Finagle based.
- Newer, non-relational data stores like HBase and Redis are being used, but the bulk of their data is currently stored in a heavily partitioned MySQL architecture. Not replacing MySQL with HBase.
- HBase backs their URL shortener with billions of URLs and all the historical data and analytics. It has been rock solid. HBase is used in situations with high write requirements, like a million writes a second for the Dashboard replacement. HBase wasn’t deployed instead of MySQL because they couldn’t bet the business on HBase with the people that they had, so they started using it with smaller less critical path projects to gain experience.
- Problem with MySQL and sharding for time series data is one shard is always really hot. Also ran into read replication lag due to insert concurrency on the slaves.
- Created a common services framework.
- Spent a lot of time upfront solving operations problem of how to manage a distributed system.
- Built a kind of Rails scaffolding, but for services. A template is used to bootstrap services internally.
- All services look identical from an operations perspective. Checking statistics, monitoring, starting and stopping all work the same way for all services.
- Tooling is put around the build process in SBT (a Scala build tool) using plugins and helpers to take care of common activities like tagging things in git, publishing to the repository, etc. Most developers don’t have to get in the guts of the build system.
- Front-end layer uses HAProxy. Varnish might be hit for public blogs. 40 machines.
- 500 web servers running Apache and their PHP application.
- 200 database servers. Many database servers are used for high availability reasons. Commodity hardware is used and the MTBF is surprisingly low. Much more hardware than expected is lost so there are many spares in case of failure.
- 6 backend services to support the PHP application. A team is dedicated to develop the backend services. A new service is rolled out every 2-3 weeks. Includes dashboard notifications, dashboard secondary index, URL shortener, and a memcache proxy to handle transparent sharding.
- Put a lot of time and effort and tooling into MySQL sharding. MongoDB is not used even though it is popular in NY (their location). MySQL can scale just fine.
- Gearman, a job queue system, is used for long running fire and forget type work.
- Availability is measured in terms of reach. Can a user reach custom domains or the dashboard? Also in terms of error rate.
- Historically the highest priority item is fixed. Now failure modes are analyzed and addressed systematically. Intention is to measure success from a user perspective and an application perspective. If part of a request can’t be fulfilled that is accounted for.
- Initially an Actor model was used with Finagle, but that was dropped. For fire and forget work a job queue is used. In addition, Twitter’s utility library contains a Futures implementation and services are implemented in terms of futures. In the situations when a thread pool is needed futures are passed into a future pool. Everything is submitted to the future pool for asynchronous execution.
- Scala encourages no shared state. Finagle is assumed correct because it’s tested by Twitter in production. Mutable state is avoided using constructs in Scala or Finagle. No long running state machines are used. State is pulled from the database, used, and written back to the database. Advantage is developers don’t need to worry about threads or locks.
- 22 Redis servers. Each server has 8 - 32 instances so 100s of Redis instances are used in production.
- Used for backend storage for dashboard notifications.
- A notification is something like a user liked your post. Notifications show up in a user’s dashboard to indicate actions other users have taken on their content.
- High write ratio made MySQL a poor fit.
- Notifications are ephemeral so it wouldn’t be horrible if they were dropped, so Redis was an acceptable choice for this function.
- Gave them a chance to learn about Redis and get familiar with how it works.
- Redis has been completely problem free and the community is great.
- A Scala futures based interface for Redis was created. This functionality is now moving into their Cell Architecture.
- URL shortener uses Redis as the first level cache and HBase as permanent storage.
- Dashboard’s secondary index is built around Redis.
- Redis is used as Gearman’s persistence layer using a memcache proxy built using Finagle.
- Slowly moving from memcache to Redis. Would like to eventually settle on just one caching service. Performance is on par with memcache.
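The two-tier lookup described for the URL shortener (Redis as the first level cache, HBase as permanent storage) can be pictured as a read-through cache. What follows is only a minimal sketch under stated assumptions, not Tumblr's code: plain dictionaries stand in for Redis and HBase, and the class name `ShortUrlStore` and its methods are hypothetical.

```python
class ShortUrlStore:
    """Read-through cache: check the fast tier first, fall back to the permanent tier."""

    def __init__(self):
        self.cache = {}      # stands in for Redis (first level cache)
        self.permanent = {}  # stands in for HBase (permanent storage)
        self.cache_hits = 0
        self.cache_misses = 0

    def save(self, short_code, long_url):
        # The permanent store is the source of truth; the cache is filled on read.
        self.permanent[short_code] = long_url

    def resolve(self, short_code):
        if short_code in self.cache:
            self.cache_hits += 1
            return self.cache[short_code]
        self.cache_misses += 1
        long_url = self.permanent.get(short_code)
        if long_url is not None:
            self.cache[short_code] = long_url  # warm the cache for later reads
        return long_url


store = ShortUrlStore()
store.save("abc", "http://example.com/a-long-post-url")
first = store.resolve("abc")   # miss: served from the permanent tier
second = store.resolve("abc")  # hit: served from the cache
```

Losing the cache tier in this arrangement costs latency rather than data, which fits the notes' point that Redis was first adopted for roles where dropped data would not be horrible.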
Internal Firehose
- Internally applications need access to the activity stream. An activity stream is information about users creating/deleting posts, liking/unliking posts, etc. A challenge is to distribute so much data in real-time. Wanted something that would scale internally and that an application ecosystem could reliably grow around. A central point of distribution was needed.
- Previously this information was distributed using Scribe/Hadoop. Services would log into Scribe and begin tailing and then pipe that data into an app. This model stopped scaling almost immediately, especially at peak where people are creating 1000s of posts a second. Didn’t want people tailing files and piping to grep.
- An internal firehose was created as a message bus. Services and applications talk to the firehose via Thrift.
- LinkedIn’s Kafka is used to store messages. Internally consumers use an HTTP stream to read from the firehose. MySQL wasn’t used because the sharding implementation is changing frequently so hitting it with a huge data stream is not a good idea.
- The firehose model is very flexible, not like Twitter’s firehose in which data is assumed to be lost.
- The firehose stream can be rewound in time. It retains a week of data. On connection it’s possible to specify the point in time to start reading.
- Multiple clients can connect and each client won’t see duplicate data. Each client has a client ID. Kafka supports a consumer group idea. Each consumer in a consumer group gets its own messages and won’t see duplicates. Multiple clients can be created using the same consumer ID and clients won’t see duplicate data. This allows data to be processed independently and in parallel. Kafka uses ZooKeeper to periodically checkpoint how far a consumer has read.
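The rewind and consumer group behaviour described above can be sketched with an append-only log and one read offset per group. This is a toy model under stated assumptions, not Kafka's API: the `Firehose` class and its methods are hypothetical, offsets live in a dictionary rather than being checkpointed to ZooKeeper, and the week-long retention window is not modelled.

```python
import bisect


class Firehose:
    """Append-only log with time-based rewind and per-group read offsets."""

    def __init__(self):
        self.timestamps = []  # publish times, assumed non-decreasing
        self.messages = []
        self.offsets = {}     # consumer group ID -> index of the next message to read

    def publish(self, timestamp, message):
        self.timestamps.append(timestamp)
        self.messages.append(message)

    def connect(self, group_id, start_time=None):
        # With a start time, rewind the group to the first message at or after it.
        # Without one, resume from the group's checkpoint (or the start of the log).
        if start_time is not None:
            self.offsets[group_id] = bisect.bisect_left(self.timestamps, start_time)
        else:
            self.offsets.setdefault(group_id, 0)

    def poll(self, group_id):
        # Any client in the group may call poll; the shared offset means
        # each message is handed out once per group.
        offset = self.offsets[group_id]
        if offset >= len(self.messages):
            return None
        self.offsets[group_id] = offset + 1
        return self.messages[offset]


hose = Firehose()
for t, event in [(1, "post created"), (2, "post liked"), (3, "post deleted")]:
    hose.publish(t, event)

hose.connect("search-indexer")
client_a = hose.poll("search-indexer")  # two clients sharing one group ID
client_b = hose.poll("search-indexer")  # see different messages

hose.connect("replayer", start_time=2)  # a second group rewinds independently
replayed = hose.poll("replayer")
```

Because each group keeps its own offset, a new application can replay history without disturbing consumers that are already caught up.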
Cell Design for Dashboard Inbox
- The current scatter-gather model for providing Dashboard functionality has very limited runway. It won’t last much longer.
- The solution is to move to an inbox model implemented using a Cell Based Architecture that is similar to Facebook Messages.
- An inbox is the opposite of scatter-gather. A user’s dashboard, which is made up of posts from followed users and actions taken by other users, is logically stored together in time order.
- Solves the scatter gather problem because it’s an inbox. You just ask what is in the inbox so it’s less expensive than going to each user a user follows. This will scale for a very long time.
- Rewriting the Dashboard is difficult. The data has a distributed nature, but it has a transactional quality, it’s not OK for users to get partial updates.
- The amount of data is incredible. Messages must be delivered to hundreds of different users on average which is a very different problem than Facebook faces. Large data + high distribution rate + multiple datacenters.
- Spec’ed at a million writes a second and 50K reads a second. The data set grows by 2.7TB a day with no replication or compression turned on. The million writes a second is from the 24 byte row key that indicates what content is in the inbox.
- Doing this on an already popular application that has to be kept running.
- Cells
- A cell is a self-contained installation that has all the data for a range of users. All the data necessary to render a user’s Dashboard is in the cell.
- Users are mapped into cells. Many cells exist per data center.
- Each cell has an HBase cluster, service cluster, and Redis caching cluster.
- Users are homed to a cell and all cells consume all posts via firehose updates.
- Each cell is Finagle based and populates HBase via the firehose and service requests over Thrift.
- A user comes into the Dashboard, users home to a particular cell, a service node reads their dashboard via HBase, and passes the data back.
- Background tasks consume from the firehose to populate tables and process requests.
- A Redis caching layer is used for posts inside a cell.
- Request flow: a user publishes a post, the post is written to the firehose, all of the cells consume the post and write the post content to the post database, the cells look up whether any of the followers of the post creator are in the cell, and if so the follower inboxes are updated with the post ID.
- Advantages of cell design:
- Massive scale requires parallelization and parallelization requires components be isolated from each other so there is no interaction. Cells provide a unit of parallelization that can be adjusted to any size as the user base grows.
- Cells isolate failures. One cell failure does not impact other cells.
- Cells enable nice things like the ability to test upgrades, implement rolling upgrades, and test different versions of software.
- The key idea that is easy to miss is: all posts are replicated to all cells.
- Each cell stores a single copy of all posts. Each cell can completely satisfy a Dashboard rendering request. Applications don’t ask for all the post IDs and then ask for the posts for those IDs. It can return the dashboard content for the user. Every cell has all the data needed to fulfill a Dashboard request without doing any cross cell communication.
- Two HBase tables are used: one that stores a copy of each post. That data is small compared to the other table which stores every post ID for every user within that cell. The second table tells what the user’s dashboard looks like which means they don’t have to go fetch all the users a user is following. It also means across clients they’ll know if you read a post and viewing a post on a different device won’t mean you read the same content twice. With the inbox model state can be kept on what you’ve read.
- Posts are not put directly in the inbox because the size is too great. So the ID is put in the inbox and the post content is put in the cell just once. This model greatly reduces the storage needed while making it simple to return a time ordered view of a user’s inbox. The downside is each cell contains a complete copy of all posts. Surprisingly posts are smaller than the inbox mappings. Post growth per day is 50GB per cell, inbox grows at 2.7TB a day. Users consume more than they produce.
- A user’s dashboard doesn’t contain the text of a post, just post IDs, and the majority of the growth is in the IDs.
- As followers change the design is safe because all posts are already in the cell. If only follower posts were stored in a cell then the cell would be out of date as the followers changed and some sort of back fill process would be needed.
- An alternative design is to use a separate post cluster to store post text. The downside of this design is that if the cluster goes down it impacts the entire site. Using the cell design and post replication to all cells creates a very robust architecture.
- A user having millions of followers who are really active is handled by selectively materializing user feeds by their access model (see Feeding Frenzy).
- Different users have different access models and distribution models that are appropriate. Two different distribution modes: one for popular users and one for everyone else.
- Data is handled differently depending on the user type. Posts from active users wouldn’t actually be published, posts would be selectively materialized.
- Users who follow millions of users are treated similarly to users who have millions of followers.
- Cell size is hard to determine. The size of a cell determines the impact of a failure: the number of users homed to a cell is the number of users affected. There’s a tradeoff to make in what they are willing to accept for the user experience and how much it will cost.
- Reading from the firehose is the biggest network issue. Within a cell the network traffic is manageable.
- As more cells are added cells can be placed into a cell group that reads from the firehose and then replicates to all cells within the group. A hierarchical replication scheme. This will also aid in moving to multiple datacenters.
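The request flow and the two-table layout described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the actual implementation: dictionaries stand in for the two HBase tables, a plain function stands in for the firehose, the names `Cell`, `consume`, `render_dashboard` and `publish` are hypothetical, and showing the newest post first is an assumption about what "time ordered" means here.

```python
class Cell:
    """One cell: a full copy of every post plus inboxes for the users homed here."""

    def __init__(self, homed_users, follows):
        # follows maps a homed user -> set of authors that user follows
        self.homed_users = set(homed_users)
        self.follows = follows
        self.posts = {}                                   # post ID -> content, one copy per cell
        self.inboxes = {u: [] for u in self.homed_users}  # user -> post IDs, oldest first

    def consume(self, post_id, author, content):
        # Every cell stores every post, whether or not a follower lives here.
        self.posts[post_id] = content
        for user in self.homed_users:
            if author in self.follows.get(user, set()):
                self.inboxes[user].append(post_id)  # only the ID goes in the inbox

    def render_dashboard(self, user):
        # No cross-cell calls: the IDs and the content are both local.
        return [self.posts[pid] for pid in reversed(self.inboxes[user])]


def publish(cells, post_id, author, content):
    # Stands in for the firehose: every cell sees every post.
    for cell in cells:
        cell.consume(post_id, author, content)


cell_a = Cell(["alice"], {"alice": {"carol"}})
cell_b = Cell(["bob"], {"bob": {"dave"}})
cells = [cell_a, cell_b]
publish(cells, 1, "carol", "carol's first post")
publish(cells, 2, "dave", "dave's photo")
publish(cells, 3, "carol", "carol's second post")
```

If alice later follows dave, only her inbox needs new IDs, because dave's posts are already stored in her cell. That is the property the notes point to when they say the design stays safe as followers change.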
…
Software Deployment
- Started with a set of rsync scripts that distributed the PHP application everywhere. Once the number of machines reached 200 the system started having problems, deploys took a long time to finish and machines would be in various states of the deploy process.
- The next phase built the deploy process (development, staging, production) into their service stack using Capistrano. Worked for services on dozens of machines, but by connecting via SSH it started failing again when deploying to hundreds of machines.
- Now a piece of coordination software runs on all machines. Based around Func from RedHat, a lightweight API for issuing commands to hosts. Scaling is built into Func.
- Build deployment is over Func by saying do X on a set of hosts, which avoids SSH. Say you want to deploy software on group A. The master reaches out to a set of nodes and runs the deploy command.
- The deploy command is implemented via Capistrano. It can do a git checkout or pull from the repository. Easy to scale because they are talking HTTP. They like Capistrano because it supports simple directory based versioning that works well with their PHP app. Moving towards versioned updates, where each directory contains a SHA so it’s easy to check if a version is correct.
- The Func API is used to report back status, to say these machines have these software versions.
- Safe to restart any of their services because they’ll drain off connections and then restart.
- All features run in dark mode before activation.
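The idea of versioned release directories that carry a SHA can be illustrated as follows. This is a sketch under a loud assumption: the notes do not say what the SHA is computed over (it may well be a git commit ID), so here the directory is simply named after a SHA-1 of its own file contents. The helpers `content_sha` and `release_is_intact` are hypothetical.

```python
import hashlib
import os
import tempfile


def content_sha(directory):
    """Hash relative paths and file bytes in sorted order so the result is deterministic."""
    digest = hashlib.sha1()
    for root, dirs, files in os.walk(directory):
        dirs.sort()  # fix the traversal order
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(os.path.relpath(path, directory).encode("utf-8"))
            with open(path, "rb") as handle:
                digest.update(handle.read())
    return digest.hexdigest()


def release_is_intact(release_dir):
    """Assumes the release directory was named after its content hash at build time."""
    expected = os.path.basename(os.path.normpath(release_dir))
    return content_sha(release_dir) == expected


with tempfile.TemporaryDirectory() as releases:
    # Build a release, then name its directory after the hash of its content.
    staging = os.path.join(releases, "staging")
    os.mkdir(staging)
    with open(os.path.join(staging, "index.php"), "w") as handle:
        handle.write("<?php echo 'hello'; ?>")
    release = os.path.join(releases, content_sha(staging))
    os.rename(staging, release)
    intact_after_deploy = release_is_intact(release)

    # A stray edit on one host no longer matches the directory name.
    with open(os.path.join(release, "index.php"), "a") as handle:
        handle.write("// stray edit")
    intact_after_edit = release_is_intact(release)
```

A status report of the kind the notes describe could then reduce to each host returning its current directory name plus the result of this check.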
Development
- Started with the philosophy that anyone could use any tool that they wanted, but as the team grew that didn’t work. Onboarding new employees was very difficult, so they’ve standardized on a stack so they can get good with those, grow the team quickly, address production issues more quickly, and build up operations around them.
- Process is roughly Scrum like. Lightweight.
- Every developer has a preconfigured development machine. It gets updates via Puppet.
- Dev machines can roll changes, test, then roll out to staging, and then roll out to production.
- Developers use vim and Textmate.
- Testing is via code reviews for the PHP application.
- On the service side they’ve implemented a testing infrastructure with commit hooks, Jenkins, and continuous integration and build notifications.
…
Lessons learned
- Automation everywhere.
- MySQL (plus sharding) scales, apps don’t.
- Redis is amazing.
- Scala apps perform fantastically.
- Scrap projects when you aren’t sure if they will work.
- Don’t hire people based on their survival through a useless technological gauntlet. Hire them because they fit your team and can do the job.
- Select a stack that will help you hire the people you need.
- Build around the skills of your team.
- Read papers and blog posts. Key design ideas like the cell architecture and selective materialization were taken from elsewhere.
- Ask your peers. They talked to engineers from Facebook, Twitter, LinkedIn about their experiences and learned from them. You may not have access to this level, but reach out to somebody somewhere.
- Wade, don’t jump into technologies. They took pains to learn HBase and Redis before putting them into production by using them in pilot projects or in roles where the damage would be limited.
IBM’s Watson: A First Step in a New Era of Computing « A Smarter Planet Blog
A thoughtful piece by IBM’s David Ferrucci on the future of cognitive computing.
A year has passed since the Watson computer developed by my team at IBM Research defeated two all-time champions on the TV quiz show Jeopardy! A lot has happened since then. IBM launched a new business, IBM Watson Solutions, which is tasked with commercializing the technology. The Solutions team is developing versions of Watson for a number of industries, starting with healthcare and financial services. (Suggestions? Tweet to #WhatShouldWatsonDoNext?) Meanwhile, there’s plenty to do in IBM Research. We spent four years developing Watson for Jeopardy!, but that’s just the beginning of what Watson can become.
Watson is a first step in a new era of computing. There were two previous eras in the evolution of data processing machines: the tabulating era, which began in the late 1800s; and the computing era, which started in the 1940s. We’re now entering a period when machines will become increasingly capable of learning – graduating from moving bits around to understanding what they mean and how they apply to our lives. These machines will be ubiquitous. They’ll be extremely powerful. And they’ll utterly transform the relationships between humans and computers. No longer will computers be simply data processing devices. Think of them as intelligent machines.
This new era of computing is being enabled in part by epochal shifts in technology that we believe will enable people to make the planet smarter in every dimension. Because the world is increasingly instrumented, we can gather immense amounts of information about everything from climate change, to the way transportation systems interact with one another, to the changes in society caused by government actions. Because computing devices are interconnected, all of that data plus many of the digital communications between people can be combined in ways that turn it into useful information. Using analytics tools, we can explore our troves of information and understand better how the world works, predict what will happen next and make better decisions.
With the Jeopardy! experiment, we showed that we could teach Watson how to gather information, how to interpret it in context and how to share insights with humans. Watson 2.0 and 3.0 will continue to expand this ability to learn. A future Watson will learn by analyzing the huge reservoirs of knowledge captured in human language, drawing inferences and engaging with humans to expand and validate its knowledge. Watson will help us grapple with information overload by enabling people to absorb, integrate, evaluate and apply otherwise unimaginably large volumes of data. Think of it this way: A new era of computing will facilitate an intelligent dialogue between an individual and ALL other humans. These machines will make it possible for humans to collaborate in much more powerful ways than they can today.
Here’s a scenario that helps me picture the role that intelligent machines could play in society a decade or so from now: Today, governmental leaders in democratic societies make their decisions based on the best information they can gather, their own beliefs and political calculations. The problem is, the systems they deal with–everything from healthcare to national defense–are not only extremely complicated, but they evolve over time. These systems are beyond the ability of one person or even teams of people to understand, manage and predict. So we’re on the defensive. We are in reactive mode — waiting for things to break and then responding to local disruptions and often missing the big picture.
As a result, leaders and decision makers lack the information and analysis necessary to know what is truly the best course of action or the best policy for their city or nation. They end up making decisions that are primarily weighted towards personal beliefs and political considerations.
But what if they knew much more? What if computers could gather all manner of data about the complex systems of society, digest it, and make rational and well-documented predictions about the consequences of particular actions? Also, imagine that this data is open and available to all citizens where they can explore the implications of different policies as if exploring a simulated world.
In this scenario, leaders will have a crisp and more transparent understanding of what the best decision or policies might be. Every citizen can easily explore the simulated outcomes of well rationalized scenarios rather than rely on sound bites or politically motivated guesses. Leaders will be under tremendous pressure to do what makes the most sense. In this way, intelligent computer systems can help us control our collective destiny.
There’s a lot of work to be done by scientists to get from where we are today to a future where intelligent machines can help transform the way societies, governments and businesses operate. At IBM, we’re improving the Watson technology in four key dimensions. We’ll extend the information that Watson understands from specific questions to problem-solving scenarios. We’ll shift from simple question-and-answer interactions with humans to rich conversations. We’ll enhance Watson’s ability to explain its results. And, finally, we’ll change how Watson learns. Instead of depending on human programs to feed it information in batches, Watson will be able to gather information continuously and learn deeply about specific domains of knowledge.
It was a thrill to lead the team that created a software program that beat very smart humans at Jeopardy! But it’s even more thrilling to lead the team into a new era of computing.
EMC crashes the server flash party • The Register
EMC shook things up a bit last week when they introduced a new, very smart flash storage solution - VFCache. It caches “hot” data (important information that is needed again and again) from a storage array so that servers can access it extremely quickly and efficiently - until it’s not needed.
The perfect server flash storm hitting storage arrays has generated EMC’s well-signalled Lightning strike; VFCache has arrived, extending FAST technology from the array to the server. Project Thunder is following close behind, promising an EMC server-networked flash array.
This is a major announcement and we are covering it in depth.
FAST (Fully-Automated Storage Tiering) moves data in an EMC array into higher-speed storage tiers when it is being accessed repeatedly and server applications don’t want to wait for slow disks to find their data.
EMC boasts that its customers have purchased 1.3EB of FAST-enabled capacity since January 2010, and it has shipped more than 24PB of flash drive capacity, more than any other storage vendor. Times have changed and flash in the array is no longer enough.
Fusion-io has attacked the slow I/O problem by building server PCIe bus-connected flash memory cards holding 10TB or more of NAND flash, and giving applications microsecond-class access to random data instead of the milliseconds needed from a networked array. The threat here is that primary data could move from networked arrays into direct-attached server flash storage.
EMC’s response is to put hot data from its arrays into a VFCache (Virtual Flash Cache) solid state drive in VMware, Windows or RedHat Linux X86 servers from Cisco, Dell, HP and IBM. This provides random read access performance equivalent to Fusion-io once the cache has warmed up and is loaded with hot data.
VFCache is a 20-300GB PCIe-connected flash memory card, using, as rumoured, a Micron SLC card (P320h we think) or LSI WarpDrive SLC flash, Micron being the primary supplier. (Micron has just tragically lost its CEO, Steve Appleton, in a plane crash after winning what must have been a hotly-desired OEM contract.) The P320h is a fast flash card, doing 750,000 random 4K block read IOPS.
The EMC cache increases 4KB - 64KB block random read I/O speed but not write I/O speed. VFCache will not cache read I/Os larger than 64KB. There is no write caching.
We’re told “testing in an Oracle environment showed [an] up to 3X throughput improvement and 50 per cent reduction in latency.” EMC asserts that “VFCache is the fastest PCIe server Flash caching solution available today.” This does not necessarily mean it is faster than Fusion-io’s server solid state storage; that is not a “caching solution” in the way VFCache is.
Storage array and cache interoperability
EMC says VFCache works with EMC VMAX, VMAXe, VNX and VNXe array FAST. Does VFCache only work with EMC VMAX and VNX arrays? No, indeed not; VFCache is storage-agnostic and will work with all 4Gbit/s and 8Gbit/s Fibre Channel-connected block storage. No change is needed in the back-end block arrays.
We’re told that, by working in conjunction with EMC FAST on the storage array, VFCache offers coordinated caching between the server and the array. How does this work? EMC says VFCache’s caching algorithms promote the most frequently referenced data into the cache. Okay, but this isn’t co-ordinated caching between the array and the server. This is VFCache doing its own caching on the server irrespective of whatever caching the array is doing. For example, EMC doesn’t say the array will not cache data that VFCache is caching.
There appears to be no active interaction between VFCache and array FAST at all, EMC saying VFCache is transparent to storage, application, and user.
With writes, the VFCache driver writes data to the array LUN and, when that completes, write data is asynchronously written to the flash cache. It appears the back-end array is not involved in managing VFCache at all; in fact, it doesn’t even know VFCache exists.
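The read and write paths described above amount to a write-through read cache, and a toy model makes the behaviour concrete. This is only a sketch under stated assumptions, not EMC's driver: dictionaries stand in for the array LUN and the flash card, the class `ServerFlashCache` is hypothetical, blocks are promoted on first read rather than by reference frequency, the cache update after a write happens inline rather than asynchronously, and applying the 4KB - 64KB limit to data cached after a write is an assumption.

```python
KB = 1024
MIN_CACHEABLE = 4 * KB
MAX_CACHEABLE = 64 * KB


class ServerFlashCache:
    """Read cache in front of an array; writes always go to the array first."""

    def __init__(self, array):
        self.array = array  # stands in for the back-end LUN: block address -> bytes
        self.flash = {}     # stands in for the PCIe flash card
        self.array_reads = 0

    def read(self, address):
        if address in self.flash:
            return self.flash[address]  # cache hit, no trip to the array
        self.array_reads += 1
        data = self.array[address]
        if MIN_CACHEABLE <= len(data) <= MAX_CACHEABLE:
            self.flash[address] = data  # promote only reads in the cacheable range
        return data

    def write(self, address, data):
        self.array[address] = data  # the write completes on the array first
        # Simplification: the cache copy is refreshed inline, not asynchronously.
        if MIN_CACHEABLE <= len(data) <= MAX_CACHEABLE:
            self.flash[address] = data
        else:
            self.flash.pop(address, None)  # never leave a stale copy behind


array = {0: b"a" * (8 * KB), 1: b"b" * (128 * KB)}
cache = ServerFlashCache(array)
cache.read(0)
cache.read(0)  # second read of the 8KB block is a cache hit
cache.read(1)
cache.read(1)  # the 128KB block is too large to cache, so both reads hit the array
cache.write(0, b"c" * (8 * KB))
```

The array never has to know the cache exists, since every write lands on it before the cache is touched. That matches the observation that VFCache is transparent to the storage behind it.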
Limitations and futures
VFCache has to be disabled and removed for vMotion to take place, it being a local resource for the virtual machine. If it’s being used, it’s not possible to configure automatic ESX server failover, to use things like vCenter Site Recovery Manager with it, or to use it in a cluster that uses vMotion to balance workloads.
The VFCache card can have separated off DAS partitions for server app use but data loaded into them is not written to the back-end array. This is called split-card mode. It should only be used for temporary data, stuff that doesn’t need safeguarding by being written to the back-end array.
EMC will add deduplication to VFCache later this year, increasing its effective capacity. It does not say where the deduplication will be done, with our assumption being that it will be less burdensome on the host CPU to have it carried out on the card itself.
There will be additional capacity points in the future and VFCache will be more deeply integrated with EMC’s storage management products and with the FAST architecture. This could be a hint that EMC’s storage arrays will co-ordinate more actively with VFCache.
VFCache and Fusion-io
Fusion-io has sold to early adopters in EMC’s view. It believes Fusion-io-type server DAS approaches do not protect data against a server crash or provide data sharing. By storing the server’s data in a back-end array it can be protected via snapshots, replication, etc, and made available to other servers. Management is also easier. Mainstream server flash use won’t happen unless these data protection and management features are added.
Specifically, EMC says VFCache is less of a drain on server resources than a Fusion-io flash store because VFCache hands off flash and wear-level management to the PCIe card itself, whereas the host CPU does this for the Fusion-io product. It says Fusion-io CPU overhead could be up to 20 per cent higher than that with VFCache.
Development and Project Thunder
EMC will add deduplication technology to VFCache later this year, enabling an effective increase in capacity. Clearly this will best be carried out on the card to avoid burdening the host CPU. Whether that will be the case remains to be seen.
There will be larger capacity points for VFCache, possibly going beyond 300GB, and different form factors, adding blade environments to the current rack form factor. It will also integrate better with EMC storage management technologies, and there will be additional integration with FAST architecture. This means active co-ordination between caching by VFCache and EMC’s VMAX and VNX array FAST capabilities. We’ll be seeing:
- Enhanced VMAX/VNX array integration – hinting, tagging, pre-fetch for even greater performance
- Distributed cache coherency for active-active clustered environments
- VMAX and VNX management integration
EMC has also announced Project Thunder, a “low-latency, server networked flash appliance that is scalable, serviceable, and shareable.” It is intended to “deliver I/Os measured in millions and timed in microseconds.”
This suggests it will use a fast server-appliance interconnect such as InfiniBand, which Oracle uses in its Exadata systems, or some form of PCIe I/O virtualisation. EMC confirmed this and said it will be working with its customers in a second quarter early access program to determine the interconnect to use.
It will be “optimised for high-frequency, low-latency read/write workloads” and will build upon the PCIe technology in VFCache. In effect this will provide a combination of the functionality offered by Fusion-io and Violin Memory products; the speed of directly-attached flash and the sharability of a networked flash memory array.
El Reg thinks other mainstream storage, PCIe flash storage, server/storage and flash array vendors will be forced to follow suit, giving a boost to either InfiniBand or IOV high-speed server-storage interconnect technologies, or both.
Storage flash landscape
VFCache is a fast fix to the threat - and opportunity - posed by Fusion-io and other PCIe flash caching suppliers to EMC on the one hand, and networked flash arrays such as those from Violin Memory, WhipTail and Nimbus now, and startups like Pure Storage, SolidFire and XtremIO on the other.
As with the introduction of SSDs into storage arrays, EMC is the first mainstream vendor to jump on the server PCIe flash bandwagon. We understand Dell is actively working in this area and expect HP, IBM and NetApp to follow suit together with Fujitsu and, maybe, HDS. It represents a recognition that disk drive latency is no longer acceptable for primary data access and that network latency is also becoming unacceptable.
Virtualised, multi-socket, multi-core, multi-threaded servers demand faster I/O than Ethernet and Fibre Channel networked disk drive arrays can provide. EMC has seen that there is a need for mainstream enterprise-class data centre I/O speed and flash dash is the way to get it.
Inside the mind of EMC: Is storage just a launchpad? • The Register
A good analysis of where EMC is heading. Hint: it’s not just storage anymore.
It’s a vision thing: EMC was a storage company and is an information company, but in the next decade it looks like it will be a data centre infrastructure company.
This thought comes from a parsing of two Pat Gelsinger replies to an interview with EMC’s Mark Twomey, otherwise known as the blogger Storagezilla.
Reply number one was to the question “Between VMware, revenue leadership in Backup & Recovery from BRS and analytics from Greenplum is EMC still a storage company?” and here it is:
“Going forward EMC wants to be the most disruptive data centre infrastructure company in our industry. While today storage is our centre and our heritage, we’ve just shipped a new storage offering with VFCache and announced another with Thunder, increasingly virtualisation, security, management and analytics will complement that foundation to give us a broad data centre footprint.”
Hold this in your mind while we look at reply number two in answer to the question: “To you what does the competitive field look like?”:
“In some ways, no real change from the usual. NetApp, Hitachi, Dell, HP and so on but we do see two new areas of threat in storage. One is Huawei, the other is the bevy of Flash startups. We’ve spent time looking at them both. Beyond those we see that to be the leader in the data centre market we will be competing with bigger and broader IT players in the future.”
Huawei and Flash are the new areas of threat for EMC in storage. Huawei is a well-funded and determined Chinese storage supplier that has just bought out Huawei-Symantec, exiting the US in the process, and is, in the telecommunications area, something of a Chinese version of Cisco, and Cisco is concerned about Huawei as well.
For EMC, Huawei will be a competitor roughly analogous to HP, having servers, storage and networking and the ability to integrate them and extract synergies.
Flash is going to be the means to store primary data and get it shipped to servers faster than any other mainstream storage technology. I believe EMC thinks that flash will also become storage memory in servers and, as such, is likely to be sold primarily by server vendors. It will be software that will be a key technology here, with the ability to read and write data by applications in DRAM to data containers in storage memory (flash) without going through the host O/S’ disk-based I/O subsystem, looking to be a huge latency reduction factor.
Atomic writes
Two articles on atomic writes from the Wikibon consultancy – here and here – together with Fusion-io’s Auto-Commit Memory illustrate this line of thought.
The logical conclusion of this line of thought is that the long period of dominance in which primary data – both block and file – was stored in networked drive arrays is coming to an end. Disk drive arrays will become capacity vaults that protect and feed the flash stores. The Fibre Channel drive array mammoths are about to enter the deep freeze and primary data is going to be co-located with the servers whose apps process and generate that data.
One more jump needs to be made. The atomic write, bypassing the O/S IO subsystem idea, is between an app in a server’s DRAM and a data area in direct-attached server memory: DAS flash in other words. We’re looking forward to shareable flash arrays networked to servers by InfiniBand or some form of PCIe IO virtualisation. In that case the networked flash array can form all or part of a server’s storage memory, given that latency is low enough, and a server app’s atomic writes can be made to shared flash storage memory and not just to direct-attach storage memory.
Whichever supplier pulls this off and provides shareable storage memory arrays between servers – such that vMotion and the equivalents from other hypervisors becomes easily achievable – and also manages to hook the storage memory up to back-end disk drive arrays for data protection and wide area network data sharing, will be in a very strong position.
VMAXalytics and VMAXalogic
Back to the Gelsinger replies. First of all, this one: “To be the leader in the data centre market we will be competing with bigger and broader IT players in the future” – meaning the server-system vendors, like Oracle.
Secondly: “EMC wants to be the most disruptive data centre infrastructure company in our industry … increasingly virtualisation, security, management and analytics will complement that foundation to give us a broad data centre footprint.”
EMC cannot compete with companies like Oracle and the server-system companies like HP which are busy developing integrated stacks – let alone compete with Huawei – without developing its own integrated stack offerings. Of course it is doing this with VCE, which is really a partnership between EMC and Cisco, EMC owning 80 per cent of VMware.
Now let’s try putting EMC’s current and future lego blocks together with all this in mind. Cisco can provide the networking and servers and EMC the shared PCIe-speed capable flash arrays (Project Thunder). VMware, meanwhile, can offer the atomic write capability in flash super-charged Vblocks with VMAX, VNX or Isilon arrays for large-scale block and scale-out file and big data storage. The Oracle strategists have got to be assuming that VCE is going to bring out VMAXalytics- or VMAXalogic-type converged stack boxes to compete with their own Exalytics and Exalogic-engineered systems.
What EMC is doing is using storage as a launchpad to vault up into the same data centre supplier rankings as HP, IBM and Oracle.
What does this mean for other storage suppliers?
The worrying kind
Strategists and chief technology officers (CTOs) will be concerned that this will mark a seismic shift in the data centre and that the era of the stand-alone storage supplier is coming to an end. The stand-alone storage suppliers have got to get deep into the converged stack platform business or face the same fate as mammoths and sabre-toothed tigers.
Customers are going to buy storage as part of a system, like a car, and not as their own selected and integrated best-of-breed parts. Of course they might rent the system – from the cloud data centre equivalent of Hertz and Avis rent-a-car – but that doesn’t alter the basic fact.
Less paranoid CTOs and strategists will say: “Nonsense. Data centres are not like a car fleet. Enterprise customers will always go for best-of-breed because of the total cost of ownership advantages, performance and manageability advantages, etc.”
Will they? What if a flash super-charged Exalogic/Exalytics/Database appliance box is best of breed, and makes better use of its component software, server and storage resources than any bolted-together, DIY stack created by an enterprise or even a FlexPod-type consortium?
We’re in the realm of don’t know/can’t know for now, but one thing is for sure: if flash super-charged servers doing atomic writes run 10 times more virtual machines than current servers and are affordable, then customers are going to buy them. End of. And that will stimulate the end of the primary data-storing drive array as we know it.
So Mr Storage Supplier, are you going to crash the flash super-charged server party like EMC is doing, or will you sit this seismic shift out? Careful – you could be betting the future of your company.
Where to Put Flash for Enterprise Performance? (IDEAS Insights)
IBM and EMC represent two different approaches to enterprise-class flash-based storage.
The approach to utilizing flash for high-performance, enterprise use cases is evolving. Compute-intensive servers have already been enlisted to enable high-performance applications like enterprise resource planning (ERP), customer relationship management (CRM), online transaction processing (OLTP) databases, and more recently virtual desktop infrastructure (VDI). But a high-performance storage architecture to match the high performance compute has not yet been decided on. Flash seems to be the core of the solution, but where to put this flash for optimal enterprise performance? Announcements this week from EMC and IBM suggest two approaches: one server-based, one array-based.
One of the original strengths of SAN-based storage arrays was that they could support high-performance transactional databases at higher capacities. To improve performance, faster and faster hard-disk drives (HDDs) were employed in the array: 7,200 rpm, 10,000 rpm, 15,000 rpm. Eventually, flash-based solid-state drives (SSDs) were introduced with the ability to provide up to a 300X increase in IOPS over HDDs. Although more expensive, the performance benefits of flash SSDs have become too compelling not to support. This week, IBM announced SSD support for its XIV enterprise storage array, joining its own DS8000 series and every major storage vendor in offering flash SSD support for their enterprise class arrays.
But using flash as a storage tier in an array does not fully utilize its potential for performance. PCIe-based flash deployed in the server can offer up to a 20X increase in IOPS over array-based flash. Fusion-io has capitalized on this market with its ioMemory platform of flash-based PCIe cards aimed at accelerating high-performance applications, databases, and VDI, independent from the storage array. And just this past week, EMC’s announcement of VFCache, its own server-based flash product (previously code-named Lightning), shows that EMC is also embracing this approach and wants a piece of this growing market.
There are some drawbacks to current server-based flash solutions. Their capacities are smaller, they do not scale well in the server, they are not cache coherent between servers, and they cannot provide the enterprise data integrity and data management capabilities that flash in a storage array can. So while some of the highest-performance use cases may be moving to flash on a server, these issues will prevent primary storage from leaving the SAN any time soon.
There are also several flash-based solutions between the primary disk array and the server that offer different tradeoffs. “All flash” arrays like those from Violin Memory and Texas Memory Systems (TMS) can provide higher performance than typical tiered flash in an array, addressing the capacity and data integrity gaps of server-based flash, but at a higher cost. “SAN proxy appliances” like GridIron’s TurboCharger offer a less-pricey, high performance, flash-based caching solution for the SAN that can easily be dropped into an existing architecture. “Server network flash appliances,” such as EMC’s code-name Thunder, promise to address the limited scalability and shareability of server-based flash while retaining its higher performance.
As flash prices decrease, capacities increase, and as data integrity, cache coherency, and scalability are addressed, flash will continue its slow march towards the application. But it is clear today that flash in the server is an optimal solution for smaller data sets and highest performance, while flash in the array should be leveraged for larger or more mission-critical data sets, and a number of solutions in between can improve performance while balancing enterprise priorities with cost.
HP gives sysadmins a little mobility • The Register
HP is upping their game in the mobile administration department. Worth noting.
HP is embracing mobility with apps to allow sysadmins to receive alerts, manage systems and even shut down servers, all from the comfort of their booth seats at the pub.
HP already provides SiteScope, with which one can monitor servers and receive alerts on Android and iOS devices. But in a presentation at the HP Global Partner event in Las Vegas the company promised a good deal more functionality would be coming to mobile clients.
The new mobile management apps will be designed to show off the capabilities of HP’s just-announced Gen8 servers. Those servers apparently monitor 1,600 systems parameters, data which will be available on the mobile app.
But mobile users will also be able to create management scripts, PC World reports, allowing the user to automate common management tasks even if those tasks are unique to their environment. The app will even report the physical location of a failing server, which can only help.
It’s not clear when the app will be available for download, but the Gen8 servers will be available next month and HP will likely want to show off the mobile clients with them.
So there are still a few weeks to convince the company that an Android (or iOS) device is essential to modern system administration… if that’s not already obvious.
EMC Greenplum Hadoop elephant straddles Cisco iron • The Register
EMC and Cisco are releasing a pretty scary looking converged system featuring the ultra-fast Greenplum Hadoop database.
Well, that took long enough. Cisco Systems and the Greenplum big data unit of server partner EMC have finally gotten together and put the Greenplum wares on Cisco’s Unified Computing System servers.
In a blog posting, Raghunath Nambiar, an architect at Cisco’s Server Access and Virtualization Technology Group, reveals that the two partners in the Virtual Computing Environment Company have circled back and are now offering pre-configured Hadoop stacks that marry Cisco’s C-Series rack servers and Greenplum’s eponymous Greenplum MR Hadoop distribution.
Greenplum doesn’t like to talk about the hardware its data warehousing and Hadoop clusters run upon, mainly because EMC, as an independent disk array maker and the owner of server virtualization juggernaut VMware, has to position itself as Switzerland in the server racket. Before it was acquired by EMC in July 2010 for an undisclosed sum, Greenplum had run its heavily customized implementation of the PostgreSQL database, which was parallelized and juiced to run data warehouse clusters, on Sun Fire x86 servers from Sun Microsystems. This was a good choice at the time, given the large amount of disk capacity that Sun had crammed onto its Opteron and Xeon servers, but a bad choice in the long term because database rival Oracle ate Sun. In the wake of the Sun acquisition, Greenplum has certified its code to run on Dell, Hewlett-Packard, and Huawei Technologies x86 servers and OEMs this iron from those companies, depending on what customers want.
EMC did not, interestingly enough, plunk the Greenplum Modular Data Computing Appliance data warehouse or its Hadoop appliance, which is actually based on a rebadged Hadoop stack from MapR Technologies, on the Vblock server-storage clusters it cooked up with Cisco to chase server virtualization and private cloud business in data centers and now virtual desktops. While the B Series blade servers in the UCS family may not be suitable for Greenplum workloads, the C Series rack servers could certainly be configured in a Vblock by EMC and Cisco to run this Greenplum code, but were not.
Part of the problem was that Hadoop doesn’t use external storage, so there would be no EMC iron in such a Vblock. It is very likely that EMC and Cisco were waiting for Cisco to get a little more traction in the server racket – Cisco’s server business now has more than 10,000 customers and a $1bn annual revenue run rate that will probably nearly double in the next year – before committing the Greenplum wares to the UCS platform.
According to Nambiar, the fully integrated Cisco-EMC stack takes Cisco’s UCS C Series rack servers and its UCS 6200 converged server-storage 10GE switches and fabric interconnects and configures up the Greenplum MR Hadoop distro to run on the boxes. (This Hadoop distro is MapR’s M5 Hadoop distribution with the names changed.) The setups start at a single rack and can be expanded to cover multiple racks. The UCS 6200 switch links into UCS 2200 fabric extenders, and according to the reference architecture (PDF), the UCS C210 M2 server is the workhorse that Cisco and EMC have chosen to run Hadoop. The C210 M2 server was announced in March 2010 and is a two-socket box that uses Intel’s six-core Xeon 5600 processors and will no doubt be replaced by a new machine using Intel’s “Sandy Bridge-EP” Xeon E5 chip. The C210 M2 can support up to 192GB of DDR3 main memory and has room for 16 2.5-inch disk drives and one or two RAID disk controllers.
In a single-rack configuration, the Greenplum MR-UCS stack has two 48-port UCS 6248UP fabric interconnects and two 2232PP 10GE fabric extenders. These link down into 16 of the C210 M2 servers, which have 96GB of main memory and 16 1TB disk drives, an LSI MegaRAID 9261-8i disk controller, and a Cisco UCS P81F virtual interface card that presents two 10GE ports to the fabric extenders. Cisco is dropping in the six-core Xeon X5670 processors, which run at 2.93GHz. Each rack has 192 cores, 256TB of raw storage capacity, and up to 350TB of usable Hadoop capacity with three-way data replication across the nodes and data compression turned on. The nodes are configured with Red Hat Enterprise Linux Standard Edition.
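The per-rack totals quoted above can be checked from the per-node figures. The short Python sketch below is only a back-of-the-envelope calculation, not anything published by Cisco or EMC. The core and raw-capacity sums follow directly from the numbers in the article. The last step is an assumption of ours: it reads the 350TB figure as raw capacity divided by the replication factor and multiplied by a compression ratio, and works out what ratio that would imply. The function and constant names are hypothetical.

```python
# Per-node figures quoted in the article for the single-rack configuration.
SERVERS_PER_RACK = 16
SOCKETS_PER_SERVER = 2        # the C210 M2 is a two-socket box
CORES_PER_SOCKET = 6          # six-core Xeon X5670
DISKS_PER_SERVER = 16
DISK_CAPACITY_TB = 1
MEMORY_GB_PER_SERVER = 96

# Rack-level figures quoted in the article.
REPLICATION_FACTOR = 3        # three-way data replication
QUOTED_USABLE_TB = 350        # "up to 350TB" with compression turned on


def rack_totals():
    """Return (cores, raw capacity in TB, memory in GB) for one rack."""
    cores = SERVERS_PER_RACK * SOCKETS_PER_SERVER * CORES_PER_SOCKET
    raw_tb = SERVERS_PER_RACK * DISKS_PER_SERVER * DISK_CAPACITY_TB
    memory_gb = SERVERS_PER_RACK * MEMORY_GB_PER_SERVER
    return cores, raw_tb, memory_gb


def implied_compression(raw_tb):
    """Compression ratio implied by the quoted usable capacity.

    Assumption (ours, not the article's):
    usable = raw / replication factor * compression ratio.
    """
    uncompressed_usable_tb = raw_tb / REPLICATION_FACTOR
    return QUOTED_USABLE_TB / uncompressed_usable_tb


if __name__ == "__main__":
    cores, raw_tb, memory_gb = rack_totals()
    print(f"cores per rack:       {cores}")
    print(f"raw capacity (TB):    {raw_tb}")
    print(f"memory per rack (GB): {memory_gb}")
    print(f"implied compression:  {implied_compression(raw_tb):.1f}x")
```

The core and raw-capacity results match the 192 cores and 256TB in the article. Under the stated assumption, the "up to 350TB" figure would correspond to a compression ratio of roughly 4x, which is a best case that depends heavily on the data being stored.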