IEEE Xplore Download
A really interesting article by Intel for IEEE Xplore on the future of big data in the entertainment industry. Despite the entertainment focus, there are some good points about big data in general.
AnandTech - Intel's Haswell Architecture Analyzed: Building a New PC and a New Intel
Great analysis of Intel’s coming troubles - and what they’re trying to do to head them off.
When I first started writing about x86 CPUs Intel was on the verge of entering the enterprise space with its processors. At the time, Xeon was a new brand, unproven in the market. But it highlighted a key change in Intel’s strategy for dominance: leverage consumer microprocessor sales to help support your fabs while making huge margins on lower volume, enterprise parts. In other words, get your volume from the mainstream but make your money in the enterprise. Intel managed to double dip and make money on both ends, it just made substantially more in servers.
Today Intel’s magic formula is being threatened. Within 8 years many expect all mainstream computing to move to smartphones, or whatever other ultra portable form factor computing device we’re carrying around at that point. To put it in perspective, you’ll be able to get something faster than an Ivy Bridge Ultrabook or MacBook Air, in something the size of your smartphone, in fewer than 8 years. The problem from Intel’s perspective is that it has no foothold in the smartphone market. Although Medfield is finally shipping, the vast majority of smartphones sold feature ARM based SoCs. If all mainstream client computing moves to smartphones, and Intel doesn’t take a dominant portion of the smartphone market, it will be left in the difficult position of having to support fabs that no longer run at the same capacity levels they once did. Without the volume it would become difficult to continue to support the fab business. And without the mainstream volume driving the fabs it would be difficult to continue to support the enterprise business. Intel wouldn’t go away, but Wall Street wouldn’t be happy. There’s a good reason investors have been reaching out to any and everyone to try and get a handle on what is going to happen in the Intel v ARM race.
To make matters worse, there’s trouble in paradise. When Apple dropped PowerPC for Intel’s architectures back in 2005 I thought the move made tremendous sense. Intel needed a partner that was willing to push the envelope rather than remain content with the status quo. The results of that partnership have been tremendous for both parties … What once was the perfect relationship, is now on rocky ground.
The A6 SoC in Apple’s iPhone 5 features the company’s first internally designed CPU core. When one of your best customers is dabbling in building CPUs of its own, there’s reason to worry. In fact, Apple already makes the bulk of its revenues from ARM based devices. In many ways Apple has been a leading indicator for where the rest of the PC industry is going (shipping SSDs by default, moving to ultra portables as mainstream computers, etc…). There’s even more reason to worry if the post-Steve Apple/Intel relationship has fallen on tough times. While I don’t share Charlie’s view of Apple dropping Intel as being a done deal, I know there’s truth behind his words. Intel’s Ultrabook push, the close partnership with Acer and working closely with other, non-Apple OEMs is all very deliberate. Intel is always afraid of customers getting too powerful and with Apple, the words too powerful don’t even begin to describe it.
What does all of this have to do with Haswell? As I mentioned earlier, Intel has an ARM problem and Apple plays a major role in that ARM problem. Atom was originally developed not to deal with ARM but to usher in a new type of ultra mobile device. That obviously didn’t happen. UMPCs failed, netbooks were a temporary distraction (albeit profitable for Intel) and a new generation of smartphones and tablets became the new face of mobile computing. While Atom will continue to play in the ultra mobile space, Haswell marks the beginning of something new. Rather than send its second string player into battle, Intel is starting to prep its star for ultra mobile work.
Haswell is so much more than just another new microprocessor architecture from Intel. For years Intel has enjoyed a wonderful position in the market. With its long term viability threatened, Haswell is the first step of a long term solution to the ARM problem. While Atom was the first “fast-enough” x86 micro-architecture from Intel, Haswell takes a different approach to the problem. Rather than working from the bottom up, Haswell is Intel’s attempt to take its best micro-architecture and drive power as low as possible.
…
In the middle of 2011 Intel announced its Ultrabook initiative, and at the same time mentioned that Haswell would shift Intel’s notebook design target from 35 - 45W down to 10 - 20W.
At the time I didn’t think too much about the new design target, but everything makes a lot more sense now. This isn’t a “simple” architectural shift, it’s a complete rethinking of how Intel approaches platform design. More important than Haswell’s 10 - 20W design point is the new expanded SoC design target. I’ll get to the second part shortly.
There will be four client focused categories of Haswell, and I can only talk about three of them now. There are the standard voltage desktop parts, the mobile parts and the ultra-mobile parts: Haswell, Haswell M and Haswell U. There’s a fourth category of Haswell that may happen but a lot is still up in the air on that line.
Of the three that Intel is talking about now, the first two (Haswell/Haswell M) don’t do anything revolutionary on the platform power side. Intel is promising around a 20% reduction in platform power compared to Sandy Bridge, but not the order of magnitude improvement it promised at IDF. These platforms are still two-chip solutions with the SoC and a secondary IO chip similar to what we have today with Ivy Bridge + PCH.
It’s the Haswell U/ULT parts that bring about the dramatic change. These will be a single chip solution, with part of the voltage regulation typically found on motherboards moved onto the chip’s package instead. There will still be some VR components on the motherboard as far as I can tell; it’s the specifics that are lacking at this point (which seems to be much of the theme of this year’s IDF).
Seven years ago Intel first demonstrated working silicon with an on-chip North Bridge (now commonplace) and on-package CMOS voltage regulation.
The benefits were two-fold: 1) Intel could manage fine grained voltage regulation with very fast transition times and 2) a tangible reduction in board component count.
The second benefit is very easy to understand from a mobile perspective. Fewer components on a motherboard means smaller form factors and/or more room for other things (e.g. larger battery volume via a reduction in PCB size).
The first benefit made a lot of sense at the time when Intel introduced it, but it makes even more sense when you consider the most dramatic change to Haswell: support for S0ix active idle.
…
Your smartphone and tablet both fetch emails, grab Twitter updates, receive messages and calls while in their sleep state. The prevalence of always-on wireless connectivity in these devices makes all of this easy, but the PC/smartphone/tablet convergence guarantees that if the PC doesn’t adopt similar functionality it won’t survive in the new world.
The solution is connected standby or active idle, a feature supported both by Haswell and Clovertrail as well as all of the currently shipping ARM based smartphones and tablets. Today, transitioning into S3 sleep is initiated by closing the lid on your notebook or telling the OS to go to sleep. In Haswell (and Clovertrail), Intel introduced a new S0ix active idle state (there are multiple active idle states, e.g. S0i1, S0i3). These states promise to deliver the same power consumption as S3 sleep, but with a quick enough wake up time to get back into full S0 should you need to do something with your device.
…
With Haswell U/ULT parts, Intel will actually go in and specify recommended components for the rest of the platform. I’m talking about everything from voltage regulators to random microcontrollers on the motherboard. Even more than actual component “suggestions”, Intel will also list recommended firmwares for these components. Intel gave one example where an embedded controller on a motherboard was using 30 - 50mW of power. Through some simple firmware changes Intel was able to drop this particular controller’s power consumption down to 5mW. It’s not rocket science, but this is Intel’s way of doing some of the work that its OEM partners should have been doing for the past decade. Apple has done some of this on its own (which is why OS X based notebooks still enjoy tangibly longer idle battery life than their Windows counterparts), but Intel will be offering this to many of its key OEM partners and in a significant way.
Intel’s focus on everything else in the system extends beyond power consumption - it also needs to understand the latency tolerance of everything else in the system. The shift to active idle states is a new way of thinking. In the early days of client computing there was a real focus on allowing all off-CPU controllers to work autonomously. Years of evolution along those lines resulted in platforms where any and everything could transact data whenever it wanted to.
By knowing how latency tolerant all of the controllers and components in the system are, hardware and OS platform power management can begin to align traffic better. Rather than everyone transacting data whenever it’s ready, all of the components in the system can begin to coalesce their transfers so that the system wakes up for a short period of time to do work then quickly returns to sleep. The result is a system that’s more frequently asleep with bursts of lots of activity rather than frequently kept awake by small transactions.
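To make the coalescing idea concrete for myself, here is a minimal sketch in Python. It is not Intel's mechanism: the device names, ready times and latency tolerances are all made up, and the scheduling rule (wake at the earliest pending deadline and service everything that is ready by then) is just one simple way to illustrate the principle.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    device: str        # hypothetical device name
    ready_ms: int      # time at which the device has data ready
    tolerance_ms: int  # how long the device can wait before it must be serviced


def coalesce(transfers):
    """Group transfers into as few wake-ups as the latency tolerances allow.

    Each wake-up fires at the earliest deadline among the pending transfers
    and services every transfer that is already ready at that moment.
    """
    # Sort by deadline (ready time plus tolerance), earliest first.
    pending = sorted(transfers, key=lambda t: t.ready_ms + t.tolerance_ms)
    wakeups = []
    while pending:
        wake_at = pending[0].ready_ms + pending[0].tolerance_ms
        batch = [t for t in pending if t.ready_ms <= wake_at]
        pending = [t for t in pending if t.ready_ms > wake_at]
        wakeups.append((wake_at, [t.device for t in batch]))
    return wakeups


# Illustrative numbers only; none of these come from the article.
transfers = [
    Transfer("wifi", 0, 100),
    Transfer("usb", 20, 200),
    Transfer("audio", 90, 50),
    Transfer("storage", 150, 300),
    Transfer("sensor", 400, 100),
]

for wake_at, devices in coalesce(transfers):
    print(f"wake at {wake_at} ms, service: {', '.join(devices)}")
```

With these made-up numbers the five transfers are serviced in two wake-ups rather than five, and no transfer waits longer than its stated tolerance. The less tolerant a device is, the less room there is to batch, which is presumably why Intel cares about the latency tolerance of every component.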
Tilera preps many-cored Gx chips for March launch • The Register
It’s always nice to see an alternative to x86. Tilera is producing a massively multi-core RISC-based architecture that looks very interesting and has already attracted a number of customers.
Upstart multicore RISC chip maker Tilera is timing the launch of its third generation of Tile processors to rain a little on Intel’s forthcoming parade, and to try to blunt all of the excitement that is building for ARM-based alternatives for servers.
Tilera will today begin sampling of its Tile-Gx series of processors. As El Reg suspected back in June 2011 - when Tilera announced it was actually launching three different lines of Tile-Gx chips: Gx3000s for servers, Gx5000s for heavy media processing, and Gx8000s for network equipment makers – all three lines are based on Gx8000 chips with certain features deprecated and different pricing.
That means Tilera can offer variants of the chips with 16, 36, 64, and 100 cores and only have to do four chip layouts instead of as many as a dozen. It is the full-on Tile Gx8000 chips with 16 and 36 cores that are in fact sampling now at 1.2GHz, Bob Doud, director of marketing at the upstart chippery, tells El Reg.
All three generations of Tilera processors have the same idea behind them: use simple RISC cores tuned for Linux infrastructure workloads, put lots of them on a chip, and link them together using an on-chip mesh network that makes all of those cores look like a single, monster, multithreaded processor to the Linux kernel…
Tilera does not do SMP to increase the performance of a server node, but rather uses the on-chip mesh to build a bigger socket image with more physical threads.
Each core on the new Tile-Gx chip has three instruction threads and has 32KB of L1 data cache and 32KB of L1 instruction cache, and also has a 256KB L2 cache; the mesh network is used to link those L1 and L2 caches into a single, coherent L3 cache shared by all the cores on the chip - so the top-end, 100-core variant of the Tile-Gx chip has 32MB of total cache.
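A quick sanity check of the cache figure, as a small Python sketch. The per-core sizes and the core counts are the ones quoted in the article; the assumption that the total is simply the sum of every core's L1 and L2 is mine.

```python
# Per-core cache sizes quoted in the article, in KB.
L1_DATA_KB = 32
L1_INSTR_KB = 32
L2_KB = 256


def total_cache_kb(cores):
    """Total on-chip cache, assuming it is just the per-core caches added up."""
    return cores * (L1_DATA_KB + L1_INSTR_KB + L2_KB)


for cores in (16, 36, 64, 100):
    kb = total_cache_kb(cores)
    print(f"{cores:3d} cores: {kb} KB ({kb / 1024:.2f} MB at 1024 KB per MB)")
```

For the 100-core part this gives 32,000 KB, which is the article's 32MB if you count 1,000 KB to the megabyte and a little over 31 MB if you count 1,024.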
The Tile-Gx also has math instructions that allow a floating point operation to be done in five cycles instead of hundreds of cycles when done in software, and believe it or not, this is important for some hyperscale Web applications built using PHP.
Doud says that the ramp for the Tilera chips has been pretty steep, with over 80 engagements with system and network equipment vendors of all colors and stripes, and 20 design wins where the company has committed to use a Tile processor. Embedded system maker Mercury Computer and video streaming equipment maker Harmonic have gone public admitting that they are using Tile chips in their gear.
Ihab Bishara, director of cloud computing applications at Tilera, says that three of the largest hyperscale data centers in the world have deployed Tile-based servers. With the Tile-Gx line, the 64-bitness and floating point instructions are attracting more interest, with a number of OEMs and ODMs placing orders for the chips even before they were sampling - even though the Tile chips have their own proprietary interconnect.
“Our view is, it is our ISA, get over it,” says Doud, and for the Linux crowd that compiles its own applications anyway, he makes a good point. (Jumping to ARM chips will require a recompile, too, after all.) “Once you have a chip that is supporting C, C++, Java, and PHP and you’re running Linux, it doesn’t matter. People are not writing assembler programs.”
Well, there are probably a few card-wallopers out there who are in mainframeland.
Tilera is putting the finishing touches on a Java JIT compiler, which should be done by the end of the first quarter, according to Bishara – and just in time to take on big Java workloads like Hadoop. The Tilera Linux stack is based on a derivative of CentOS that has around 2,000 packages ported over to run natively on the chips.
Tilera doesn’t just expect to sell Tile-Gx processors as the main engines inside of systems. In some cases, customers will want to use them as offload engines. To that end, the company has cooked up an evaluation adapter card that slips into a PCI-Express 1.0 or 2.0 slot and runs the Tilera Linux environment…
If you are really serious about putting the Tile-Gx chips through the server paces, Tilera will get you what it calls its Liberty-Gx platform, which crams four of these microserver boards into a single 1U rack machine.
The Tile-Gx processors sampled last July in limited quantities to selected partners, and alpha evaluation boards shipped in September. The company racked up ten design wins for the chip by November and has decided to “open up the flood gates” and do much more sampling in February with volume shipments to begin in March. The full-on Gx8016 is expected to cost around $450, with the Gx8036 at around $650. Presumably the parts aimed at servers will cost less, since some features are deactivated.
The 64-core and 100-core variants of the Tile-Gx chips will sample in late 2012 and ship sometime in the first half of 2013, according to Doud, and the company is on track with its 200-core “Stratton” chips with a shrink to 28 nanometers.
Bishara says that Tilera is not threatened by ARM contenders in the server racket, such as Calxeda with its 32-bit ARMv7 variant, called EnergyCore, or Applied Micro Circuits with its 64-bit ARMv8 variant, called X-Gene.
“We’re here today shipping a 64-bit processor core and we are what looks like two years ahead of ARM,” says Bishara. “The architecture of the Tile-Gx is aligned to the workload and gives one server node per chip rather than a sea of wimpy nodes not acting in a cache coherent manner. We have been in this market for two years now and we know what hurts in data centers and what works. And 32-bit ARM just is not going to cut it. Applied Micro is doing their own core, and that adds a lot of risks.”
Tilera should know a thing or two about that. It didn’t just do its own cores, but its own instruction set and what really is a system on a chip.
No one knows how this will turn out, with server makers just trying to make a buck and take as few risks as possible. But one thing is for sure. Intel and AMD have a lot more problems than just each other from here forward.
Petaflops beater: Nvidia chief talks exascale • The Register
An interesting discussion of computing, energy consumption, and how new architectures will enable exascale computing.
“Power is now the limiter of every computing platform, from cellphones to PCs and even data centres,” said NVIDIA chief executive Jen-Hsun Huang, speaking at the company’s GPU Technology Conference in Beijing last week. There was much talk there about the path to exascale, a form of supercomputing that can execute 10^18 flop/s (Floating Point Operations per Second).
Currently, the world’s fastest supercomputer, Japan’s K computer, achieves 10 petaflops (one petaflop = a thousand trillion floating point operations per second), just 1 per cent of exascale. The K computer consumes 12.66MW (megawatts), and Huang suggests that a realistic limit for a supercomputer is 20MW, which is why achieving exascale is a matter of power efficiency as well as size. At the other end of the scale, power efficiency determines whether your smartphone or tablet will last the day without a recharge, making this a key issue for everyone.
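The size of the efficiency gap is easy to work out from the numbers in that paragraph. A small Python sketch, using only the article's figures (10 petaflops at 12.66MW for the K computer, 10^18 flop/s within Huang's suggested 20MW limit); the function name is mine.

```python
def gflops_per_watt(flops_per_second, watts):
    """Power efficiency in billions of floating point operations per second per watt."""
    return flops_per_second / watts / 1e9


# K computer: 10 petaflops at 12.66 MW (figures from the article).
k_computer = gflops_per_watt(10e15, 12.66e6)

# Exascale target: 10^18 flop/s inside Huang's suggested 20 MW limit.
exascale = gflops_per_watt(1e18, 20e6)

print(f"K computer:      {k_computer:.2f} GFLOPS per watt")
print(f"Exascale @ 20MW: {exascale:.2f} GFLOPS per watt")
print(f"Improvement needed: about {exascale / k_computer:.0f}x")
```

So exascale is 100 times the K computer's performance, but because the power budget only grows from 12.66MW to 20MW, efficiency has to improve by roughly 63 times, from under 1 GFLOPS per watt to 50.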
Huang’s thesis is that the CPU, which is optimised for single-threaded execution, will not deliver the required efficiency. “With four cores, in order to execute an operation, a floating point add or a floating point multiply, 50 times more energy is dedicated to the scheduling of that operation than the operation itself,” he says.
“We believe the right approach is to use much more energy-efficient processors. Using much simpler processors and many of them, we can optimise for throughput. The unfortunate part is that this processor would no longer be good for single-threaded applications. By adding the two processors, the sequential code can run on the CPU, the parallel code can run on the GPU, and as a result you can get the benefit of the both. We call it heterogeneous computing.”
He would say that. NVIDIA makes GPUs after all. But the message is being heard in the supercomputing world, where 39 of the top 500 use GPUs, up from 17 a year ago, and including the number 2 supercomputer: Tianhe-1A in China. Thirty-five of those 39 systems use NVIDIA GPUs.
At a mere 2.57 petaflops though, Tianhe-1A is well behind the K computer, which does not use GPUs. Does that undermine Huang’s thesis? “If you were to design the K computer with heterogeneous architecture, it would be even more,” he insists. “At the time the K computer was conceived, almost 10 years ago, heterogeneous was not very popular.”
Using GPUs for purposes other than driving a display is only practical because of changes made to the architecture to support general-purpose programming. NVIDIA’s system is called CUDA and is programmed using CUDA C/C++. The latest CUDA compiler is based on LLVM, which makes it easier to add support for other languages. In addition, the company has just announced that it will release the compiler source code to researchers and tool vendors. “It’s open source enough that anybody who would like to develop their target compiler can do it,” says Huang…
The distinction between driving a display and general-purpose programming is blurring. As game visuals become more advanced, more of the code is devoted to simulating real-world physics. “The combination of simulation and visualisation is going to transform how people enjoy games,” Huang says.
In the same way, designers and engineers with workstations can use GPU accelerators to render accurate simulations of their designs. NVIDIA Maximus uses two GPUs, one from its Tesla line for general purpose programming and the other a Quadro for the display. “Now the workstation is completely changed because it can combine the workflow of two parts of the design, the design part, and the simulation part,” claims Huang.
Huang is looking forward to Windows on ARM. He talks about the Asus Transformer tablet and its long battery life, and then says: “Imagine Windows on ARM on that device, and next-generation versions of that device. It’s a foregone conclusion that the PC industry will be revolutionised. I’m anxious to see Windows on ARM come to market and I think Microsoft is going to be very successful with it.”
There are a few clouds on NVIDIA’s horizon. One is that ARM, which dominates the world of mobile CPUs, is now also designing mobile GPUs, under the brand Mali. That could undermine NVIDIA’s Tegra business, a SoC (System on a Chip) which combines an ARM CPU with an NVIDIA GPU. Huang does his best to dismiss Mali as having only “basic capabilities”. He adds, “We have to continue to find our value-add, if we don’t then we don’t have a role in the world.”
Huang will not be drawn on the subject of Kepler, his company’s next generation GPU family, which seems to be delayed though only in a notional sense since no date has been announced.
There is also Intel to think about. Intel’s multi-core evangelist James Reinders says its forthcoming “Knights Corner” MIC (Many Integrated Core) processor will solve the efficiency issues Huang describes. “Knights Corner is superior to any general-purpose GPU type solution for two reasons,” Reinders tells us.
“We don’t have the extra power-sucking silicon wasted on graphics functionality when all we want to do is compute in a power efficient manner, and - second - we can dedicate our design to being highly programmable because we aren’t a GPU - we’re an x86 core, a Pentium-like core for “in order” power efficiency - every algorithm that can run on GPGPUs will certainly be able to run on a MIC co-processor.”
“MIC used to be a GPU,” says Huang when asked about Intel’s co-processor. “MIC is Larrabee 3, and Larrabee 1 was a GPU. So there is no difference, except of course that we care very much about GPU computing, and we believe this is going to be the way that high performance computing is performed.”
NVIDIA’s other advantage? CUDA is available now.
Oracle fires Itanium countersuit at HP • The Register
More legal shenanigans from HP and Oracle over Itanium. Oracle claims that HP is paying Intel to continue developing Itanium.
Late Friday, Oracle filed a countersuit against HP, which sued Oracle back in June because Oracle said in March that it would not be developing future releases of its database, middleware, and application software on future Itanium processors.
It’s hard to tell who is stretching the truth more in the ongoing lawsuit, and now countersuit, between Hewlett-Packard and Oracle over the fate of the Itanium processor from Intel. The reason is that the court documents coming out describing the situation are heavily redacted, with all the juicy bits that might offer some clarity being blacked out.
In the amended cross-complaint filed last Friday, which was posted on the Scribd document sharing site, Ellison & Co’s lawyers are slapping back at HP with seven counts, including charges of fraud, defamation, intentional interference with contractual relations, intentional interference with prospective economic advantage, as well as violation of the Lanham Act and two violations of the California Business and Professions Code.
“HP engaged in a multi-year campaign of secrecy and deception designed to conceal the truth about Intel Corporation’s commitment to the Itanium microprocessor in order to extend its Itanium server business at Oracle’s expense and reap large profits from its own unsuspecting installed base of Itanium users,” Oracle lawyers wrote in the brief sent to Judge James Kleinberg of the California Superior Court in Santa Clara.
“When Oracle announced the truth about Itanium – that Intel’s strategic focus was not on Itanium but on its competing Xeon line of microprocessors, and that Itanium was nearing its end of life – HP reacted with a ferocious effort to foment false customer outrage and to vilify and defame Oracle, all to buy itself more time to milk its customer base and falsely blame Oracle for Itanium’s demise.”
Oracle says that in the process of document discovery in the Itanium case, it stumbled upon an agreement whereby HP is paying Intel to keep the Itanium processor alive – something Oracle says it did not know when it made the decision to pull software support from the future Itanium processors back in March. Oracle’s beef is that this revised “Itanium Collaboration Agreement” was done secretly, without partners or customers being told what the deal is.
“There is, of course, nothing wrong with entering into a contract with a supplier to ensure the supply of a key input,” Oracle said in its countersuit. “Had HP simply entered into the Intel deal and revealed it – perhaps taken credit for it – Oracle would have nothing to complain about.” Oracle contends that “Intel desperately wanted out of Itanium” and that this as well as the HP agreement to essentially pay Intel to keep the Itanium roadmap alive was something that it was entitled to as an HP and Intel partner and that HP’s customers (who are often users of Oracle’s software as it turns out) are similarly entitled to.
By torpedoing the Itanium platform, Oracle can sink a big portion of HP’s enterprise systems profits, which come from HP-UX system sales and support contracts. Oracle has shown no love to HP since it acquired the Sun Microsystems hardware and operating system business and it got worse when HP fired Larry Ellison’s tennis buddy, Mark Hurd, as CEO.
It got even worse when HP hired former SAP CEO Leo Apotheker to replace him and former Oracle president Ray Lane to be its chairman. Once Hurd came into Oracle as co-president, it was a matter of months before the gloves were off. Whatever the legal, technical, or market merits of Oracle’s moves with regard to Itanium, the intended effect has been realized: HP’s Unix business is shrinking. Then again, so is Oracle’s Unix business, as the latest Gartner server figures show. So far, IBM seems to be the big winner in the tit-for-tat legal spat between these two companies.
Oracle is also filing its countersuit against HP because it says that it was fraudulently induced into entering into an agreement that allowed it to hire Hurd after he had been let go from HP. It claims that HP concealed and misrepresented the “truth about Itanium” and concealed “material information” that it was about to hire Apotheker and Lane to run the company.
Oracle also reminded everyone that HP tried to add clauses to the Hurd agreement that would guarantee HP’s access to Java, its ability to sell Solaris on x86 platforms, and ongoing support from Oracle for its software stack on HP-UX. This language was struck from the commitment reaffirmation portion of the Hurd agreement, and in a draft supplied by Oracle, all that was left was this:
“Oracle and HP reaffirm their commitment to their longstanding strategic relationship and their mutual desire to continue to support their mutual customers. Oracle will continue to offer its product suite on HP platforms and HP will continue to support Oracle products (including Oracle Enterprise Linux and Oracle VM) on its hardware in a manner consistent with that partnership.”
The actual Hurd agreement remains under seal, so we don’t know what it says. But this portion of the agreement, however it was worded, is the clause in the agreement that HP’s lawyers are arguing is a commitment by Oracle to continue to support its software on HP-UX/Itanium machines made by HP. Oracle is seeking a rescission of the Hurd hiring agreement in its countersuit.
Incidentally, Oracle’s countersuit says that HP’s allegations in its lawsuit from June that Oracle is withholding support to current Itanium customers on current Oracle software are “utterly false” and that “Oracle is fully supporting the current (and many past) versions of its software on Itanium servers, by issuing bug-fixes per its standard policies.”
In the wake of Oracle’s countersuit, HP put out a lengthy statement of its own.
Interestingly, in the week before the Hurd hiring agreement was signed on September 20, HP says that Oracle’s general counsel wrote in an email that this provision was “an agreement to continue to work together as the companies have – with Oracle porting products to HP’s platform and HP supporting the ported products and the parties engaging in joint marketing opportunities – for the mutual benefit of customers.”
While much remains murky in this suit and countersuit, what seems clear at this point is that we are going to have a Bill Clinton verb definition moment like that during the ex-President’s impeachment. It will all depend on what the definition of the word “support” is.
Oracle will no doubt argue that it is continuing to support HP-UX and Itanium with current and prior releases of its database, middleware, and application software. HP will no doubt argue that what the clause meant was that Oracle would continue to port future releases to future Itanium chips and HP-UX releases.
HP continues to contend, and reiterated in its statement, that Oracle wants to move Itanium server customers to its own Sun systems and that the “tactics employed by Oracle in support of this purpose included pricing misconduct, withholding of benchmarking scores for HP servers run on Oracle software, and abusing customers on support issues.”
The HP-Oracle lawsuit is scheduled for trial on April 2, 2012, and will also probably have both sides arguing about how long a proper server chip roadmap needs to be so it is not at “end of life,” and what it means if HP is indeed paying Intel to keep the Itanium chip alive. It will be interesting to see what that is costing HP and how long that commitment term is for, if it turns out to be true.
Release the brakes on your virtual servers • The Register
The performance costs of virtualization, or “virtualization overhead”.
One of the dirty little secrets of virtualisation is the performance cost: operating systems running inside a virtual machine are slower than those running natively on the same hardware, sometimes by quite some margin.
This is termed virtualisation overhead, and with current whole-system virtualisation, it’s a given. It always happens. The question is, how much.
Back in the days of ESX Server 3, VMware itself admitted that integer performance suffered an overhead of up to six per cent and more complex CPU operations up to 18 per cent. It claimed that Xen 3 was about twice as bad.
These days, things are not so serious, and the performance differential between the main hypervisor vendors has mostly evened out.
Stiff competition
Even so, bear in mind that this was on a single host with a single virtual machine. Contention for shared resources is also a significant issue, especially when it comes to disk storage where virtual machines often share a single drive or array.
Most current virtualisation on x86 is whole-system virtualisation: each virtual machine is a complete emulated PC containing a complete PC operating system. The virtual machine’s “disks” are actually files in a file system managed by a different operating system.
All the componentry of that nice uniform hardware platform that means you can move virtual machines from host to host – network cards, motherboard chipset, graphics adaptor and so on – is not nice fast hardware; it is software emulation running as part of the hypervisor.
Hardware extensions
Gradually, x86 chips are acquiring hardware extensions to assist in this emulation. The first generation of hardware virtualisation assist extensions was Intel’s VT, introduced in some of the last models of Pentium 4, the 662 and 672, in 2005. AMD-V followed with the Socket AM2 Athlons in 2006.
This hardware virtualisation merely allowed hypervisors to create a “Ring minus-one” – essentially, trapping Ring 0 (kernel-mode) code and running it through a software emulator. The CPU-intensive process of mapping virtual machines’ memory to the host’s physical memory still had to be done in software.
This changed in 2007 with the arrival of AMD’s second wave of hardware virtualisation, Rapid Virtualization Indexing (RVI), in the Barcelona generation of Athlons and Opterons. RVI provides the hypervisor with shadow page tables in hardware.
Page tables hold the map that translates addresses in an operating system’s memory layout to physical RAM addresses. But from inside a virtual machine, these emulated physical addresses are actually blocks of the host’s RAM. RVI’s second level of indirection accelerates the translation of memory addresses inside virtual machines to real physical memory addresses.
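The two levels of translation described above can be pictured with a toy model. This is only a sketch: real x86 page tables are multi-level structures walked by the MMU, whereas here each table is a plain dictionary, and the 4 KiB page size, the table contents and all function names are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def translate(page_table, address):
    """Map an address through a flat page table (page number -> frame number)."""
    page, offset = divmod(address, PAGE_SIZE)
    if page not in page_table:
        raise KeyError(f"page fault at {address:#x}")
    return page_table[page] * PAGE_SIZE + offset


# First level, owned by the guest OS: guest-virtual page -> guest-"physical" page.
guest_page_table = {0: 7, 1: 3}

# Second level, owned by the hypervisor: guest-physical page -> host-physical frame.
nested_page_table = {7: 42, 3: 19}


def guest_virtual_to_host_physical(address):
    """Apply both mappings in turn, as nested paging does in hardware."""
    guest_physical = translate(guest_page_table, address)
    return translate(nested_page_table, guest_physical)


host_address = guest_virtual_to_host_physical(0x0010)
print(hex(host_address))
```

Without hardware support, the hypervisor has to combine the two dictionaries into a single shadow table itself and keep it in sync whenever the guest changes its own mappings, which is the CPU-intensive work mentioned earlier.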
This makes little difference to a virtual machine’s pure CPU performance, but significantly enhances memory-intensive workloads, to the tune of 42 to 48 per cent.
Intel’s equivalent is called Extended Page Table and appeared with the Nehalem-family Core i3, i5 and i7 processors in late 2008.
For the moment, this is all hardware can do to help. The remaining techniques are a matter of software and system configuration.
In praise of paravirtualisation
Emulation is expensive so another good way to boost performance is to avoid it. From the virtual machine perspective, one way to do this is to modify the guest operating system or its drivers to be aware that they are running in a hypervisor.
For example, a guest operating system can be provided with special drivers that talk directly to the virtual network connecting the virtual machines to the host machine, rather than emulating a physical network card.
Similar methods can be applied to storage (such as SCSI and iSCSI), graphics, input devices and even memory management.
Microsoft calls this Enlightened I/O for its Hyper-V Server; support is built into Vista SP1, Windows Server 2008 and later, and drivers are available for Windows Server 2003, SUSE Linux Enterprise Server 10 SP3 and Red Hat Enterprise Linux 5.2 to 5.5.
VMware had a similar approach, the Virtual Machine Interface, which allowed Linux guests to communicate with the hypervisor, but this has now been outpaced by hardware virtualisation.
VMware also offers the vmxnet virtual NIC, as well as enhanced vmxnet2 and vmxnet3 drivers that offer TCP Offload Engine acceleration to virtual machines on suitably equipped hosts.
For Xen, there are PV drivers and for KVM, Virtio, which both offer analogous functionality.
Just passing through
The final, and in some ways most drastic, step is to avoid the emulation overhead by directly connecting virtual machines to physical hardware.
The simplest and theoretically cleanest way of doing this is by offloading storage, for instance, to a SAN; a virtual machine accessing a SAN in principle suffers no more slowdown than a physical server would.
As in the case of a physical server accessing storage over the network, though, this ideally means dedicating network interfaces to storage – which may mean adding multiple network interface cards to the host and configuring dedicated routes between virtual machines and networked storage devices.
Hyper-V also supports pass-through disks, where a virtual machine can directly control a dedicated LUN of a storage device on the host machine.
Windows Server 2008 R2 adds a new feature, Cluster Shared Volumes, which allows multiple hosts to share access to a single storage LUN, adding a degree of scalability to pass-through disks.
VMware currently takes the prize in this department, though, with its ability to directly dedicate not only SCSI controllers, but as of ESX 4, entire physical PCI and PCIe devices to a specific virtual machine.
The VMDirectPath feature allows one or two PCI cards in the host machine to be connected to the operating system running in a specific virtual machine rather than being managed by the hypervisor itself – from a simple USB controller to a dedicated physical NIC or storage device.
A slightly more modest optimisation is to place the swapfiles of Windows virtual machines directly onto the host’s VMFS storage.
There is always a price to pay, though, and this one is a biggie. There is a significant drawback in attaching dedicated devices to virtual machines, whether they are just disk partitions on the host server or physical interface cards and any attached devices.
Although such techniques can deliver pretty much full native performance, they hinder some of the key advantages of virtualisation: the ability to snapshot virtual machines for backup purposes, duplicate them and migrate them from one host to another.
At best, virtual machines accessing external, non-virtual resources often need to be shut down before migration, and snapshots must also duplicate any external resources – thus removing the scalability and fault-tolerance of virtualisation.
Differences of opinion
Virtualisation is now a key part of the x86 platform and it is not going to go away again. Further hardware advances will continue to improve speed and reduce overhead but there’s a long way to go before x86 servers can match the performance and scalability features of systems that have been doing virtualisation for decades, such as IBM’s System z mainframes.
On the other hand, paravirtualisation is also important. There are significant gains from having guest operating systems that know they are guests and can request services from the host server or its assistance in performing demanding operations.
Some of the possible improvements will probably remain limited by competitive demands, such as the different virtual machine formats of all the main hypervisor vendors and their totally different driver architectures.
Historically, such issues have receded either when everyone becomes compatible with Microsoft’s formats or the functionality moves into hardware.
That is still some way off for x86 virtualisation but it is a rapidly developing area, so watch this space.
Intel takes the heat off power management • The Register
Intel has just rolled out some heavy-duty power management software that offers policy-based power management across up to 10,000 nodes, and it plugs in to major management software such as VMware’s vSphere. Pretty cool.
Intel has a new piece of software called Data Center Manager (DCM), which provides power management ranging from the individual server level up to the bird’s-eye-view of your entire data centre.
There are still few supported devices for this fledgling product. Given Intel’s heavy push for cloud computing, I expect that to change very soon. These are the systems supporting DCM out of the gate. Next year will look quite different.
DCM is impressive stuff. It can manage up to 10,000 nodes, 5,000 nodes per management server, with two tiers of management servers being the current maximum. It offers monitoring, trend analysis and fine-grained control.
Policy decisions
The entire thing is policy-based; you can set schedules for different levels of power management or base it on triggers of various sorts.
Think back to how impressed you were at Google moving virtual machines and workloads around in response to thermal excursions in its data centres. DCM is very nearly a pre-canned version of this technology.
While DCM does come with a default user interface, it doesn’t provide that sort of flexibility entirely on its own. Then again, it doesn’t have to. DCM is intended to serve as a plug-in for larger data centre management software.
There is a vSphere plug-in and support for a wide array of data centre management vendors stretching from Modius’ DCIM to Visual Data Center.
Play by the rules
DCM exposes an easy-to-use API that truly shows the flexibility of policy-based power management. When you combine the fine-grained control of compliant equipment with this software, you get the ability to do remarkable things.
At the moment, you might need to do some scripting to get the really neat stuff going, but it is all there.
The fact that this is even available should point to how simple the future of this product will be. Regularly scheduled air conditioner maintenance? No problem. Set a rule: limit all servers in row A to 50 per cent power utilisation for two hours on the 15th of each month.
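As a rough illustration of what such a rule amounts to, here is a minimal sketch of a scheduled power cap. It does not use DCM’s actual API, which the article does not show: the `PowerCapRule` class, the `power_cap` function and the 09:00 start time are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class PowerCapRule:
    """Hypothetical rule: cap one row of servers during a monthly window."""
    row: str
    cap_percent: int
    day_of_month: int
    start: time
    duration: timedelta

    def applies(self, server_row: str, now: datetime) -> bool:
        # Windows that run past midnight are not handled in this sketch.
        if server_row != self.row or now.day != self.day_of_month:
            return False
        window_start = datetime.combine(now.date(), self.start)
        return window_start <= now < window_start + self.duration


def power_cap(rules, server_row, now, default=100):
    """Return the strictest cap that applies, or the default if none does."""
    caps = [rule.cap_percent for rule in rules if rule.applies(server_row, now)]
    return min(caps, default=default)


# The rule from the text; the 09:00 start is an assumed value.
maintenance_rule = PowerCapRule("A", 50, 15, time(9, 0), timedelta(hours=2))
print(power_cap([maintenance_rule], "A", datetime(2024, 1, 15, 10, 0)))
```

Taking the minimum of all matching caps means that overlapping rules resolve to the most conservative limit, which seems the safer default for maintenance windows.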
More management tools vendors will come on board supporting DCM. Current vendors will start doing things with the API that even Intel hadn’t thought of.
I am interested to see how long it will take before it is supported by Puppet.
The potential is huge. Imagine being able to integrate with software or smart meters that keep track of the spot-price of electricity.
You could compare this to user demand as measured by other software and generate complex sets of dynamic policies that alter server availability and power consumption based on the balance between end-user demand and electricity costs.
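One very simple form such a dynamic policy could take is sketched below. Everything in it is an assumption for illustration: the function name, the price ceiling, the 40 per cent floor and the idea of expressing demand as a fraction of capacity. None of it comes from DCM itself.

```python
def choose_power_cap(spot_price, demand, price_ceiling=0.20, min_cap=40):
    """Pick a power cap (per cent) from electricity price and user demand.

    spot_price    -- current price per kWh, in any consistent currency
    demand        -- fraction of total capacity users currently need (0.0 to 1.0)
    price_ceiling -- assumed price above which capping becomes worthwhile
    min_cap       -- assumed floor so that servers are never starved of power
    """
    if not 0.0 <= demand <= 1.0:
        raise ValueError("demand must be between 0.0 and 1.0")
    if spot_price <= price_ceiling:
        return 100  # cheap power: run uncapped
    # Expensive power: supply only what demand requires, but never below the floor.
    return max(min_cap, round(demand * 100))


print(choose_power_cap(spot_price=0.30, demand=0.75))
```

A real policy would presumably smooth both inputs over time to avoid flapping between caps as prices tick up and down.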
DCM enables data centres that are aware not only of their physical environment but of economic or digital issues as well. Tie this in to more advanced software and you could probably script sets of triggers to migrate services to the public cloud when needed, and back to the private cloud again.
Green cloud
This is the beginning of a truly environmentally aware hybrid cloud.
The logging features here are important as well. DCM keeps a full year’s worth of history on hand. Combined with its trending and analytics, it gives you an impressive overview of your power utilisation – critical data for planning refreshes or gauging remaining capacity.
The best part about DCM is that it is agent-less. No software is required on the physical systems; it takes advantage of various power management technologies built into the hardware.
If you are overhauling a data centre or planning a new one, take the time to give DCM a look. Certainly it requires some of the latest-greatest gear to work properly, but with the right bits in place it does a fantastic job of managing power consumption.
Considering that power is rapidly becoming the largest expense of any data centre, Intel’s DCM will more than pay for itself.
Intel mad for power, but stacked-up dies keep MELTING! • The Register
Intel shares its thoughts on how it will continue to push Moore’s law over the next ten years. They’re looking to build processors that break the exaflop barrier by 2018.
Moore’s Law is going to be good for at least another decade, according to chip-maker Intel.
“There’s always physical limits to everything,” Steve Pawlowski, senior fellow and head honcho on exascale research, told The Register at the European Research and Innovation Conference in Ireland.
“But you can always come up with clever ways… for example, there’s nothing that says I can’t take two dies and stack them on top of each other so I can grow Moore’s Law in the third dimension,” he added.
Stacking chips is not just a potentially clever way to push Moore’s Law; it is something Pawlowski is looking into right now as a way to get the world to high performance computing (HPC).
Intel has set itself the target of getting computers to be 100 times more powerful than they are today by 2018 - in other words, to achieve the exaflop computing level.
One way Pawlowski thinks Intel can do this is by improving memory data transfer.
“The biggest part of memory is getting information out of memory into the processor, moving data around. So in certain situations we’re looking at can you make memory and processor closer together by stacking them on top of each other,” he said.
“The bottom line is you’re reducing the length a signal has to go from A and B and by doing that you can make it faster. By stacking on top of the CPU die you can make wider memory interface and with width and speed, you get higher bandwidth,” he added.
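The width-times-speed point can be made concrete with a little arithmetic. The sketch below assumes peak bandwidth is simply bus width multiplied by transfer rate; the 64-bit and 512-bit widths and the transfer rate are illustrative figures, not numbers from Intel.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, transfers_per_second):
    """Peak bandwidth in GB/s = bytes moved per transfer x transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9


# Illustrative figures only: a conventional 64-bit memory channel versus a
# hypothetical 512-bit stacked interface running at the same transfer rate.
conventional = peak_bandwidth_gb_per_s(64, 1.6e9)
stacked = peak_bandwidth_gb_per_s(512, 1.6e9)
print(conventional, stacked, stacked / conventional)
```

At a fixed transfer rate the gain is exactly the ratio of the widths, so any speed increase from the shorter signal path would multiply on top of that.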
However, stacking them up that way has an unfortunate side-effect: the power needed to pull data out of the memory and into the processor has had the nasty habit of melting the die in lab experiments.
Whether it eventually works that way or not, stacking is the sort of idea Pawlowski reckons is needed to achieve exascale, not new materials.
“It’s architecturally in how we build the devices that we need innovation,” he said.
“Every time I hear this technology is going to run out of gas in ten years and we’re going to need something new, there’s always some new way of engineering or some new creative way to use the material that gives you a longer life.”
That’s why, even though he thinks new wonder material graphene is interesting, he doesn’t think it’s the way to go for future chips.
“I’m kind of interested in it for a number of reasons, but is it going to take over everything and be the new technology that’s going to drive us to exascale? I don’t believe it,” he said. “It’s my opinion, but I think silicon is still going to be the underlying technology that’s going to take us well into the next decade.”
Which is about how long Pawlowski will be drawn into forecasting that Moore’s Law will last, although he will say: “I’m of the belief that if you give an engineer a problem, they’ll solve it.”
Facebook's Open Compute friends ODCA IT union • The Register
The Open Data Center Alliance, started by Intel last October, and the Open Compute Project, founded by Facebook, are coming together. The initial results look interesting.
What do you get when you cross a consortium of big data center customers and IT suppliers (the Open Data Center Alliance started by Intel last October) with an open source server and data center design project started by a hyperscale Web company (the Open Compute Project founded by Facebook)?
We don’t know, but it looks like someone is going to lose some margins.
At the Intel Developer Forum in San Francisco on Wednesday, the ODCA said that the collective was building momentum – now with over 300 members, up from 70 at its founding nearly a year ago – with a collective IT spend of more than $100bn, according to Marvin Wheeler, who chairs the ODCA board and was president and COO of hosting company Terremark before telecom giant and cloud-wannabe Verizon scarfed it for $1.4bn in January.
At the IDF event, the ODCA was showing off six “usage models”, something akin to a reference architecture for handling specific cloud workloads, including cloud on-boarding (moving VMs from one hypervisor in one data center to another brand of hypervisor in another) demonstrated by Citrix Systems. EMC and Intel teamed up to show a secure VM on-boarding scenario using Intel’s TXT “trusted extensions” for Core and Xeon chips and VPLEX Metro data replication.
Cloud interoperability based on CloudForms was another proof-of-concept demonstrated at IDF, and Dell and JouleX, a maker of power management software, put together a POC that showed how the JouleX tools could be used to track and reduce energy consumption on a rack of PowerEdge C dense servers – the kind that Dell wants to sell to corporate customers, particularly now that Facebook has unfriended Dell and is building its own servers – or, rather, having Taiwanese IT manufacturing giant Quanta Computer do it and Synnex do the rack integration for Facebook’s shiny new Prineville, Oregon data center.
Speaking of Facebook, the ODCA has teamed up with the Open Compute Project, which is open sourcing both the design of the Prineville data center and the motherboards and server nodes used to run Facebook’s applications as well as the related power and cooling systems used to feed and pamper those servers.
Wheeler tells El Reg that the ODCA and the Open Compute Project will work together, initially by having requirements as expressed by ODCA members fed into the Open Compute Project. Following this, Open Compute contributors are expected to come up with designs that meet those needs (when they feel like contributing, of course), and these hardware designs can then become part of ODCA usage models that other people can use as they deploy particular kinds of infrastructure to support specific workloads.
“This is end users telling vendors what they want the cloud to be,” says Wheeler. “This is the end user community pushing back.”
Well, yes and no. The ODCA has a lot of IT vendors that are helping steer things, and you’ll notice who is putting together the usage models, right?
No more tier ones?
The most interesting thing about the cross-coupling of the ODCA and the Open Compute Project will be the establishment of detailed reference architectures for specific workloads that involve systems that are not built by tier-one server makers – at least not yet.
“The HPs and the Dells of the world can innovate on top of this,” explains Frank Frankovsky, a founding member of the Open Compute Project and director of technical operations at Facebook. “They may not be innovating at the box level any more, but I don’t think it stifles their innovation.”
At Facebook, the Open Compute servers are based on motherboards specced by Facebook that are manufactured by Quanta, which in turn ships the completed boxes back to California to Synnex, which tests the machines and plunks them into racks. These completed racks are shipped off to Prineville and rolled into the data center as needed.
Frankovsky is not at liberty to say how many of the Open Compute boxes have been built to date, but does say that there are tens of thousands of these machines installed in Prineville. And presumably the next generation of Open Compute servers will go into the new Facebook data center being built in North Carolina, which should come online in the next few months, according to Frankovsky. These Open Compute 2 platforms will be based on half-width motherboards using Opteron 6200 processors from Advanced Micro Devices and Xeon E5 processors from Intel.
Over the long haul, Frankovsky expects that anywhere from three to five distributors will step up to manufacture Open Compute server designs – Foxconn and Delta Electronics are obvious possibilities.
He also said that, just as software projects sometimes branch, Open Compute designs will likely branch (but not fork) for specific use cases, and that tier-one vendors such as HP and Dell might play in this way.
The other thing that Wheeler expects to happen is that countries with import duties to protect their IT industries will encourage their manufacturers to pick up the Open Compute designs and make the machines for local customers. They would not make the motherboards in, say, Brazil, but import them from Taiwan and then bend the metal around them and integrate the components indigenously.
Don’t be surprised if Intel makes Open Compute servers at some point, too. It already makes a large number of servers for cloud customers in China, according to rumors going around IDF this week.
Intel Rewards Itanium Loyalists With Performance And RAS Features In Poulson | Forrester Blogs
Intel presented the latest Itanium chip, Poulson, the other day. There was not a lot of fanfare, particularly after the embarrassing HP/Oracle dispute, but according to Forrester it is a nice step forward.
Intel Raises the Curtain on Poulson
At the Hot Chips conference last week, Intel disclosed additional details about the upcoming Poulson Itanium CPU due for shipment early next year. For Itanium loyalists (essentially committed HP-UX customers) the disclosures are a ray of sunshine among the gloomy news that has been the lot of Itanium devotees recently.
Poulson will bring several significant improvements to Itanium in both performance and reliability. On the performance side, we have significant improvements on several fronts:
- Process – Poulson will be manufactured with the same 32 nm semiconductor process that will (at least for a while) be driving the high-end Xeon processors. This is goodness all around – performance will improve and Intel now can load its latest production lines more efficiently.
- More cores and parallelism – Poulson will be an 8-core processor with a whopping 54 MB of on-chip cache, and Intel has doubled the width of the multi-issue instruction pipeline, from 6 to 12 instructions. Together with improved hyperthreading, 2X the cores and 2X the potential instructions executed per clock cycle by each core hint at impressive performance gains.
- Architecture and instruction tweaks – Intel has added additional instructions based on analysis of workloads. This kind of tuning of processor architectures seldom results in major gains in performance, but every small increment helps.
- Instruction replay – Beyond performance, Intel has added the ability to re-execute a failed instruction. This is a powerful capability for enhancing reliability, and a first for Intel. Instruction replay allows a failed instruction to be retried without the overhead of refetching all of the data, and is triggered by a number of low-level failures. The attraction of this technology is that it happens at a very low level of the hardware, and is completely hidden from the OS and application software. This feature will add to the already impressive reliability of HP-UX running on Itanium-based systems.
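The "2X cores and 2X instructions per clock" claim in the list above compounds, as a quick calculation shows. This is a sketch of theoretical peak issue rate only, not a performance prediction; the predecessor’s core count of 4 is inferred from the "2X cores" wording rather than stated outright.

```python
def peak_issue_per_clock(cores, issue_width):
    """Upper bound on instructions issued per clock across the whole chip."""
    return cores * issue_width


# Predecessor figures: 6-wide issue is from the text, 4 cores is inferred from "2X cores".
previous = peak_issue_per_clock(cores=4, issue_width=6)
poulson = peak_issue_per_clock(cores=8, issue_width=12)
print(previous, poulson, poulson / previous)
```

Sustained throughput will be lower than this bound, since it depends on how well compilers and workloads can fill the wider pipeline.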
Does Anyone Care?
With the bloody divorce of HP and Oracle, with Intel as an embarrassed spectator, does anyone still care about Itanium? Simple answer – the thousands of Itanium customers who are not running Oracle. While the Oracle breakup will definitely hurt HP’s Itanium business, it will not kill it outright, and HP-UX running on Itanium systems will still remain a highly reliable platform for other users. These users will be rewarded with an Itanium platform that will have improved reliability as well as an impressive boost in performance from today’s offerings. While HP has not made any public statements about systems or availability, there is nothing immediately evident about the current Superdome II architecture that makes it inappropriate for Poulson.
Beyond Poulson the future gets a bit fuzzy. Intel has one visible generation of Itanium beyond Poulson, named Kittson, which should appear in a 22 nm process in approximately 2014. Intel is releasing no details beyond that, and my opinion is that Kittson and one subsequent performance bump might be the end of the line for Itanium. That would mean its evolution may stop in or about 2016, leaving plenty of time for HP and its users to come to terms with HP-UX on an x86 platform based on x86 CPUs that will have gone through approximately three additional product cycles by then.