Tilera preps many-cored Gx chips for March launch • The Register
It’s always nice to see an alternative to x86. Tilera is producing a massively multi-core RISC-based architecture that looks very interesting and has already attracted a number of customers.
Upstart multicore RISC chip maker Tilera is timing the launch of its third generation of Tile processors to rain a little on Intel’s forthcoming parade, and to try to blunt all of the excitement that is building for ARM-based alternatives for servers.
Tilera will today begin sampling its Tile-Gx series of processors. As El Reg suspected back in June 2011, when Tilera announced it was actually launching three different lines of Tile-Gx chips (Gx3000s for servers, Gx5000s for heavy media processing, and Gx8000s for network equipment makers), all three lines are based on Gx8000 chips with certain features deactivated and different pricing.
That means Tilera can offer variants of the chips with 16, 36, 64, and 100 cores and only have to do four chip layouts instead of as many as a dozen. It is the full-on Tile Gx8000 chips with 16 and 36 cores that are in fact sampling now at 1.2GHz, Bob Doud, director of marketing at the upstart chippery, tells El Reg.
All three generations of Tilera processors have the same idea behind them: use simple RISC cores tuned for Linux infrastructure workloads, put lots of them on a chip, and link them together using an on-chip mesh network that makes all of those cores look like a single, monster, multithreaded processor to the Linux kernel…
Tilera does not do SMP to increase the performance of a server node, but rather uses the on-chip mesh to build a bigger socket image with more physical threads.
Each core on the new Tile-Gx chip has three instruction threads and has 32KB of L1 data cache and 32KB of L1 instruction cache, and also has a 256KB L2 cache; the mesh network is used to link those L1 and L2 caches into a single, coherent L3 cache shared by all the cores on the chip - so the top-end, 100-core variant of the Tile-Gx chip has 32MB of total cache.
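As a rough check on that 32MB figure, the per-core cache sizes quoted above can simply be multiplied out. The sketch below is only back-of-the-envelope arithmetic: the function name is made up for illustration, and it assumes every core carries the same 32KB + 32KB + 256KB of cache and that the shared L3 is nothing more than the sum of those caches.

```python
# Back-of-the-envelope check of the aggregate cache figure quoted above.
# Per-core sizes come from the article; total_cache_kb is a hypothetical
# helper, not anything from Tilera's tools.

KB_PER_MB = 1024  # assumption: binary megabytes; the article does not say


def total_cache_kb(cores, l1d_kb=32, l1i_kb=32, l2_kb=256):
    """Sum the L1 data, L1 instruction and L2 caches across all cores, in KB."""
    return cores * (l1d_kb + l1i_kb + l2_kb)


if __name__ == "__main__":
    # Core counts of the four announced Tile-Gx variants.
    for cores in (16, 36, 64, 100):
        kb = total_cache_kb(cores)
        print(f"{cores:>3} cores: {kb:>6} KB (about {kb / KB_PER_MB:.2f} MB)")
```

For the 100-core part this gives 32,000KB, which is where the rounded 32MB comes from (a little over 31MB if a megabyte is counted as 1,024KB).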
The Tile-Gx also has math instructions that allow a floating point operation to be done in five cycles instead of hundreds of cycles when done in software, and believe it or not, this is important for some hyperscale Web applications built using PHP.
Doud says that the ramp for the Tilera chips has been pretty steep, with over 80 engagements with system and network equipment vendors of all colors and stripes, and 20 design wins where the customer has committed to use a Tile processor. Embedded system maker Mercury Computer and video streaming equipment maker Harmonic have gone public admitting that they are using Tile chips in their gear.
Ihab Bishara, director of cloud computing applications at Tilera, says that three of the largest hyperscale data centers in the world have deployed Tile-based servers. With the Tile-Gx line, the 64-bitness and floating point instructions are attracting more interest, with a number of OEMs and ODMs placing orders for the chips even before they were sampling - even though the Tile chips have their own proprietary interconnect.
“Our view is, it is our ISA, get over it,” says Doud, and for the Linux crowd that compiles its own applications anyway, he makes a good point. (Jumping to ARM chips will require a recompile, too, after all.) “Once you have a chip that is supporting C, C++, Java, and PHP and you’re running Linux, it doesn’t matter. People are not writing assembler programs.”
Well, there are probably a few card-wallopers out there who are in mainframeland.
Tilera is putting the finishing touches on a Java JIT compiler, which should be done by the end of the first quarter, according to Bishara – and just in time to take on big Java workloads like Hadoop. The Tilera Linux stack is based on a derivative of CentOS that has around 2,000 packages ported over to run natively on the chips.
Tilera doesn’t just expect to sell Tile-Gx processors as the main engines inside of systems. In some cases, customers will want to use them as offload engines. To that end, the company has cooked up an evaluation adapter card that slips into a PCI-Express 1.0 or 2.0 slot and runs the Tilera Linux environment…
If you are really serious about putting the Tile-Gx chips through the server paces, Tilera will get you what it calls its Liberty-Gx platform, which crams four of these microserver boards into a single 1U rack machine.
The Tile-Gx processors sampled last July in limited quantities to selected partners, and alpha evaluation boards shipped in September. The company racked up ten design wins for the chip by November and has decided to “open up the flood gates” and do much more sampling in February with volume shipments to begin in March. The full-on Gx8016 is expected to cost around $450, with the Gx8036 at around $650. Presumably the parts aimed at servers will cost less, since some features are deactivated.
The 64-core and 100-core variants of the Tile-Gx chips will sample in late 2012 and ship sometime in the first half of 2013, according to Doud, and the company is on track with its 200-core “Stratton” chips with a shrink to 28 nanometers.
Bishara says that Tilera is not threatened by ARM contenders in the server racket, such as Calxeda with its 32-bit ARMv7 variant, called EnergyCore, or Applied Micro Circuits with its 64-bit ARMv8 variant, called X-Gene.
“We’re here today shipping a 64-bit processor core and we are what looks like two years ahead of ARM,” says Bishara. “The architecture of the Tile-Gx is aligned to the workload and gives one server node per chip rather than a sea of wimpy nodes not acting in a cache coherent manner. We have been in this market for two years now and we know what hurts in data centers and what works. And 32-bit ARM just is not going to cut it. Applied Micro is doing their own core, and that adds a lot of risks.”
Tilera should know a thing or two about that. It didn’t just do its own cores, but its own instruction set and what really is a system on a chip.
No one knows how this will turn out, with server makers just trying to make a buck and take as few risks as possible. But one thing is for sure. Intel and AMD have a lot more problems than just each other from here forward.
Revisiting the 2011 Predictions, Part 2 – tecosystems
More from O’Grady.
ARM Will Emerge as a Server Player
Whether they will ultimately emerge as a credible mainstream alternative remains to be seen, but ARM is indeed emerging as a server player. Though virtually all of them discuss it privately, HP (via Calxeda) this year became the first major systems player to publicly detail plans for ARM servers – perhaps banking on the fact that the upcoming A15 processor is more server friendly…
Intel is predictably skeptical of ARM’s viability in its core markets, with CEO Paul Otellini bluntly dismissive: “It ain’t gonna work.” And while it certainly hasn’t proven to work thus far, and there are real architectural and software issues to address, the power profile continues to pique the interest of server manufacturers and customers alike. Even marginal power savings mean real dollars at scale.
I count this as a hit…
The NoSQL Marketplace Will Experience Consolidation
The merger of CouchOne and Membase into CouchBase in February provided some evidence that the long anticipated wave of consolidation in this space was beginning, but the balance of the year provided little evidence to support this aside from the acceleration of a few individual players such as MongoDB [coverage]. I remain convinced that the marketplace will be unable to sustain the current volume of would be commercial entities, but from our conversations with both those in a position to potentially impact consolidation and those interested in partnering with various NoSQL players, it is clear that consolidation will depend on clearer winners and losers to proceed. This should occur in 2012.
I’ll count this as a push in light of the CouchBase merger which subtracted one player but otherwise saw very few exits.
NoSQL Will Look More Like Pro-SQL
The implicit rejection of the Structured Query Language in the NoSQL term is ironic in light of the fact that a variety of projects are now adding similar features. Continuing in the proud tradition of Hive and Pig, which provide query language interfaces to Hadoop, DataStax announced CQL in June while CouchBase and SQLite announced UnQL in July [coverage].
Whether we’ll see a unified interface or a variety of engine-specific implementations as Alex Popescu would prefer remains to be seen, but query languages will be coming to the majority of NoSQL stores one way or another.
I count this as a hit.
Open Source of Non-Strategic Infrastructure Assets Will Increase
From Twitter open sourcing the Storm assets it acquired via the BackType transaction to the New York Stock Exchange’s donation of OpenMAMA to the Linux Foundation, it is increasingly clear even to traditional parties that the release of non-strategic code as open source has multiple benefits. GitHub’s Tom Preston-Werner’s list of same is difficult to improve upon:
- “Open sourcing code is great advertising for you and your company.”
- “If your code is popular enough…you will have created a force multiplier that helps you get more work done faster and cheaper. “
- “When you open source useful code, you attract talent.”
- “If you’re hiring, the best technical interview possible is the one you don’t have to do because the candidate is already kicking ass on one of your open source projects.”
- “Dedication to open source code is an amazingly effective way to retain that talent.”
- “[Assuming code will be open sourced] leads to effortless modularization.”
- “By getting code out in the public we can drastically reduce duplication of effort.”
- “It’s the right thing to do.”
It may or may not be beneficial to open source core strategic assets, as VMware did with Cloud Foundry, but it is increasingly hard to justify protecting those that are purely tactical in nature. The benefits in many if not most cases will outweigh the costs, which is why we’re seeing an increase in contributions to open source projects.
The data from the annual Eclipse surveys is one example of this. If we examine the percentage of organizations that contribute back to open source versus those that do not from 2007 to 2011, it is clear that comfort levels with open source generally are rising.
I count this as a hit.
Xen hypervisor ported to ARM chips • The Register
Some big advances in developing hypervisors for ARM-based server environments.
The Mobile Virtual Platform (MVP) hypervisor that VMware sells for smartphones and fondleslabs running the Android variant of Linux on ARM RISC processors is getting some competition.
Intrepid techies are working away on two different implementations of the open source Xen hypervisor for ARM chips, and another group is putting together a KVM hypervisor port as well.
Unlike the MVP hypervisor, which is a type two or hosted virtualizer, these Xen and KVM hypervisors are type one or bare-metal hypervisors that will be appealing to server makers pondering the possibilities of ARM-based machines. VMware rebranded MVP, which it got through its acquisition of Trango three years ago, as VMware Horizon Mobile back in August, but everyone still calls it MVP.
The software is being demonstrated on schizophones that have a primary personality running in the host Android environment and a guest phone personality in the guest Android image running inside MVP.
MVP is not open source, and while Trango started out creating a bare-metal hypervisor for ARM devices, VMware backed off to type two because it was too difficult to keep up with the rapid pace of change in the mobile chip market - in terms of certifying for new chips and other peripherals as they become available from the legions of phone makers. Such certifications would slow down product introductions and eat into profits; it is much easier to let Android itself talk to these new chips and devices and have MVP present the same old virtual machine to the hosted Android apps.
A few weeks ago, Stefano Stabellini, a senior software engineer from Citrix Systems, and compatriots Tim Deegan and Ian Campbell, started hacking together a proof-of-concept Xen hypervisor for ARM machines, and on Tuesday Stabellini announced on the Linux Kernel Mailing List that they have put together a Xen hypervisor port to ARM’s Cortex-A15 reference chip. ARM Holdings, the company behind the ARM architecture, gave the Xen-on-ARM project an emulation board to do their development and testing.
“We started the work less than three months ago, but the port is already capable of booting a Linux 3.0 based virtual machine (dom0) up to a shell prompt on an ARM Architecture Envelope Model, configured to emulate an A15-based Versatile Express,” Stabellini wrote. “Now we are looking forward to porting the tools and running multiple guests. The code requires virtualization, LPAE and GIC support and therefore it won’t be able to run on anything older than a Cortex-A15.”
Stabellini says that the plan is to support other chips that are compliant with the ARMv7 architecture and that the project is looking ahead to support Xen on the 64-bit ARMv8 architecture, the initial specs of which were announced last month.
He gave a shout out to the Xen ARM Project being championed by Samsung Electronics, which is led by Sang-bum Suh from the chip and electronics maker. The Samsung version of Xen for ARM supports ARMv5, ARMv6, and ARMv7 processors – the earlier ones not sporting the virtualization extensions in the ARMv7 chips.
The Samsung approach uses paravirtualization – operating systems know they are running atop fake hardware and they stop acting like pigs and let the hypervisor manage their requests – and that means the operating systems have to be tweaked to allow for the lying that goes on between the guests and the hypervisor in such approaches. Sometimes, you can just paravirtualize device drivers instead of a whole guest operating system, and this is exactly what the Xen on Linux project started by Stabellini has done, thus limiting the number of changes to the Linux operating system to run atop this Xen implementation.
MVP and Xen are not the only hypervisors coming to ARM. Techies at Columbia University have put together the KVM/ARM project. While the Columbia project is working on propping up KVM on ARM Cortex-A15 processors and making use of the virtualization extensions in the chip, they have slightly modified Linux 2.6.27 and Linux 2.6.29 kernels running KVM atop ARMv6 and ARMv7 chips. The work on the virtualization extensions and on supporting multicore ARM chips still has yet to be done, according to the project’s page.
You can bet that if ARM servers suddenly look like they will be taking off, Red Hat and Canonical will kick in some help and move these Xen and KVM projects along. Server maker HP, which has launched the “Redstone” experimental server line using Calxeda’s new quad-core EnergyCore ARM chips, might also help out. Dell has been playing around with ARM servers, too, and might help with the hypervisor efforts as well.
Calxeda hurls EnergyCore ARM at server chip Goliaths • The Register
More details on Calxeda’s latest ARM server processor - a tweaked version of the Cortex-A9 that is specifically designed for enterprise computing.
Calxeda, formerly known as Smooth-Stone in reference to the river rock that the mythical David used in his sling to slay Goliath, doesn’t think the server racket can wait for the 64-bit ARMv8 architecture (announced late last week) to be designed and tested in the next few years.
And that is why Calxeda has spent the past several years tweaking the 32-bit ARMv7 core to come up with its own system-on-chip (SoC) and related interconnect fabric suitable for hyperscale parallel and distributed computing where nodes have only modest memory needs.
Today, Calxeda takes the wraps off its much-anticipated ARM server processor, which has been given the name EnergyCore in reference to the fact that like other ARM chips used in smartphones and tablets, it is focused on doing computing work for the least amount of energy possible. The idea is that by switching to ARM cores, Calxeda can do a unit of computing work burning less juice than an x86 chip from Intel or Advanced Micro Devices, the Power chip from IBM, the Sparc T from Oracle, or the Itanium from Intel.
The EnergyCore ECX-1000 Series chips, as the first EnergyCores will be called, are based on the Cortex-A9 designs from ARM Holdings. The ECX-1000 chips are in fact based on a quad-core implementation of the Cortex-A9 chip, but like other server implementations of the ARM chips, such as the X-Gene announced last week by Applied Micro Circuits, there is a lot more to these chips than the core.
There is a slew of other stuff, including a fabric interconnect and a management controller that would otherwise be an add-on for the system board, on the chip. One big difference between the EnergyCore and X-Gene is that the latter is based on the 64-bit ARMv8 and won’t ship until the second half of next year if all goes well at Applied Micro. And that will be early silicon. It remains to be seen when server makers will pick up the X-Gene chip and actually get it into servers, but that might take until 2013.
Calxeda thinks there’s money to be made now, and for some workloads, the EnergyCore chips are going to fit the power bill. “ARM does for the processor world what Linux did for the operating system world,” Karl Freund, vice president of marketing at Calxeda, tells El Reg. “It opens up the chip market to a whole lot of innovation.”
The ECX-1000 chips are implemented in a 40 nanometer process and are manufactured by Taiwan Semiconductor Manufacturing Corp, which seems to be the foundry of choice for server chip makers that don’t have their own wafer baking facilities. Each Cortex-A9 core runs at 1.1GHz or 1.4GHz and includes a scalar floating point unit that can do single-precision or double-precision operations as well as a NEON SIMD media processing unit that has 64-bit and 128-bit registers and that can also do floating point ops…
Calxeda is not trying to do cache coherency over one to four ECX-1000 sockets on a system board or across the 1,024 possible system boards that the integrated fabric switch scales to. And it is not particularly worried about latencies as parallel workloads pass data around this switch fabric.
“If you look at the workloads we are aiming at, they are not latency sensitive,” says Freund. This includes offline analytics like MapReduce big data chewing, Web applications, middleware and Memcached, and storage and file serving. It would be interesting to see how a network of these puppies runs a shared-nothing database cluster.
The topology of the EnergyCore Fabric Switch can be changed on the fly from one style to another and the settings are stored on flash memory on the chip package. Bandwidth can be dynamically allocated in 1Gb/sec, 2.5Gb/sec, 5Gb/sec and 10Gb/sec virtual pipe sizes by the fabric switch and presents two Ethernet ports to the operating system.
The idea is to eliminate the top-of-rack Layer 2 switch that is typically used in a cluster these days with the on-chip switch. While it is possible to build a cluster with 4,096 ECX-1000 chips by using 10Gb/sec XAUI cables and the four ports coming off the Calxeda board to cross link them all, Freund says that most companies will put two real 10GE switches in a rack and use these like end-of-row switches and only lash together 72 four-socket nodes (about a half rack of servers) with the integrated fabric.
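The scaling figures in the last few paragraphs can be tied together with a little arithmetic. The sketch below is illustrative only: the constants are taken from the article (up to four sockets per board, quad-core chips, a fabric that scales to 1,024 boards), while the function name and the assumption that every board is fully populated are ours.

```python
# Illustrative arithmetic for the EnergyCore fabric sizes quoted above.
# Constants come from the article; cluster_size is a hypothetical helper.

SOCKETS_PER_BOARD = 4  # "one to four ECX-1000 sockets on a system board"
CORES_PER_SOCKET = 4   # each ECX-1000 is a quad-core Cortex-A9 design
MAX_BOARDS = 1024      # system boards the integrated fabric switch scales to


def cluster_size(boards, sockets_per_board=SOCKETS_PER_BOARD):
    """Return (sockets, cores) for a fabric of fully populated boards."""
    sockets = boards * sockets_per_board
    return sockets, sockets * CORES_PER_SOCKET


if __name__ == "__main__":
    # Largest fabric described in the article.
    print("maximum fabric:", cluster_size(MAX_BOARDS))
    # The half-rack configuration Freund expects most customers to build.
    print("72 four-socket nodes:", cluster_size(72))
```

With four sockets on each of 1,024 boards the fabric tops out at the 4,096 chips mentioned above, while the more typical half-rack of 72 four-socket nodes works out to 288 chips.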
The other important thing about that EnergyCore Fabric Switch is that it has dynamic routing, which means you can get around congestion in a network of nodes and also, in conjunction with that management controller, optimize operations for latency or reduced power consumption – or boosted power consumption if you have some work that needs to run faster…
It’s hard to imagine server makers won’t be lining up to get their hands on these EnergyCards. And if they want to start selling them right away, Calxeda is good with that, too. The chips will be able to run Canonical’s Ubuntu and Red Hat’s Fedora Linuxes to start; Windows Server 8 could eventually get there if Microsoft gets interested. (So far, it has made no commitments, even with Windows 8 for clients and mobile phones getting an ARM port.)
The ECX-1000 chips will sample in late 2011, right on time, and volume shipments of the chips will start in the middle of 2012. That is about when Applied Micro will begin sampling its 64-bit X-Gene chip, which has its own crossbar switch but one that implements up to 128-way symmetric multiprocessing across 64 of its two-core chips.
It will be interesting to see how these two ARM server chips compete against each other as well as against other RISC chips and Intel Xeon and AMD Opteron processors. And remember, graphics chip maker Nvidia, which sells ARM-based SoCs for smartphones and tablets, has also promised ARM chips of its own designs for PCs and servers. Thank heavens for a little competition to keep Intel and AMD honest. Or whatever you might call it.
HP Embraces Calxeda ARM Architecture With "Project Moonshot" - New Hyperscale Business Unit Program | Forrester Blogs
HP has just announced the creation of a new hyperscale computing business unit. They are partnering with Calxeda to build an ecosystem around low-energy computing platforms composed of hundreds of ARM-based processing cores. These platforms will support emerging high-volume web and cloud workloads.
Emerging ARM server vendor Calxeda has been hinting for some time that they had a significant partnership announcement in the works, and while we didn’t necessarily not believe them, we hear a lot of claims from startups telling us to “stay tuned” for something big. Sometimes they pan out, sometimes they simply go away. But this morning Calxeda surpassed our expectations by unveiling just one major systems partner – but it just happens to be Hewlett Packard, which dominates the worldwide market for x86 servers.
At its core (unintended but not bad pun), the HP Hyperscale business unit Project Moonshot and Calxeda’s server technology are about improving the efficiency of web and cloud workloads, and promise improvements in excess of 90% in power efficiency and similar improvements in physical density compared with current x86 solutions. As I noted in my first post on ARM servers and other documents, even if these estimates turn out to be exaggerated, there is still a generous window within which to do much, much, better than current technologies. And workloads (such as memcache, Hadoop, static web servers) will be selected for their fit to this new platform, so the workloads that run on these new platforms will potentially come close to the cases quoted by HP and Calxeda.
The Program And New HP Business Unit
Officially, the announcement was HP announcing their new hyperscale business unit, based on the premise that very high-volume data centers will continue to proliferate, driven by massive continued increases in demand for web and cloud-based applications handling massive amounts of data, and that the trajectory of current systems technology with respect to power, cooling and density may be inadequate for emerging requirements.
HP’s hyperscale initiative consists of three major components:
- Discovery centers – Facilities where potential partners and customers can experiment with HP’s new hyperscale products. These centers are a vital component of this initiative because for the most part the workloads are new to this new platform, and potential users, HP, Calxeda and other partners all have a lot of learning to do about which applications really fit and how to tune them. In effect, they are a learning lab for both customers and suppliers. In many ways the knowledge gained in these centers is more valuable in the early phases than any product revenues that flow from them.
- Partner ecosystem – The usual suspects, software and hardware partners needed to facilitate the success of the new business unit. While HP was clear in their statements that they will be looking at multiple technologies, the entire announcement is centered around Calxeda, who is simultaneously announcing their EnergyCore server architecture and their accompanying EnergyCard reference architecture. In addition to a number of cloud-centric partners, the initial partnership roster includes Canonical (Ubuntu Linux), putting a fully functional Linux distribution in the plus column for the nascent ARM ecosystem.
- A product – The HP “Redstone” development system, based on the existing SL6500 system enclosure and Calxeda’s EnergyCore servers. The SL6500 is HP’s current dense rack offering for HPC and hyperscale web computing. Redstone swaps out the current x86 servers and substitutes modules with 18 Calxeda EnergyCard servers, cute little 10” x 3” cards that contain four complete SOC quad-core server nodes with integrated memory, management processor, scalable fabric and integrated switch and all network and SATA interfaces, with a 5W per server/20W per card maximum power draw. In total, each server tray packs 72 quad-core ARM servers in 1 RU equivalent of space. If you read my last post on Calxeda’s reference architecture, you can guess that the basic Calxeda architecture is indeed the core of the HP offering, but in keeping with Calxeda’s OEM business model, HP has added value around packaging, extending the SOC fabric topology & I/O, management and power/cooling technology, and will add further value as the line matures.
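The density and power numbers in the Redstone description follow directly from the per-card figures. A minimal sketch of that arithmetic is below; the constants are the ones quoted above, while the function name and the dictionary layout are ours. The wattage covers only the EnergyCard server nodes at their stated maximum, not fans, power supplies or anything else in the enclosure.

```python
# Illustrative arithmetic for one Redstone server tray, using the figures
# quoted above. tray_totals is a hypothetical helper for this sketch.

CARDS_PER_TRAY = 18  # EnergyCards per tray
NODES_PER_CARD = 4   # complete SoC server nodes per EnergyCard
CORES_PER_NODE = 4   # each node is a quad-core ARM server
WATTS_PER_NODE = 5   # stated maximum draw per server node


def tray_totals(cards=CARDS_PER_TRAY):
    """Return node count, core count and maximum node power for one tray."""
    nodes = cards * NODES_PER_CARD
    return {
        "nodes": nodes,
        "cores": nodes * CORES_PER_NODE,
        "max_watts": nodes * WATTS_PER_NODE,
    }


if __name__ == "__main__":
    print(tray_totals())
```

Eighteen cards of four nodes each gives the 72 servers per tray cited above, and at 5W per node (20W per card) the cards in a full tray would draw at most 360W.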
What Does It Mean
Reduced to its essence, this means that ARM servers are on the industry road map. Among the major affected constituents:
- System vendors - HP is clearly placing its bets on an emerging segment of the server market that cannot be met with current x86 CPU technology and current server designs. As the dominant x86 server vendor, HP is making an intelligent bet, and is now well positioned, with a solid first-mover advantage over its competitors, to capture new opportunities among its existing customer base as well as to capture others who might have gone away and patronized a new ARM server startup in search of ultimate energy efficiency. We might suspect that ARM has had discussions with Dell and IBM. More news to come?
- Customers – Now have (or will have in 2012 when HP officially ships the Redstone) a viable alternative CPU architecture to deploy for appropriate workloads, and I expect a lot of demand for evaluation units and access to the discovery centers. The potential to improve throughput per watt by such huge factors is incentive enough to seriously evaluate the new alternative, and my recommendation is to take a look and see how it works with your own applications.
- Intel and AMD – How about a giant wakeup call? I seriously doubt that this has caught them totally by surprise – the studied silence and nonchalance over the past year with which they responded to any inquiries about the impact of ARM competition had me convinced that they were actually quite worried. But being concerned in the abstract and having your No. 1 customer endorse not only your competition but an entirely new architecture are two different things entirely. Will this destroy Intel and AMD as server vendors? The thought is absolute nonsense. Aside from the large number of workloads that will not particularly benefit from the ARM model, both will respond with further focused R&D to continue to improve their power efficiency, leveraging their strengths in software compatibility and in Intel’s case, their market dominance.
My Takeaway
Not that it was exactly boring before, but the server world just got a whole lot more interesting, and customers just got a major early Christmas present – a whole new technology platform for emerging high-volume web and cloud workloads. All in all a very positive event for the industry and for us, the eventual beneficiaries of this technology.
HP hooks up with Calxeda to form server ARMy • The Register
Calxeda is beginning to reach out to big server and storage manufacturers to help push their ARM-based system-on-chip designs. HP is their first partnership.
HP is partnering with ARM licensee Calxeda to build energy-efficient micro-servers for large data centres, the WSJ reports.
Calxeda is producing 4-core, 32-bit, ARM-based system-on-chip (SOC) designs, developed from ARM’s Cortex A9. It says it can deliver a server node with a thermal envelope of less than 5 watts. In the summer it was designing an interconnect to link thousands of these things together. A 2U rack enclosure could hold 120 server nodes: that’s 480 cores.
The company is supporting an ecosystem of hardware and software partners with a focus on Linux.
From the WSJ report, it appears that HP has joined that ecosystem. It will have made the comparison between the Calxeda server nodes’ power consumption and that of Intel’s 20 watt and 15 watt Sandy Bridge Xeon processors, and a planned Xeon drawing less than 10 watts due next year. Calxeda’s thermal envelope is far superior for data centre operators needing to optimise for power efficiency and willing to forgo any advantages of Xeon’s 64-bit memory.
Intel’s Atom products also use more power than ARM chips and Intel intends to develop them to draw less than 10 watts as well. These are due next year. It looks as if HP and Calxeda servers will have a thermal envelope advantage – at least until 2013, when Intel may have Atom designs closer to Calxeda’s server node power characteristics, not that Intel is saying anything about that.
HP has 30 per cent or so server market share and is closely aligned with Intel, using its Itanium design for high-end servers. Intel believes that performance is more important to the broad mass of its customers than power efficiency but, even so, is developing more power-efficient chips. However, the x86 architecture is a power hog.
Calxeda is said to be talking to other server manufacturers and storage vendors and we might expect more ARM-powered server and storage controller news to be revealed in the coming months.
ARM vet: The CPU's future is threatened • The Register
One of the original innovators behind the ARM platform discusses troubled waters ahead for the microprocessor industry - and it’s not simply a matter of the approaching limits of Moore’s law. The growing complexity of computing is making today’s “hard” problems far harder than the “hard” problems of the past. At the same time, growing specialization in the market is actually shrinking the number of players who can tackle these sorts of problems and pushing out the incentive to do so.
ARM’s employee number 16 has witnessed a steady stream of technological advances since he joined that chip-design company in 1991, but he now sees major turbulence on the horizon.
“I don’t think the future is going to be quite like the past,” Simon Segars, EVP and head of ARM’s Physical IP Division, told his keynote audience on Thursday at the Hot Chips conference at Stanford University, just north of Silicon Valley.
“There may be trouble ahead.”
The microprocessor industry has enjoyed an almost unbroken streak of improvements, Segars said, citing advances in silicon manufacturing techniques, power reduction, and gadget-size and gadget-cost shrinkage – he brought along a 1983, $3,995 Motorola DynaTAC as a prop.
But the landscape is changing. The low-hanging fruit has been picked, and a new way of thinking will be needed to provide the world with the squillions of low-cost, low-power microprocessors that the increasingly mobile computing ecosystem requires – not to mention the everything-connected world described by the current buzz-phrase: “The internet of things”.
Harkening back to when he joined ARM, Segars said: “2G, back in the early 90s, was a hard problem. It was solved with a general-purpose processor, DSP, and a bit of control logic, but essentially it was a programmable thing. It was hard then – but by today’s standards that was a complete walk in the park.”
He wasn’t merely indulging in “Hey you kids, get off my lawn!” old-guy nostalgia. He had a point to make about increasing silicon complexity – and he had figures to back it up: “A 4G modem,” he said, “which is going to deliver about 100X the bandwidth … is going to be about 500 times more complex than a 2G solution.”
The way that the 4G-modem problem will be solved, Segars said, will be by throwing a ton of dedicated DSP processing engines at it – which will, of course, require a lot of silicon real estate.
“But that’s not so bad,” he said, “because silicon’s being scaled the whole time. But it’s going to eat a lot of power, and power is the real problem.”
ARM is a mobile-processor company, and mobile processors run on batteries – and Segars said that the power required to juice increasingly complex silicon is a system-level challenge. “The reason for that,” he said, “is because batteries are pretty rubbish, really.”
As silicon technologies have improved in comparative leaps and bounds, batteries haven’t. “Historically,” Segars said, “battery power has grown about 10 or 11 per cent per year, which unfortunately is not very well-matched with Moore’s law.”
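To put Segars’ mismatch in numbers, here is a rough back-of-the-envelope sketch in Python. The 10 and 11 per cent annual battery figures come from his quote; the two-year doubling period used for Moore’s law and the ten-year window are assumptions made for illustration, not figures from the talk.

```python
def compound_growth(rate_per_year: float, years: float) -> float:
    """Growth multiple after compounding at a fixed annual rate."""
    return (1.0 + rate_per_year) ** years


def doubling_growth(doubling_period_years: float, years: float) -> float:
    """Growth multiple for a quantity that doubles every fixed period."""
    return 2.0 ** (years / doubling_period_years)


YEARS = 10                    # assumed comparison window
MOORE_DOUBLING_YEARS = 2.0    # assumed doubling period, not from the talk

battery_low = compound_growth(0.10, YEARS)    # roughly 2.6x
battery_high = compound_growth(0.11, YEARS)   # roughly 2.8x
transistors = doubling_growth(MOORE_DOUBLING_YEARS, YEARS)  # 32x

print(f"Battery capacity after {YEARS} years: {battery_low:.2f}x to {battery_high:.2f}x")
print(f"Transistor budget after {YEARS} years: {transistors:.0f}x")
```

Under those assumptions, a decade gives the chip designer 32 times the transistors but less than three times the battery to feed them.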
Moore’s law on trial
But as big a problem as batteries are in the mobile market, there are much more fundamental challenges that will make the future of the microprocessor market different from its past.
For one, the complexity of chip design and the intricacies of the physics involved is increasing, making design a much riskier and more demanding process. “And you really, really need to worry about that,” Segars said.
The reason for that worry is risk and cost. “The cost of your tape-out is going to be astronomical,” he said. “When you’ve written that check for a million or two million dollars for your [chip-making lithography] masks, you want to hope that chip works. So the effort going into validating and verifying a design has gone up by orders of magnitude.”
Another challenge that Segars sees is caused by the increasing stratification of the semiconductor industry – a development that has brought many benefits, but which also has its downside.
“When people first started building semiconductor devices,” he said, “they did everything themselves in fully vertically integrated companies. People had fabs, product design, manufacturing, case design – they did the whole thing themselves.”
That has changed over the years – mostly for the better, from Segars’ point of view – and now the industry is filled with companies specializing in various areas such as design, IP, electronic design automation (EDA), packaging, chip-baking, and so on.
This specialization has been great for spreading the costs and risks around, and for taking advantage of the economies of scale – TSMC and Global Foundries, for example, produce silicon for many different fabless chip designers.
Dying arts and fading fabs
But with more companies focusing on design rather than manufacture, Segars sees a danger. “The skills you need to close out the timing at the transistor level are becoming a dying art,” he warned.
“As we go forward, and start worrying about very exotic processes that we’re going to have to deal with in the future, those transistor skills are going to need to become very, very important once again,” he cautioned. “And as a designer, you’re going to have to worry about everything – from architecture down to transistors.”
But no matter how “disaggregated” the semiconductor industry becomes, eventually somebody has to actually manufacture the chips themselves. Segars reminded his audience of the famous quote from T.J. Rodgers of Cypress Semiconductor that “Real men have fabs,” but then pointed out the obvious fact that there are far fewer fab companies around than there used to be – and that such shrinkage in the manufacturing base poses its own problems.
As chip-baking process sizes shrink, Segars said, “We’ve seen the cost of developing processes go up and up and up, and now it costs you billions of dollars – as I’m sure everybody knows – to develop a new process. It costs you billions of dollars to buy all the equipment for it, and so fewer people are doing it.”
From a customer’s point of view, a small number of strong, efficient, advanced fabs is not a problem – in fact, the same economies of scale that make industry disaggregation a good thing make fewer, busier fabs a good thing.
Four customers does not a market make
But there is one potentially troubling problem, Segars said – and it’s not for the fabs or their customers, it’s for equipment suppliers. “The physics problems that you have to solve when you go to smaller and smaller geometries are getting harder and harder to solve, and so the cost of the equipment that you need for the next generation process goes up and up.”
That price inflation is not in and of itself the real market problem, though. “The problem is that the supply chain that builds that equipment for the foundries has a bit of trouble dealing with its return on investment.”
Simply put, the foundry-equipment market is shrinking. “When the world moved from 200-millimeter wafers to 300-millimeter wafers, if you were [fab-equipment manufacturer] ASML or somebody like that, you had a whole lot of customers that you could go and sell that equipment to.”
Now, however, the move to 450-millimeter wafers is the new hotness – which is great for economies of scale at the foundry level, but not so great for foundry-equipment makers. “There’s only going to be about four guys who are going to build those size wafers,” Segars said, “so if you’re doing all the R&D and your customer size is four, that is a bit of a problem.”
Moore’s law repealed
Finally, there’s the looming problem of the future of silicon process-size shrinkage: it can’t go on forever. One obvious limit, as Segars pointed out, is the 0.27-nanometer diameter of the silicon atom itself – as process sizes shrink down to, say, 14nm and below, you’re only talking about dozens of atoms per transistor gate.
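The “dozens of atoms” remark is easy to check with a quick sketch. It uses the 0.27nm atom diameter quoted above and simply divides the feature size by it; treating a gate as a row of atoms laid end to end is a simplifying assumption (real silicon is a crystal lattice), and the function name is purely illustrative.

```python
SILICON_ATOM_DIAMETER_NM = 0.27  # figure quoted in the talk


def atoms_across(feature_nm: float,
                 atom_nm: float = SILICON_ATOM_DIAMETER_NM) -> float:
    """Rough count of atoms laid end to end across a feature of a given width."""
    return feature_nm / atom_nm


for node_nm in (20, 14):
    # Simplified estimate: ignores the actual lattice spacing of silicon.
    print(f"{node_nm}nm feature: about {atoms_across(node_nm):.0f} atoms across")
```

At 14nm the estimate comes out at roughly 52 atoms, which squares with Segars’ description.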
But there are plenty of other challenges to be met before you’ll count the number of silicon atoms in a gate on your fingers and toes – namely, what lithography techniques can take you well below 20nm?
At the 20nm level, Segars said, “the problem is that you need to introduce double patterning.” Using two sets of masks to accomplish what one set could do at larger process sizes not only increases mask costs, but also slows down the throughput of the manufacturing.
If you want to keep the throughput at the same rate, you have to buy more equipment – which might make the ASMLs of the world happier, but driving up chip costs is not a good thing in a world that includes developing nations whose citizens are hankering to join the mobile world.
There’s one long-sought technology about which Segars remains a bit sceptical. “At 14 [nanometers] and below, what you really want is EUV,” he said, referring to extreme ultraviolet lithography, which has long been seen as a possible solution to the process-shrinking problem. EUV’s promise comes from the fact that it’s based on 13.5nm wavelength light – one hell of a lot more precise than the 193nm light that Segars said is used in today’s visible-light lithography.
“The problem is,” he said, “that [EUV] is really, really hard to make. You’ve got to make a plasma out of tin atoms, and then shoot it with a laser, and some light comes out – but the light’s really weak, and it gets absorbed by everything. So generating enough of it to economically build chips is very, very hard.”
After the endgame: core teamwork
But that’s not to say that there’s a dead-end on the road we’re travelling. Segars’ vision of the future jibes with the one described by fellow ARMian Jem Davies, the company’s vice president of technology, when speaking at AMD’s Fusion Summit this June – namely, that heterogeneous computing systems are the Next Big Thing.
Simply put, heterogeneous computing systems distribute a workload to various and sundry specialized compute engines – CPU, GPU, video, encryption, baseband, whatever – so that individual sub-tasks are completed efficiently by dedicated hardware best suited to them.
“I think the future of processing is heterogeneous multiprocessing,” Segars said, “… dedicated engines arranged in various clusters with a software layer that can understand the underlying hardware, and make sure that if it’s not needed, it’s shut off, it’s not leaking, to preserve that battery.”
There are a host of challenges to achieving the holy heterogeneous grail, of course – not the least of which being keeping all the various cores in close communication, and optimally data-coherent.
To that end, ARM’s upcoming Cortex-A15 compute core – which will likely appear in early 2013 – will introduce a cache coherent interconnect that will enable full coherency among multiple CPU clusters. Segars also projects that by 2015, coherency in ARM-based SoCs and systems will not be limited to CPUs, but will also allow full “where’s that data?” transparency among CPUs, GPUs, and specialized engines.
Full coherence, however, brings with it its own set of challenges, such as unwanted latency when far-flung cores and engines need to share the same data, but ARM, AMD, and Intel are all looking into how different approaches to coherency can help – or hinder – heterogeneity.
A lot has changed in the microprocessor world since the Intel 4004 appeared 40 years ago this November. By and large, the arc of improvement has been relatively straightforward, with improvements in process size, processing power, and miniaturization being fairly regular – achieved through one hell of a lot of work, to be sure, but regular nonetheless.
There’s been a lot of talk recently about the “post PC era”. From Segars’ point of view, however, we may also soon be talking about the “post–Moore’s Law” era – a time when computing advances are no longer measured in transistor counts per square millimeter, but rather in how quickly, intelligently, and cooperatively different cores and engines can communicate.
ARM daddy simulates human brain with million-chip super • The Register
IBM isn’t the only company that is attempting to emulate the human brain with new computing models. The original designer of ARM chips is conducting a similar experiment in academia.
While everyone in the IT racket is trying to figure out how many Intel Xeon and Atom chips can be replaced by ARM processors, Steve Furber, the main designer of the 32-bit ARM RISC processor at Acorn in the 1980s and now the ICL professor of engineering at the University of Manchester, is asking a different question, and that is: how many neurons can an ARM chip simulate?
The answer, according to Furber’s SpiNNaker project, which is being done in conjunction with Andrew Brown of the University of Southampton, is that an ARM core can simulate the activities of around 1,000 spiking neurons. And the SpiNNaker project is going to attempt to build a supercomputer cluster with 1 million processors to simulate the activities of around 1 billion neurons. Depending on who you ask – and who you are talking about, how old they are, and how much drinking and brown acid they have done – the human brain has somewhere on the order of 80 to 90 billion neurons. So even with the impressive million-core SpiNNaker machine, Furber and Brown are only going to be able to simulate about 1 per cent of the complexity inherent in the human brain.
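The scale arithmetic here can be reproduced in a few lines of Python. All the inputs are the article’s own round numbers (1,000 neurons per core, 1 million cores, 80 to 90 billion neurons in a brain); the function names are just illustrative.

```python
def simulated_neurons(cores: int, neurons_per_core: int) -> int:
    """Total neurons a machine can model if every core carries the same load."""
    return cores * neurons_per_core


def brain_fraction(simulated: int, brain_neurons: int) -> float:
    """Share of a brain's neuron count covered by the simulation."""
    return simulated / brain_neurons


total = simulated_neurons(cores=1_000_000, neurons_per_core=1_000)
low = brain_fraction(total, 90_000_000_000)   # against the 90 billion estimate
high = brain_fraction(total, 80_000_000_000)  # against the 80 billion estimate

print(f"Simulated neurons: {total:,}")
print(f"Share of a human brain: {low:.2%} to {high:.2%}")
```

That works out to a little over 1 per cent, in line with the figure quoted above.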
Scale model human brain
As Furber and Brown explain in their paper (PDF) describing the SpiNNaker project, they hope that by creating a silicon analog, they can simulate a more sophisticated neural network (including the spiking behavior that gets neurons to cause other neurons to fire and thus performing the data storage and data processing inside our heads) and get a better sense of how the brain really works. Something funky is taking place between the low-level function of a neuron, which is pretty well understood according to Furber and Brown, and the larger scale of the brain itself, which we can watch with magnetic resonance imaging. And it is not just thinking about sex, either. But the suspicion is that cognition has to do with the cumulative spiking effect between large numbers of neurons.
“Of greatest interest in this work is, of course, the fundamental question of how concurrency is exploited in the biology that we are trying to model,” the two researchers write. “The brain is itself a massively-parallel system comprising low performance asynchronous components. Those components, neurons, operate at timescales of a millisecond or greater, and the primary means of information exchange is through the emission of electrical ‘spike’ events. These spikes seem to carry no information in their amplitude or impulse, they are pure asynchronous events that carry information only in the time at which they occur.”
So where is information in the brain encoded? The oldest theory, say Furber and Brown, is that the spiking rate of a neuron is where data is encoded, but this theory, they say, doesn’t hold water. There is some speculation that data is encoded in the order in which populations of neurons fire, and this, among other things, is what the researchers hope to put to the test as they simulate a 1/100th scale human brain on a million ARM cores.
ARM choice is a no-brainer
Figuring out how the brain works is tough, and the processor and communication network design for the SpiNNaker system is easy by comparison.
It was a no-brainer that Furber, who designed the 32-bit ARM processor while at Acorn, would opt for an embedded variant of that chip for the SpiNNaker system. Not just because of his familiarity with the processor architecture and the wide variety of tools and expertise in customizing the ARM processors, but because of the energy efficiency inherent in the ARM design.
“Embedded processors can reduce the capital and energy costs of a given level of compute power by about an order of magnitude, thereby significantly reducing the ownership (and environmental) costs,” Furber and Brown write. “The embedded processor technology employed in SpiNNaker delivers a similar performance to a PC from each 20-processor node, for a component cost of around $20 and a power consumption under 1 watt.” …
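Scaling the quoted per-node figures up to the full machine gives a feel for the budget. This is only a sketch: it assumes the million-processor system is built entirely from the 20-processor nodes described in the quote, and it counts component cost and node power only, leaving out boards, cooling, and power-supply losses.

```python
def spinnaker_budget(processors: int = 1_000_000,
                     processors_per_node: int = 20,
                     cost_per_node_usd: int = 20,
                     max_watts_per_node: float = 1.0) -> dict:
    """Scale the per-node figures from the paper up to the whole machine."""
    nodes = processors // processors_per_node
    return {
        "nodes": nodes,
        "component_cost_usd": nodes * cost_per_node_usd,
        # Upper bound only: the quote says "under 1 watt" per node.
        "max_power_watts": nodes * max_watts_per_node,
    }


budget = spinnaker_budget()
print(f"Nodes: {budget['nodes']:,}")
print(f"Component cost: about ${budget['component_cost_usd']:,}")
print(f"Node power: under {budget['max_power_watts'] / 1000:.0f} kW")
```

On those assumptions the machine needs 50,000 nodes, about $1m in components, and less than 50 kW for the nodes themselves.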
The communications network-on-chip (NoC) device was created by Silistix, a company that Furber created and spun out of the UofM. On the die itself, there is an on-chip interconnect that allows the ARM cores to access memory, networking ports, and other shared resources. The design embodies what Furber and Brown call a Globally Asynchronous Locally Synchronous (GALS) architecture, which again doesn’t mean thinking about sex at all.
What it does mean is that the simulated neurons can fire off a pulse to any other simulated neuron in the million-core system in about 1 millisecond, which just so happens to be about as fast as your neurons do it. The resulting interconnect fabric links the cores together in a 2D mesh network, which can be used to model 3D brain structures. 2D mesh networks are common in massively parallel supercomputers, although they are certainly not the only way to lash machines together.
Neither Furber nor Brown thinks that the SpiNNaker machine will help solve the wetware riddle of the human brain. But they think that a million-ARM machine will go a long way towards helping researchers run better models of the brain on a system that acts more like the human brain than previous hardware did…
ARM server hero Calxeda lines up software super friends • The Register
Calxeda is beginning the arduous task of building up a software/developer ecosystem for its ARM-based server solution, and by the looks of it, its opening line-up is pretty strong.
With Intel’s top brass bad-mouthing ARM-based servers, upstart server chip maker Calxeda can’t let Intel do all the talking. It has to put together an ecosystem of hardware and software partners who believe there’s a place for a low-power, 32-bit ARM-based server platform in the data center.
To that end, Calxeda, formerly known as Smooth-Stone, is launching the “trailblazer initiative” - a team of 10 software companies that will support upcoming servers based on Calxeda’s impending ARM-based system-on-chip (SoC) designs.
The Calxeda ARM super friends include Autonomic Resources, Canonical, Caringo, Couchbase, Datastax, Eucalyptus Systems, Gluster, Momentum SI, Opscode, and Pervasive.
Canonical is, of course, the commercial sponsor of the Ubuntu distribution of Linux, which is now first in line as the server operating system of choice for Calxeda ARM-based servers.
Eucalyptus Systems provides a cloud framework that clones the API stack and operations of Amazon’s EC2 cloud. Canonical was a big champion of Eucalyptus but has now thrown its weight behind the OpenStack cloud framework for compute and storage clouds created initially by NASA and Rackspace Hosting and now the darling of the open source community.
If Windows happens, great
Microsoft is not one of the trailblazing software partners for Calxeda, but it would be interesting if it were. Karl Freund, the vice president of marketing at Calxeda, tells El Reg that the chip maker is building its business plan around Linux. “Microsoft has not made any announcements about Windows Server on ARM,” Freund warns. “If Windows happens, great. That’s all upside for us and we’ll love it if it happens.”
Opscode is used to automate cloud operations, and Autonomic Resources is a cloud computing provider to the US government, and Caringo and Gluster do cloudy and clustered storage. Couchbase provides a NoSQL database called Membase Server, and Datastax rides the Hadoop big data chewer atop the Cassandra distributed database with the Hive query language.
Pervasive has commercialized the PostgreSQL database in the past and now sells a database based on Btrieve called PSQL and a line of data integration and analytics products called DataRush. Momentum SI sells a set of tools to do load balancing and auto-scaling on private clouds based on VMware’s vCloud Director or the Eucalyptus framework.
“The biggest opportunity for us is in cloud and big data,” says Freund. “And now, when someone says, ‘Forget ARM servers’, we can say take a look at these companies. They think there is a latent market for ARM servers, and based on the ARM skillsets out there, that this is going to be easy.”
‘Multiple thousands of cores’
We do know that Calxeda is designing the interconnect to link “multiple thousands of cores” together, albeit not in a cache-coherent manner but more like what you do with a rack of servers or multiple racks of servers in a cluster or cloud. The initial reference architecture machines pack 120 server nodes (that’s 480 cores) in a 2U chassis.
The Calxeda super friends will get early access to hardware, which should be in their hands by the end of the year. End-user trailblazers will also get machines at this time.
Calxeda has not said exactly when its chips will be ready for distribution for commercial use, or who will make systems based on its ARM SoCs. But Freund says that each partner that commits to building systems will have a handful of real customers putting the machines through the paces on a beta test by the end of this year.
Calxeda boasts of 5 watt ARM server node • The Register
More on Calxeda… their marketing head happens to be Karl Freund - formerly of the IBM mainframe marketing group…
The precise feeds and speeds of the Calxeda chips are still not known, but Freund knows from his days of marketing Power systems at Big Blue that putting out some details can prep a market for consumption – as IBM most certainly did in the two years before its dual-core Power4 chips hit the market in the autumn of 2001. That Power4 chip and its successors turned IBM from a joke in the Unix server business to the dominant Unix system supplier a decade later.
“Anybody with a few million dollars can produce an ARM chip. So what makes us different?” Freund asks.
Not the core, that’s for sure. Calxeda is starting with the Cortex-A9 core, which is the current 32-bit part that can be licensed from ARM Holdings. (You can license the 40-bit, virtualization-assisted Cortex-A15 part from ARM Holdings, too, but that won’t be a production product for another two years or so.)
“What you don’t want to do with a server chip is build your first product on a technology that is not proven yet,” says Freund. “You have to wait for the design to settle down, as the Cortex-A9 has. If you’re out on the front end of things, you can get into trouble.” …
Freund says that a quad-core A9-derived processor, plus its memory controller, the DDR3 memory module, and the on-chip fabric interconnect will burn only 5 watts. Clock speeds were not divulged, but it will probably be somewhere between 1 GHz and 2 GHz. That is less juice than a fat DDR3 memory stick uses, forget about the Intel or AMD x64 chip.
“This gives us extremely high levels of density,” says Freund. And, the fabric interconnect will allow for “multiple thousands of cores” to be lashed together and controlled as a unit. (But not in a cache-coherent, shared memory manner. Don’t get the wrong idea.)
The Cortex-A9 does not have any circuits to do virtualization, but Freund says that on the workloads that Calxeda expects customers to use the chip for, they won’t need hypervisors to carve up the servers. They will already have parallelized workloads that span thousands of nodes that run at very high utilization rates. On an x64 server, you use a hypervisor to plunk multiple server images on one set of chips, workloads that might only consume 5, 10, 15, or 20 per cent of the raw CPU capacity by themselves, driving up utilization of the overall system.
That said, hypervisors and their control freak add-ons are also useful for managing workloads and spreading running workloads around a cluster of machines. Freund says that Calxeda is participating in the OpenStack cloud fabric effort to see how to adapt these tools to manage bare-metal images instead of virtual images on machines using its ARM variants. The Linux community is also working on software container technology for ARM chips, according to Freund, which could be useful for some workloads.
Calxeda is not going to make and sell servers, but rather make chips and reference machines that it hopes other server makers will pick up and sell in their product lines. The company hopes to start sampling its first ARM chips and reference servers later this year. The first reference machine has 120 server nodes in a 2U rack-mounted format, and the fabric linking the nodes together internally can be extended to interconnect multiple enclosures together.
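The density claim is easier to appreciate with the numbers multiplied out. The sketch below takes the article’s figures (120 quad-core nodes per 2U chassis, 5 watts per node) and extends them to a full rack; the 42U rack height is an assumption for illustration, and the wattage covers the server nodes only, not fans, disks, or power-supply losses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chassis:
    """Reference chassis using the figures quoted in the article."""
    nodes: int = 120
    cores_per_node: int = 4
    watts_per_node: float = 5.0
    rack_units: int = 2

    @property
    def cores(self) -> int:
        return self.nodes * self.cores_per_node

    @property
    def node_watts(self) -> float:
        return self.nodes * self.watts_per_node


def rack_totals(chassis: Chassis, rack_height_u: int = 42) -> dict:
    """Totals for a rack filled with identical chassis (42U is an assumption)."""
    count = rack_height_u // chassis.rack_units
    return {
        "chassis": count,
        "nodes": count * chassis.nodes,
        "cores": count * chassis.cores,
        "node_watts": count * chassis.node_watts,
    }


box = Chassis()
rack = rack_totals(box)
print(f"Per 2U chassis: {box.cores} cores, {box.node_watts:.0f} W for the nodes")
print(f"Per 42U rack: {rack['cores']:,} cores, {rack['node_watts'] / 1000:.1f} kW for the nodes")
```

On those assumptions, one chassis holds 480 cores at 600 watts for the nodes, and a fully populated rack passes 10,000 cores, which is where the “multiple thousands of cores” fabric starts to matter.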
The initial workloads that Calxeda is targeting include internet-scale web serving, of course, as well as streaming content delivery (so long as it doesn’t need compute-intensive DRM), small web application hosting, storage controllers, and big data analytics.
“NoSQL and MapReduce are a beautiful fit for these servers because of the ratio of CPU, memory, and disk and the performance per watt,” says Freund.