More Cores Keep Power Down
by George Lawton
Startup chipmaker Tilera has announced plans for the world's first 100-core chip, which targets cloud computing, networking, and media-processing applications. The company claims the processor will deliver four times the performance of any other microprocessor yet announced and 10 times the compute performance per watt of Intel's next-generation Westmere processors.
The new chip is part of the company's planned Tile-Gx family. The 100-core Tile-Gx100 is expected to be available in the first quarter of 2011. It aims at specialty applications rather than general-purpose PCs and servers. "We aren't targeting PC motherboards or even a general-purpose server," said Tilera director of marketing Bob Doud. "Our current target markets include infrastructure applications such as cloud computing and embedded systems such as routers, security appliances, video conferencing gear, and wireless network base stations."
The company attributes the chip's scaling power to a novel interconnect technology and a unique technique for sharing cache memory. These innovations make scaling to 100 cores possible without having to redesign the cores or chip architecture.
Multicore Semantics
The Tilera architecture uses identically sized cores arranged in a square 2D grid. Consequently, every chip contains a perfect-square number of cores: 16, 36, 64, or 100.
Other companies have claimed to develop chips with more cores, but the cores can't run a complete operating system (OS) independently. Rather, a much smaller number of master engines run an OS and connect to hundreds of smaller execution units.
Chip giants Intel and AMD are just starting to push 8-core chips for complex-instruction-set computers (CISC), which use relatively large cores with deep pipelines. These chips can execute many instructions in parallel and perform dynamic branch predictions. Their large size also enables them to execute four times as many instructions per clock cycle as a Tilera core, but they can't scale as well, said Doud.
CISC designs are also more power hungry. The processor spends considerable compute cycles and hence power doing predictive computations, and much of this work gets thrown away once the program logic goes in a particular direction.
The Tile-Gx family is a reduced-instruction-set computer (RISC), which means it relies on very fast execution of a simpler instruction set. It also means the Gx family doesn't run an x86 instruction set, so it's not binary-compatible with programs that run on Intel and AMD chips.
New Interconnect
Tilera's iMesh interconnect technology replaces a traditional bus with much shorter links between the switches connected to each core. Each switch is connected by five or six mesh links (depending on the particular model) to all four adjacent tiles. This approach keeps the wires short and the power usage down, in contrast with other interconnects in which the wires run much longer — sometimes even the length of the chip.
Each tile can move data at up to 4 terabits per second to its neighbor tiles. This substantial bandwidth minimizes flow control and reduces chip bottlenecks. However, each hop introduces a 1-clock latency that grows as the messages travel across multiple switches. To reduce this latency, Tilera's developer and runtime software helps keep interrelated processes on multiple cores near each other.
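To make the per-hop cost concrete, the following minimal sketch estimates how many clock cycles a message spends crossing the mesh. It assumes messages take a shortest path along rows and columns, so the hop count equals the Manhattan distance between tiles; the article states only the 1-clock-per-hop figure, and the routing rule and function names here are illustrative assumptions, not Tilera's documented behavior.

```python
def mesh_hops(src, dst):
    """Hops between two tiles given as (row, col) on a 2D mesh.

    Assumes shortest-path routing along rows and columns
    (an assumption; the article does not describe the routing rule).
    """
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def hop_latency_cycles(src, dst, cycles_per_hop=1):
    """Latency in clock cycles, using the article's 1 clock per hop."""
    return mesh_hops(src, dst) * cycles_per_hop


def worst_case_hops(side):
    """Corner-to-corner hop count on a side x side grid."""
    return 2 * (side - 1)


if __name__ == "__main__":
    # Neighboring tiles versus opposite corners of a 10 x 10 grid.
    print(hop_latency_cycles((2, 3), (2, 4)))
    print(hop_latency_cycles((0, 0), (9, 9)))
    print(worst_case_hops(10))
```

Under these assumptions, adjacent tiles are one cycle apart while opposite corners of the 100-core grid are 18 cycles apart, which is why placing interrelated processes on nearby cores matters.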
Peter Glaskowsky, former editor of Microprocessor Report and now a CNET blogger, said the new chip's weakest link is the 40 interconnects at the array's edges. If the top row of 10 cores is busy accessing the top two DRAM controllers, he explained, DRAM accesses from the rest of the cores will have to wait. In general, the total DRAM bandwidth is similar to or slightly better than that of the highest-end 8-core server processor. If DRAM links for an application are saturated, it becomes more difficult to use the other on-chip cores. Glaskowsky said the Tile-Gx's large amount of cache will help a lot with this issue.
Coherent Memory
One challenge in writing applications for multicore chips lies in getting multiple threads on different processors to address shared memory locations. One way around this problem is to use larger caches that are shared across multiple cores, but these larger memories consume more power.
Another approach is to write applications using message passing between threads, but this adds another layer of complexity and makes programming more difficult. It also requires completely rewriting the programs from the ground up when the memory addresses exceed a given size.
With cache coherency, multiple threads running on different cores can share the same memory location, making programming easier. Tilera has developed a technique for efficiently sharing the caches across multiple cores. Doud said the technique improves scalability and consumes less power than a larger shared cache.
The total shared memory grows with the number of cores. In the Tile-Gx architecture, each core has 256 Kbytes of L2 cache. A core can peek into another core's cache before making a call to external memory.
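As a rough illustration of how the shared cache scales, the sketch below multiplies the 256 Kbytes of L2 per core by the core counts of the square grids described earlier. It is simple arithmetic rather than a description of the hardware: how much of the aggregate a single core can actually use through cache sharing is not stated in the article, so the totals are best read as upper bounds. The function name is hypothetical.

```python
L2_KBYTES_PER_CORE = 256  # per-core L2 size given in the article


def aggregate_l2_kbytes(side):
    """Total L2 across a side x side grid of cores, in Kbytes."""
    cores = side * side
    return cores * L2_KBYTES_PER_CORE


if __name__ == "__main__":
    for side in (4, 6, 8, 10):
        total_kb = aggregate_l2_kbytes(side)
        # 1 Mbyte is taken as 1,024 Kbytes here.
        print(f"{side * side:3d} cores: {total_kb} Kbytes "
              f"({total_kb / 1024:g} Mbytes)")
```

On these figures, the aggregate L2 grows from 4 Mbytes on a 16-core part to 25 Mbytes on the 100-core part.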
Programmability
Programmability has been the downfall of many new processor architectures. It was the main problem with network processing units that were optimized for communication products such as routers, said Doud.
Will Strauss, senior analyst for Forward Concepts, an electronics market research firm, said the Tile-Gx100 might be the first easy-to-program, massively parallel chip. "There have been many massively parallel chips in the past," he said, "but they were not easy to program."
The processor communication is also unique. Companies have strung hundreds of cores together on a die before, but each core must be programmed independently, as if the others didn't exist. According to Strauss, Tilera has developed an architecture that makes it easy to program the cores all together.
Despite these improvements, Glaskowsky said that multicore programming will never be as easy as single-core programming and, in some cases, won't be possible at all. "There's no exception for Amdahl's Law," he said, referring to the limit that a program's serial portion places on the speedup achievable by adding processors. "Some portions of some apps are inherently serial. As for the portions that can be parallelized, some will be easier to implement than others."
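Amdahl's Law can be stated compactly: if a fraction p of a program's work can be parallelized across n cores, the best possible speedup is 1 / ((1 - p) + p / n). The short sketch below evaluates that standard formula. The 95 percent figure in the example is an illustrative assumption, not a measurement from Tilera or from any application named in this article.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup under Amdahl's Law.

    parallel_fraction: share of the work that can run in parallel (0 to 1).
    cores: number of cores applied to the parallel portion.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload that is 95 percent parallelizable.
    for n in (16, 36, 64, 100):
        print(n, round(amdahl_speedup(0.95, n), 1))
```

Even with 95 percent of the work parallelized, 100 cores yield a speedup of less than 17, and no number of cores can push it past 20, because the serial 5 percent still runs on a single core.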
Glaskowsky said Tilera could help overcome these challenges by supporting OpenCL, a framework for implementing inherently parallel algorithms, and by developing more automated expert-system-based tools to detect and alleviate interconnect and peripheral bottlenecks.
He also said the Tile-Gx needs floating-point support for the general-purpose applications envisioned for cloud computing. "It's hard to imagine how a compute server without floating-point hardware can be competitive in [the cloud] market."
George Lawton is a freelance technology writer based in Monte Rio, California. Contact him at glawton@glawton.com.