
Squeezing Supercomputers onto a Chip

Linda Dailey Paulson

Pages: 21-23

An ongoing trend in computing and communications technologies has been to deliver more performance in smaller, more convenient packages. Servers, laptops, cellular phones, and consumer electronics all exemplify this trend.

Supercomputer functionality is another recent example. High-performance computing once took place only on systems housed on an entire floor of a building. Now, the functionality is appearing on a single chip.

According to Doug Burger, an associate professor at the University of Texas at Austin, a long-term performance goal for supercomputer-on-a-chip technology is 1 teraflops. Currently, though, a supercomputer-on-a-chip is expected to perform between 100 and 200 gigaflops, said Srinidhi Varadarajan, director of the Virginia Tech Terascale Computing Facility.

Companies are interested in the new chips because they provide a supercomputer's performance and functionality in a small package, which makes them good for devices and settings that haven't used such capabilities before, such as gaming consoles and media servers, Burger said.

In addition, the research is yielding advances that can be applied to basic chip design.

Providing a complex computational engine inside a single processor requires specialized chip architectures, more effective pipelining, and better communications between chip elements, said Varadarajan.

Investigating these matters are several major commercial and academic supercomputer-on-a-chip projects, including Cell, under development by IBM, Sony Computer Entertainment, and Toshiba; MDGRAPE-3 by Japan's Riken; and TRIPS (Tera-Op Reliable Intelligently Adaptive Processing System) from IBM and the University of Texas at Austin.

Some of these projects are initially looking at ways to build chips that would apply high-performance computing to specific tasks such as molecular-dynamics simulations, and others hope to develop technology that is either flexible or adaptive enough to be used for many computation-intensive tasks.


Researchers are pursuing various ways of developing chips with supercomputer-level capabilities. For example, designs with multiple processing cores operating in parallel can boost performance. Thus, some of the new supercomputers-on-a-chip run at lower clock speeds than Pentium processors, which makes them more energy efficient, yet they still yield higher overall performance through parallelism.
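The tradeoff can be made concrete with rough numbers. The figures below are purely illustrative, not measurements of any actual chip, and the cubic power model is only a common first-order approximation (dynamic power scales with frequency times voltage squared, and supply voltage tends to scale with frequency):

```python
# Illustrative comparison: one fast core vs. several slower cores.
# Throughput = cores x ops-per-cycle x clock; all figures are made up.
fast_core_throughput = 1 * 2 * 3.0e9    # 1 core, 2 ops/cycle, 3 GHz   -> 6 Gops/s
many_core_throughput = 8 * 2 * 0.5e9    # 8 cores, 2 ops/cycle, 500 MHz -> 8 Gops/s

# First-order dynamic-power model: power per core ~ frequency^3
# (since P ~ f * V^2 and V scales roughly with f).
fast_core_power = 1 * 3.0 ** 3          # arbitrary units
many_core_power = 8 * 0.5 ** 3

print(many_core_throughput > fast_core_throughput)  # more total throughput
print(many_core_power < fast_core_power)            # less total power
```

Under this simple model, the many-core design delivers more aggregate throughput at a fraction of the power, which is the bet these projects are making.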

Research projects are also using small chip-feature sizes, enabling them to pack processors with more transistors and circuitry and thereby boost performance.

Jim Kahle, director of broadband processor technology and a Fellow at IBM, said effective integration of chip elements as part of an optimized architecture may also help improve performance.

Communications bottlenecks, caused by on- and off-board interconnects, are an important challenge for supercomputer-on-a-chip technology, noted Professor Marc Snir, chair of the University of Illinois at Urbana-Champaign's Computer Science Department.

To cope with this, chip designers can put specialty interconnects or a high-bandwidth memory bus on a chip, according to Snir.

The issue then becomes what type of instructions the chip will use to leverage its high-performance architecture, Kahle explained. For example, the programming could include vector instructions or enable multithreading. These are among the most challenging programming issues that face supercomputer-on-a-chip designers.

These designers also face extreme versions of the challenges that confront processor developers, including power consumption and heat buildup. High-performance chips generate considerable heat. Researchers are thus looking at various ways to reduce heat levels such as adjusting system clocks, in some cases by figuring out ways to turn them on when needed and off when not.


IBM, Sony, and Toshiba are working on the Cell supercomputer-on-a-chip, a multimedia processor designed initially for use in consumer-entertainment devices that work with broadband networks and the volumes of high-speed multimedia traffic they deliver.

For example, Sony has announced that Cell chips will power the next version of its PlayStation game console. The first product slated to use Cell, later this year, will be an IBM-Sony workstation primarily for handling computer animation and other demanding graphics tasks. Sony and Toshiba plan to begin selling Cell-powered high-definition TVs next year.

Cell will have multiple processing cores, based on IBM's Power architecture, that can divide computing tasks and share information using parallel- and distributed-computing techniques.

The chip works with 90-nanometer (0.09-micron) feature sizes, which is state of the art, according to IBM's Kahle. Cell uses IBM's silicon-on-insulator technology, in which pure crystal silicon sits on pure silicon-oxide insulation. The materials and their purity let chips operate faster, more efficiently, and cooler.

Cell also uses low-K dielectric circuit insulation to eliminate crosstalk—in which signals from one circuit disrupt signals in another circuit—and thereby enable signals to move faster through the chip. In addition, the processor will incorporate new power-saving techniques, on which IBM declined to elaborate.

Engineers are optimizing Cell chips for video-, audio-, and other multimedia-packet processing. Multimedia optimization will make the chip particularly useful in consumer-electronics devices, which, Kahle said, represent one of the few areas in which consumers are demanding more processing power.

Nonetheless, he said, the three participating companies are still investigating the types of applications in which this technology could be used. They hope its architecture will be flexible enough to work with various types of applications, he noted.

Multiple Cell chips will be able to link to and work in parallel with one another as the building blocks of larger grid systems, which combine the resources of many computers to solve a single problem.


Riken, Japan's Institute of Physical and Chemical Research, is continuing work on its MDGRAPE (molecular dynamics gravity pipeline) family of supercomputer-on-a-chip technologies.

The University of Tokyo initiated the GRAPE project 15 years ago to develop a supercomputer for astrophysics research. Riken, one of the world's largest biosciences institutes, continued the project for its research.

The latest iteration is the MDGRAPE-3, designed specifically for computationally intensive molecular-dynamics simulations, frequently used in protein science and pharmaceutical development. MDGRAPE-3, shown in Figure 1, can also be used in processes such as genome sequence analysis and complex chemical and matrix calculations.


Figure 1   Riken uses its MDGRAPE-3 supercomputer-on-a-chip for computationally intensive molecular-dynamics simulations, including calculations of the forces on the atoms that make up molecules. The chip achieves high performance in part via 20 pipelines that break up calculations and perform them in parallel. The chip requires only a small memory and has independent input and output to connect to many chips serially.

The chip performs particularly well for such applications because it is specialized for workloads that involve numerous, similar calculations on a comparatively small data set. Having circuitry dedicated to one type of calculation makes MDGRAPE-3 different from other supercomputer-on-a-chip systems.

Also unlike them, MDGRAPE-3 would not be used on a stand-alone basis but instead would accelerate work performed by a larger host system, explained Makoto Taiji, leader of the High-Performance Biocomputing-Research Team at Riken's Genomic Sciences Center. For example, he said, "In simulations, the host machine sends positions of atoms, then MDGRAPE calculates the forces on atoms and returns the results."
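Taiji's description of the division of labor can be sketched in a few lines. This is an illustration of the host/accelerator pattern only, with hypothetical names and a toy one-dimensional inverse-square force; MDGRAPE-3's actual interface is a hardware pipeline evaluating real molecular-dynamics force fields:

```python
def accelerator_forces(positions):
    """The accelerator's job: compute the pairwise force on each atom.
    Toy 1D inverse-square attraction; real MD uses 3D Coulomb and
    van der Waals terms."""
    n = len(positions)
    forces = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = positions[j] - positions[i]
            forces[i] += (1.0 if dx > 0 else -1.0) / (dx * dx)
    return forces

# Host side: send atom positions, receive forces, integrate one time step.
positions = [0.0, 1.0, 3.0]
velocities = [0.0, 0.0, 0.0]
dt = 0.01
forces = accelerator_forces(positions)   # "MDGRAPE calculates the forces"
velocities = [v + f * dt for v, f in zip(velocities, forces)]
positions = [p + v * dt for p, v in zip(positions, velocities)]
```

The host retains the simulation state and the time-stepping loop; only the all-pairs force evaluation, which dominates the computation, is shipped to the accelerator.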

"We achieved highly parallel operations by adopting a specialized architecture," said Taiji. The architecture includes 20 dedicated pipelines, 160 floating-point multipliers, 180 adders, and 20 function-evaluation units.

The processor's broadcast memory architecture—which connects the memory output to all the pipelines and force-feeds them data at high rates—helps the chip perform 660 operations per cycle and thus run up to 230 gigaflops at 350 MHz.

MDGRAPE-3, slated for release this spring, has 130-nanometer (0.13-micron) feature sizes. Because the chips operate at low frequencies, they are power efficient, consuming 14 watts of power at 250 MHz and 19 watts at 350 MHz.

Taiji says the team's goal is to make the world's first petaflop (1 quadrillion flops) machine, using 6,144 of the MDGRAPE-3 processors on 512 boards clustered in 32 boxes.
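The quoted figures are internally consistent, as a quick back-of-the-envelope check shows (the 12-chips-per-board count is inferred from the stated totals, not given in the article):

```python
# Per-chip peak: 660 operations per cycle at a 350-MHz clock.
per_chip_flops = 660 * 350e6
print(per_chip_flops / 1e9)     # 231.0 -- matching the "up to 230 gigaflops" figure

# System: 6,144 chips on 512 boards (12 per board) clustered in 32 boxes.
system_flops = 6144 * 230e9     # using the quoted 230-gigaflop per-chip rate
print(system_flops / 1e15)      # ~1.41 petaflops -- comfortably past the 1-petaflop goal
```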

"We have already developed the chip and the system board," Taiji noted. "The overall system will appear in 2006."


IBM and the University of Texas at Austin are collaborating on building a general-purpose supercomputer-on-a-chip, suitable for many types of applications.

The project aims to create a chip that provides a large pool of arithmetic-logic and floating-point units linked by on-chip networks, explained the University of Texas' Burger.

According to Burger, the key to the technology is flexibility and the use of two large, scalable processing cores that exploit high levels of parallelism from single or multiple threads.

TRIPS is highly adaptable and can work with different types of hardware and software, said Jeff Burns, IBM Research's manager of advanced projects for emerging systems technologies. The chip can change its characteristics so that it can work with the way different systems use data, such as via streaming- or cache-based workloads, Burns explained.

The processors will have 130-nanometer feature sizes, but the technology could function at much smaller feature sizes, according to Burger.

Researchers expect their prototype TRIPS chip, slated for release later this year, to run at 500 MHz, with each of the two cores handling 16 operations per clock cycle, for a total of 16 gigaflops.

The chip will perform at high levels in part because it will have about 250 million transistors, as well as processor cores that each have 16 arithmetic units with integer and floating-point functionality.

This internal architecture will run single-threaded or multithreaded instructions in parallel. The system breaks threads into large blocks of up to 128 instructions that it maps to the arithmetic logic units, which then process the information simultaneously, explained Burns. The system fills its ALUs to get as many instructions running per cycle as possible, Burger added.
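The block-mapping idea that Burns describes can be sketched as follows. This is a toy illustration of the concept, with a naive round-robin placement; the real TRIPS compiler performs dataflow analysis and places dependent instructions on nearby ALUs:

```python
BLOCK_SIZE = 128   # up to 128 instructions per block, as described above
ALUS = 16          # one TRIPS core's 16 arithmetic units

def into_blocks(instructions, block_size=BLOCK_SIZE):
    """Split a straight-line instruction stream into fixed-size blocks."""
    return [instructions[i:i + block_size]
            for i in range(0, len(instructions), block_size)]

def map_block_to_alus(block, alus=ALUS):
    """Naive round-robin placement of one block's instructions onto the ALUs."""
    placement = {a: [] for a in range(alus)}
    for idx, instr in enumerate(block):
        placement[idx % alus].append(instr)
    return placement

program = [f"op{i}" for i in range(300)]   # a hypothetical 300-instruction thread
blocks = into_blocks(program)              # blocks of 128, 128, and 44 instructions
placement = map_block_to_alus(blocks[0])   # each ALU gets 8 of the block's 128 ops
```

Because each ALU holds several of a block's instructions, independent instructions across the grid can issue in the same cycle, which is how the design fills its ALUs to maximize instructions per cycle.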

By handling data in blocks, running many functions only once per block rather than once per instruction, TRIPS is energy efficient, Burns explained.

To mitigate on-chip communication delays, he noted, the compiler schedules applications so that the critical dataflow paths within block instructions are placed along nearby ALUs.

The TRIPS architecture and on-chip memory system are configurable so that the processor can handle different types of workflows and thus run multiple kinds of tasks. The TRIPS research team has experimented with using the chip for traditional numeric supercomputing tasks, graphics applications, and signal processing. Gaming and networking would also have market potential, Burns said.

According to Burger, the team expects to have a commercial TRIPS chip in 2009 and a processor that performs 1 teraflops by 2012.


The new chips will face some serious obstacles on the road to possible success. Initially, for example, demand for the chips will be low and they will be expensive, as is the case with many new technologies.

During this slow period, the technology will face competition from proven, less-expensive technologies that make supercomputing-level performance available now. For example, users can assemble Pentium processors or powerful devices such as PlayStation gaming consoles to create ad hoc supercomputers, or they can utilize the massively distributed processing power of multiple PCs, explained Will Strauss, president and principal analyst at Forward Concepts, a market research firm.

Meanwhile, vendors must find a market for the new chips. The commercial demand for high-performance computing is limited, noted Varadarajan, so supercomputer-on-a-chip developers are looking to the graphics and entertainment markets, including gaming devices and consumer electronics. Eventually, he said, the technology could even be used in general-purpose, desktop computing.

Thus, Strauss explained, the key to the new chips' acceptance in multiple markets will be having the adaptability necessary to handle the various tasks that these potential markets require.

The processors will also have to be able to work together in larger systems, such as grids, that require more than one supercomputer-on-a-chip, said Burton Smith, chief scientist at supercomputer maker Cray. However, Smith predicted, research should yield these capabilities, as well as improvements in areas such as architecture, on-chip communication, and parallelism.

At the crux of making supercomputer-on-a-chip systems effective, he noted, will be the programming necessary to get the most out of the circuitry and architecture. However, he explained, there are few programming tools and no programming languages optimized for this purpose.

"We've been trying to solve this for a while," he said. "It's really brutal."

About the Author

Linda Dailey Paulson is a freelance technology writer based in Ventura, California. Contact her at