Issue No. 02 - March/April (2001 vol. 21)
The current revolution in computation and communication demands integrated circuits or chips of constantly increasing speed and capability. Most of us have processors on our desks that exceed 1980s-era supercomputer capabilities because microprocessor performance increases by a factor of 100 each decade. Exponentially faster chips for packet processing, routing, and circuit switching in part enable the exponential growth in Internet bandwidth. Most media-oriented consumer electronics devices are based on chips for media processing or data conversion.
Over its 12-year history, the Hot Chips conference has become the leading forum for discussing the latest chips for computing, communication, and networking. The conference covers technical details of these chips across the board, including technology, fundamental algorithms, architecture, and circuit details. Hot Chips is unique in that it's a product-focused technical conference. The emphasis is on technical details of real chips—not theoretical research or marketing hype.
The Hot Chips 12 program, held 13-15 August 2000 at Stanford University, consisted of 25 superb presentations selected from a field of high-quality submissions. The presentations reflected industry trends in technology, applications, and metrics. Most striking was the shift in emphasis from general-purpose microprocessors, which dominated the conference in previous years, to network infrastructure chips such as packet processors and switch fabrics. Also notable was the shift in metrics from absolute performance toward power efficiency.
In this special issue of IEEE Micro, we bring you the very best of these presentations expanded to the form of full articles.
The Internet's exponential growth is driving the development of a host of networking infrastructure chips. Switch fabrics form the foundation of routers and switches. Both fixed-function and programmable packet processors interpret arriving packets to handle tasks such as forwarding and scheduling. Network infrastructure chips were well represented at Hot Chips 12 with three dedicated sessions.
O'Connor and Gomez's article describes the iFlow Address Processor (iAP), a fixed-function packet processor for route lookup. The chip performs the longest-matching-prefix search function needed for classless interdomain routing. The chip combines a novel architecture with an aggressive on-chip DRAM. The entire routing table and its associated statistics are held on chip in 52 Mbytes of DRAM. The iAP achieves a rate of 66 million lookups per second using a hardware-embedded B-tree structure that exploits very wide (up to 2,000 bit) on-chip RAMs. The chip is almost entirely RAM—essentially a memory chip with some added logic.
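To make the longest-matching-prefix function concrete, here is a minimal software sketch. It is not the iAP's hardware-embedded B-tree; it assumes IPv4 addresses and a hypothetical `RouteTable` class that keeps one hash table per prefix length and probes them from longest to shortest. The routes and next-hop names are invented for illustration.

```python
import ipaddress


class RouteTable:
    """Toy longest-matching-prefix table (hypothetical, IPv4 only)."""

    def __init__(self):
        # by_length[n] maps an n-bit network address (as an int) to a next hop.
        self.by_length = {}

    def add(self, cidr, next_hop):
        net = ipaddress.ip_network(cidr)
        table = self.by_length.setdefault(net.prefixlen, {})
        table[int(net.network_address)] = next_hop

    def lookup(self, address):
        addr = int(ipaddress.ip_address(address))
        # Probe the longest prefixes first so the most specific route wins.
        for length in sorted(self.by_length, reverse=True):
            mask = ((1 << length) - 1) << (32 - length)
            hop = self.by_length[length].get(addr & mask)
            if hop is not None:
                return hop
        return None


# Illustrative routes only; the names are made up.
routes = RouteTable()
routes.add("0.0.0.0/0", "default")
routes.add("10.0.0.0/8", "port-A")
routes.add("10.1.0.0/16", "port-B")

specific = routes.lookup("10.1.2.3")     # matches /16, /8, and /0; /16 wins
broader = routes.lookup("10.2.0.1")      # matches /8 and /0; /8 wins
fallback = routes.lookup("192.168.1.1")  # only the default route matches
```

A software table like this may need one probe per prefix length, which is part of why a dedicated lookup chip with very wide on-chip memories is attractive at tens of millions of lookups per second.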
Ultimately, the Moore's Law scaling that drives continuing improvements in integrated circuit speed and capability will cease. At the point where chip dimensions become comparable to atomic dimensions, no further scaling will be possible. Quantum computing is based on quantum phenomena arising at such atomic scales. It may allow further computational speedups for select problems by using quantum algorithms that have a much lower computational complexity than classical algorithms.
Steffen, Vandersypen, and Chuang give an accessible overview of quantum computing and present some exciting results on the construction of a five-qubit computer. In a quantum computer, each qubit can exist in a superposition state where it's in state 0 with probability |a|² and state 1 with probability |b|². In effect, each bit exists in both a 0 and a 1 at the same time. Similarly, a five-qubit computer can exist in all 32 states simultaneously, with different probabilities. Applying functions to all 32 states simultaneously achieves quantum parallelism. The authors show how this quantum parallelism realizes an efficient quantum Fourier transform (QFT) algorithm, a key step in Shor's factoring algorithm.
Steffen et al. have constructed a five-qubit quantum computer based on nuclear magnetic resonance. They encode the qubits in the spins of five atoms from one molecule. The qubits communicate with one another via the molecule's atomic bonds. While this computer operates at only 215 Hz and can perform at most a few hundred operations before spin coherence is lost, it can successfully implement a short version of the QFT algorithm. While significant challenges remain to scale up quantum computation in terms of qubit number, operation speed, and coherence time, this is clearly a promising technology.
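The superposition and QFT ideas can be illustrated with a small classical simulation. This is a sketch under stated assumptions, not the authors' NMR experiment: it stores all 32 complex amplitudes of a five-qubit register in a Python list and applies the textbook QFT definition directly, which costs a classical machine work proportional to the square of the number of states. The function names and the example amplitudes are hypothetical.

```python
import cmath
import math

N_QUBITS = 5
N_STATES = 2 ** N_QUBITS  # 32 basis states for five qubits


def probabilities(state):
    """Measurement probability of each basis state: |amplitude| squared."""
    return [abs(amp) ** 2 for amp in state]


def qft(state):
    """Textbook QFT: y_k = (1/sqrt(N)) * sum_m x_m * exp(2*pi*i*m*k/N)."""
    n = len(state)
    scale = 1 / math.sqrt(n)
    return [
        scale * sum(state[m] * cmath.exp(2j * math.pi * m * k / n)
                    for m in range(n))
        for k in range(n)
    ]


# One qubit a|0> + b|1>, with example amplitudes chosen so |a|^2 + |b|^2 = 1.
a, b = math.sqrt(0.25), math.sqrt(0.75)
single_qubit = [a, b]

# Five qubits starting in |00000>; the QFT spreads this over all 32 states.
zero_state = [1.0] + [0.0] * (N_STATES - 1)
uniform = qft(zero_state)
```

Here `probabilities(single_qubit)` gives 0.25 and 0.75, and `probabilities(uniform)` gives 1/32 for every basis state, so the register is in an equal superposition of all 32 states.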
Computers are increasingly used to process media such as video, audio, and still images. Applications that synthesize, transform, modulate, demodulate, compress, or decompress media data present processors with significantly different challenges than do traditional integer or floating-point workloads. These media applications are a poor match for general-purpose microprocessors because they often demand both high absolute performance (many billions of operations per second, or GOPS) and power efficiency (many GOPS per watt). Many handheld devices—in which power is scarce—have embedded media processors.
Khailany et al. describe Imagine, a media processor that employs a stream architecture. Expressing a media application as a set of kernels that operate on data record streams exposes an application's inherent concurrency and locality. The processor uses 48 arithmetic units in parallel to exploit six-wide instruction-level parallelism by eight-wide data parallelism.
To take advantage of locality, intermediate kernel results pass through a hierarchy of storage, from wide, high-latency memory to narrow, low-latency memory. This concept reduces the internal and external bandwidth demand by one and two orders of magnitude, respectively. A prototype in the final stages of design will have a peak performance of 20 Gflops for 32-bit floating-point and 40 Gflops for 16-bit fixed-point computation. Simulations show that this design sustains from 5 to 20 GOPS across a range of applications without exceeding available memory bandwidth. It also achieves a power efficiency of better than 2 Gflops/W.
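The stream idea of kernels operating on record streams can be sketched in software. This is only an analogy, not Imagine's programming model: the two kernels, the pixel records, and the cutoff are hypothetical, and Python generators stand in for the on-chip storage that passes intermediate results from one kernel to the next without a round trip to external memory.

```python
def luminance(pixels):
    """Kernel 1: map each (r, g, b) record to one integer luminance value."""
    for r, g, b in pixels:
        yield (299 * r + 587 * g + 114 * b) // 1000


def threshold(values, cutoff):
    """Kernel 2: map each luminance value to 1 (bright) or 0 (dark)."""
    for v in values:
        yield 1 if v >= cutoff else 0


def run_pipeline(pixels, cutoff=128):
    # Records flow kernel to kernel one at a time; the intermediate
    # luminance stream is never stored as a whole.
    return list(threshold(luminance(pixels), cutoff))


# The parallelism figures quoted in the text.
ILP_WIDTH = 6    # six-wide instruction-level parallelism
DATA_WIDTH = 8   # eight-wide data parallelism
TOTAL_ARITHMETIC_UNITS = ILP_WIDTH * DATA_WIDTH  # 48

result = run_pipeline([(255, 255, 255), (0, 0, 0), (200, 100, 50)])
```

Because each kernel touches every record independently, the same kernel could in principle run on eight records at once, which is the data parallelism the text describes.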
Opris et al. describe the input side of a digital imaging system—a 12-bit, 50-Mpixel/s analog front-end processor. The authors explain that high-resolution images look better than low-resolution images only if they have an adequate signal-to-noise ratio, which requires a high dynamic-range front end. High-quality images also require analog white balance to avoid color noise due to different quantization errors on the different color channels. The chip developed by Opris and his colleagues meets these challenges using analog processing to perform white balance and to independently scale each color before analog-to-digital conversion. The result is a high-performance front end that dissipates little power—only 150 mW.
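A short numerical sketch suggests why scaling a color channel before conversion helps. It assumes an idealized 12-bit converter and a hypothetical weak channel that needs a gain of 4; none of this models the actual circuit. Applying the gain digitally, after conversion, multiplies the quantization error by the gain, while applying it in the analog domain, before conversion, keeps the error within half a quantization step.

```python
import math

BITS = 12
LEVELS = 2 ** BITS  # 4096 output codes


def adc(x):
    """Idealized converter: x in [0, 1) to the midpoint of its code step."""
    code = min(LEVELS - 1, max(0, math.floor(x * LEVELS)))
    return (code + 0.5) / LEVELS


def digital_white_balance(x, gain):
    # Convert first, then scale: the quantization error is scaled too.
    return adc(x) * gain


def analog_white_balance(x, gain):
    # Scale first, then convert: the error stays within half a step.
    return adc(x * gain)


def worst_case_error(balance, gain, samples=10007):
    """Largest error against the ideal x * gain over a sweep of inputs."""
    worst = 0.0
    for i in range(samples):
        x = (i / samples) / gain  # weak channel spans only [0, 1/gain)
        worst = max(worst, abs(balance(x, gain) - x * gain))
    return worst


GAIN = 4.0  # hypothetical gain for the weakest color channel
digital_error = worst_case_error(digital_white_balance, GAIN)
analog_error = worst_case_error(analog_white_balance, GAIN)
```

In this model the digitally balanced channel ends up with a noticeably larger worst-case error than the analog-balanced one, which is the kind of mismatch between color channels that shows up as color noise.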
Tremaine et al. apply data compression technology to a computer system's main memory. To implement the compression, the processor incorporates a memory controller that sits between a shared cache and the main memory. The processor fetches decompressed data from the cache, and the controller compresses and decompresses blocks of data as they move between the cache and main memory. On realistic workloads, this design demonstrates compression ratios of 2:1 to 6:1. In addition to the obvious benefit of appearing to have more main memory, performance advantages accrue because the processor can bring compressed data on chip more quickly.
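The controller's role can be sketched in a few lines. This is a software analogy under stated assumptions, not the hardware design: the `CompressedMemory` class and the 1-Kbyte block size are hypothetical, and Python's general-purpose `zlib` compressor stands in for whatever algorithm the real controller uses.

```python
import zlib

BLOCK_SIZE = 1024  # hypothetical block size in bytes


class CompressedMemory:
    """Toy main memory that stores every block in compressed form."""

    def __init__(self):
        self.blocks = {}  # block number -> compressed bytes

    def write_block(self, number, data):
        # Cache to memory: compress on the way out.
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self.blocks[number] = zlib.compress(data)

    def read_block(self, number):
        # Memory to cache: decompress on the way in.
        return zlib.decompress(self.blocks[number])

    def compression_ratio(self):
        stored = sum(len(b) for b in self.blocks.values())
        return len(self.blocks) * BLOCK_SIZE / stored


memory = CompressedMemory()
block = (b"hot chips " * 103)[:BLOCK_SIZE]  # highly repetitive sample data
memory.write_block(0, block)
round_trip = memory.read_block(0)
ratio = memory.compression_ratio()
```

The cache side always sees ordinary uncompressed blocks; only the storage behind the controller shrinks. The ratio for this repetitive sample is far higher than the 2:1 to 6:1 reported for realistic workloads and should not be read as representative.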
Space limitations prevent us from including more of the superb presentations from Hot Chips 12. Most of these presentations, as well as others from previous years, are available at www.hotchips.org. We hope you find these articles as exciting as we do.
William J. Dally is a professor of electrical engineering and computer science at Stanford University. His Stanford group developed low-power, high-speed signaling technology and the Imagine processor. Earlier, at the Massachusetts Institute of Technology, he and his group built the J- and M-Machine experimental parallel computer systems. As a research assistant and research fellow at Caltech, he designed the Mossim simulation engine and the Torus routing chip. Dally has a BS in electrical engineering from Virginia Polytechnic Institute, an MS in electrical engineering from Stanford University, and a PhD in computer science from Caltech. He is a member of the IEEE and the ACM. He was a Hot Chips 12 conference co-chair.
Marc Tremblay is a senior distinguished engineer at Sun Microsystems and chief architect for the company's Processor Products Group. He was coarchitect for UltraSparc I and II, and chief architect for MAJC, a microprocessor group focusing on new-media applications and broadband services. He also had a role in developing the picoJava processor core. Tremblay has an MS and PhD in computer science from the University of California, Los Angeles, and a BS in physics engineering from Laval University, Canada. He holds 29 patents and has 60 more outstanding in various areas of computer architecture. He was a Hot Chips 12 conference co-chair.
Allen J. Baum works in the Alpha Development Group at Compaq. Previously, he worked on the processor and systems architecture of the StrongArm SA1500, Apple's Aquarius RISC, the ARM8, and Hewlett-Packard's original PA RISC. He was also architect of the Apple II I/O system and coauthor of its monitor ROM. He holds 19 patents in the area of processor architectures. Baum received his BS and MS degrees in electrical engineering from the Massachusetts Institute of Technology. He is a member of the IEEE and the ACM. He was a Hot Chips 12 conference co-chair.