Pages: 6-9
Energy efficiency has been a key design constraint for microprocessor development teams since the late 1990s. The fundamental technological issues that have led to this point are by now well understood by industry and academia. Although both the active (dynamic) and passive (standby) components of the net power equation are of concern, in recent years the passive (leakage) component of chip power has been escalating at a much faster rate than active power. In fact, as we write, leakage power has almost equaled active power in the total power breakdown of a typical microprocessor. This means that a 100-W chip in today's technology will burn about 50 W with just the power on and no program running! And, by the way, the highest-performance, multicore, server-class processors are already close to 200 W in maximum power consumption! This equates to power densities at the very edge of what air-cooled systems can handle. So, without investing in liquid-cooled systems and their corresponding packaging (at significantly higher cost), high-end microprocessors targeted for traditional air-cooled server boxes are pretty much at the end of the road without a major paradigm shift in design and/or packaging technology. In fact, such a basic paradigm shift has been in the works for a few years, beginning, for example, with the introduction of IBM's dual-core Power4 chip in 2000. The industry in general has recently made a clear shift toward lower-frequency, multicore architectures for general-purpose high-performance microprocessors. Intel, AMD, and Sun Microsystems have all announced future product roadmaps that embrace the multicore paradigm.
Although such a shift has enabled design groups to keep going for a little while, the need for building power efficiency into the chip's noncore components continues unabated. We have also recently witnessed a trend toward finer levels of clock gating in all designs, and the increasing use of power-gated modes to reduce leakage. Academic research in low-power design techniques has evolved from lower-level issues related to the underlying device and circuit technologies to higher-level knobs available in the microarchitecture, architecture, and even the application and software layers. In addition to the established International Symposium on Low Power Electronics and Design (ISLPED), other smaller conferences and workshops have emerged to highlight the latest research, especially at the architecture and design levels.
That's why IEEE Micro, in putting together a special issue on this important theme, decided to focus on two recently held (and relatively new) conferences: Cool Chips VIII, held April 2005 in Yokohama, Japan, and the fourth annual Austin Conference on Energy Efficient Design (ACEED), held March 2005 in Austin, Texas. (Both conferences are organized around presentations by speakers, without formal proceedings of written papers.) As our readers might know, Cool Chips covers talks on exciting new processor products or test chips that have "low power" as a primary constraint; ACEED, on the other hand, deals more with general research topics in this field. As technical program chairs for these two conferences, we teamed up to organize this theme issue, inviting an initial set of articles from selected speakers and then screening those into a set of six final choices through a due review process, per IEEE Micro guidelines. We initially received a total of 17 submissions (10 from Cool Chips and seven from ACEED). Each article received at least two independent reviews, and many received more. After the review deadline, we guest editors made initial recommendations; one of us (Pradip Bose, who is also editor in chief of IEEE Micro) then made the final decisions about inclusion in the theme issue.
Three of the six selected articles turned out to be related to the Cell processor developed by Sony, Toshiba, and IBM. As Peter Hofstee of IBM describes in the "First-Generation Cell Processor" sidebar, the Cell chip was designed as a heterogeneous, multicore system-on-chip, implemented in custom CMOS silicon-on-insulator technology, and targeted to achieve leadership frequency and performance at affordable power for the game market. The challenge of delivering such high performance at a power level that makes it possible to use the processor chip in a game console or set-top system is understandably very steep.
In the article by Takahashi et al., the authors describe the power-aware design principles behind each of the synergistic processor element (SPE) cores within the full chip. In their article, Maeda et al. deal with some of the challenging issues in the programming model, paying attention to power conservation in a real-time computing scenario. In the article by Asano et al., we find a treatment of the low-level design issues in achieving a high-performance SRAM at low power (again, for the SPE cores within Cell).
The other three selections in this issue are research articles, reflecting some of the leading-edge academic work in power-aware microarchitecture design. In their article on duration prediction, Isci et al. address the problem of predicting the length (or time duration) of each distinct phase of an executing workload. In a setting where dynamic voltage and frequency scaling (DVFS) helps manage power efficiently, the ability to accurately predict a phase's duration enables precise deployment of the underlying voltage-change mechanism at low overhead cost. In the article on formal control techniques, Wu et al. take up a topic of significant current interest in energy-efficient design: on-chip adaptive control techniques and algorithms as a basic mechanism for managing power in the presence of changing workload demands, an area that has received considerable coverage in recent academic research. The authors address the formal, control-theoretic aspects of such mechanisms, pointing the community toward an era of mathematically provable, robust control algorithms.
Finally, the article by Marculescu et al. touches on a topic of increasing importance to the chip design community: uncertainty in design caused by the increased variability and failure rates of component devices and building blocks. How this emerging constraint interacts with the now well-known constraint of power dissipation limits is the interesting subject of this article.
We hope you enjoy this theme issue on energy-efficient design. We would like to thank all the anonymous reviewers who helped us select this excellent collection of articles from two very relevant and interesting conferences. We are also grateful to all the authors who took the time and effort to submit written versions of their original talks to this special issue.
The Cell Broadband Engine processor is the product of collaboration among IBM, Sony, and Toshiba. The three companies were looking at the future market for entertainment applications that would rely on real-time multimedia processing and broadband Internet communication. Sony set a goal for future applications to eventually run at 1,000 times the performance of Sony's PlayStation 2 Emotion Engine processor, developed with Toshiba.1,2 As a first step, the first-generation Cell's objective was to achieve 100 times PlayStation 2's performance. In summer 2000, participants at a critical meeting determined that a conventional organization, and even a homogeneous chip multiprocessor, would not deliver sufficient computational power.3
Besides increased performance, Cell had to consume much less power per operation to realize the performance improvements while meeting power constraints. Because Cell is targeted at devices and systems in a multi-standard interconnected world, flexibility and programmability are key aspects. Therefore, it was not possible to improve efficiency by simple specialization and limiting the application domain. It is these opposing needs—high performance, high efficiency, and good flexibility and programmability—that made designing Cell an interesting challenge.

Choosing an architectural organization
In choosing the basic organization for the Cell processor, the design team considered the several hurdles to improvements in single-thread performance. Such improvements are increasingly difficult to come by because of the well-known memory and frequency walls.
As a result of these factors, single-thread performance growth has become so challenged that the industry is switching en masse to multicore arrangements, which allow improved performance per watt on applications that can execute in parallel. Multiple cores also permit more memory requests in flight at the same time per chip, but they do not address the inefficiencies that arise within each core as a result of the memory and frequency walls. Heterogeneous multicore designs permit specialization and further increased efficiencies.

Basic Cell Broadband Engine Architecture
Because a shared, coherent system address space was deemed essential for programmability, and because efficient symmetric multiprocessor (SMP) architectures take a long time to develop, the initial architecture team decided to build the Cell Broadband Engine Architecture (http://www.ibm.com/developerworks/power/cell) on a system organization inherited from the Power Architecture (http://www.ibm.com/developerworks/eserver/library/es-archguide-v2.html). The Power Processor Element (PPE) implements the 64-bit Power instruction set architecture and provides the operating system and control functions. Accompanying the PPE are processors optimized to run applications. We chose to call these Synergistic Processor Elements (SPEs) because we designed them to have a mutual dependence on the PPE, working in harmony to perform tasks more efficiently than either type of processing element could alone.
Other key aspects of Cell include enhanced real-time controls to allow real-time and non-real-time operating systems and applications to run concurrently, and a hardware architecture to support privacy, security, and digital-rights-management applications.

PPE
The PPE is a 64-bit Power processor optimized for performance per watt and performance per area while matching the frequency of the SPEs. Designers implemented the PPE as a dual-threaded core, and it includes the floating-point and vector media extensions of the Power architecture. The processor contains 32-Kbyte instruction and data caches, a 512-Kbyte L2 cache, and on-chip bus interface logic. A new, ground-up implementation, this core has an extended pipeline to achieve a low fan-out-of-4 (FO4) delay per stage, matching the SPEs, and it is an enhanced in-order design. To support real-time operations, the PPE was extended with resource-management tables for the L2 and translation caches.

SPE
With the PPE handling the operating system and other control-intensive tasks, we could design the SPEs for efficient general-purpose application processing. The SPE instruction set combines all data types in a single 128-entry, 128-bit register file. Special-purpose registers (link, count, and condition registers) are unified with this register file as well. With 128 registers, the SPE tolerates deep pipelines better than conventional processors, and a streamlined microarchitecture further helps in attacking the frequency wall. However, for applications that fit in the local store, from a programmer's perspective the processor is not fundamentally different from other scalar processors with single-instruction, multiple-data (SIMD) extensions.
The most fundamental aspect of the SPE is that it manages two levels of store (registers and local-store memory) in software. Because direct memory access (DMA) transfers between system memory and local-store memory are asynchronous (a fundamental break with sequential semantics), SPEs can have many main-memory accesses in flight without resorting to speculation. DMA transactions use standard Power effective addressing to refer to system memory. These addresses are translated into a virtual and finally a real address using the standard Power architecture page and segment table caches. Like Power loads and stores, DMAs are coherent in the system. The DMA unit can also fetch DMA commands from the local-store memory, thus acting as a separate data-moving processor. Although programmers can use conventional programming models and languages with appropriate compiler support, we anticipate that the asynchronous streaming-DMA aspect of the SPEs will also lead to the introduction of new programming paradigms.
The SPE ISA includes a software branch-hint instruction, allowing implementations of the architecture with minimal hardware branch-prediction support. The implementation of the SPE on Cell realizes a processor with leading performance on compute-intensive applications in just 10 mm² (15 mm² including the DMA unit). We optimized the current implementation of the SPEs for integer and single-precision floating-point computation. With a dual-threaded PPE and eight SPEs, the first-generation Cell Broadband Engine processor is capable of 10 simultaneously executing threads (18 including the DMA units) and over 100 outstanding memory requests.
The high degree of software control in the SPE is a double-edged sword. On the one hand, because we have largely removed speculative mechanisms from the implementation, the behavior of the SPE is highly predictable, which is a boon for real-time programming and programmers, including compiler writers, who want to optimize their code. At the same time, software management of the local-store memory and branches presents a challenge to compilers.
Although Cell operates in excess of 4 GHz under laboratory conditions, the high-frequency, efficient design of the processor, made possible by the operating-system and application specialization of the cores, is intended to allow the most efficient operation at minimum operating voltage. The Cell Broadband Engine processor is 235 mm² in 90-nm SOI (silicon-on-insulator) technology. At its minimum operating voltage, it dissipates power comparable to high-end PC processors and operates in excess of 3 GHz, but in many cases it delivers an order of magnitude more performance than conventional processors, and in some cases even more. Future work will focus primarily on broadening the reach of the architecture by pursuing implementations with reduced power and die size.