Issue No. 06 - November/December (2005 vol. 25)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MM.2005.112
Sarita V. Adve, University of Illinois at Urbana-Champaign
Pia Sanda, IBM
It is becoming increasingly difficult to achieve expected levels of reliability and data correctness as the industry approaches the era of extreme CMOS scaling. Aging-related device degradation is becoming a real threat to lifetime reliability. Many processors already include solutions for soft errors in memory structures. More recently, soft errors in logic paths have become an increasing concern. Process variability effects are challenging conventional design as the law of large numbers is no longer useful for describing device behavior, and design based on worst-case paths becomes impractical. Impending problems with burn-in and timing effects from temperature are additional threats to reliable operation.
In the past, architects have often left reliability concerns to lower levels of the system stack. As the severity of these problems increases, however, low-level solutions will likely not suffice, and straightforward system-level solutions such as blind redundancy will likely be too expensive for most market segments. Architects will therefore need to make reliability a first-class design constraint and develop new cost-effective approaches to reliability-aware design. Compared to device and circuit-level solutions, architecture-level solutions can more easily exploit application-specific behavior. For example, for common applications, a large fraction of raw soft errors are masked at the architecture level, potentially allowing for lower-cost solutions at this level. As another example, architecture-level solutions offer the opportunity for application-driven dynamic lifetime reliability management, allowing an optimal distribution of failure rates in time and space.
More broadly, the cross-cutting nature of the reliability problem will demand solutions that span traditional boundaries across the entire system stack, including the hardware layers, operating system, and applications. Such cross-layer solutions are likely to expose opportunities that are as yet unexplored. For example, the operating system could make a reliability-aware resource allocation among applications, and the hardware could exploit semantic information about the inherent error resilience in many media applications. Fresh research approaches are required to design such cross-layer solutions at a reasonable cost.
In addition to the inherent challenges for reliability-constrained design, extreme CMOS scaling also imposes energy, thermal, and other related constraints. Any reliability-aware design must naturally take these into account in addition to considering performance and cost. The specific objective function that requires optimization in this multidimensional problem space will depend on the market segment. Nevertheless, a new framework is required in which to perform such optimizations, particularly in the context of cross-layer solutions.
This special issue of IEEE Micro contains six articles that address these trends, their system-level implications, and some innovative ideas on how to build systems in the face of these challenges.
The first two articles, by Borkar and Iyer et al., are invited. Borkar discusses the different sources of variability and errors, and their trends going forward. Iyer et al. survey the current state of the art in soft-error mitigation techniques. Saggese et al. present an experimental study to understand how injected faults are masked at the architectural and application levels. The article by Lu et al. discusses a technique for dynamic lifetime reliability management, where the temperature is dynamically adjusted to reduce the processor failure rate. The article by Gold et al. concerns a practical system-level solution for distributed shared-memory servers; this solution obtains the benefits of lockstep redundant computation without incurring all of the associated costs. Finally, the article by Rashid et al. proposes a redundant multithreading solution that accounts for power with minimal performance impact. A seventh article, by Stasiak et al., is a nontheme article on the Cell processor, an excellent contribution held over from the September-October issue of IEEE Micro on energy-efficient design.
The six theme articles published in this issue illustrate the recent solid gains in the field of dependable computing using reliability-challenged technology. We faced the difficult task of selecting exemplary articles out of the many excellent submissions. In the future, Micro will publish several submissions that missed this issue.
We thank our anonymous reviewers for their timely reviews and Sheila Clark for her administrative assistance.
Sarita V. Adve is a professor of computer science at the University of Illinois at Urbana-Champaign. Her research interests include energy-, temperature-, and reliability-aware computer architectures and systems. Adve has a PhD in computer science from the University of Wisconsin-Madison. She is a recipient of an Alfred P. Sloan Fellowship; a University Scholar of the University of Illinois; and a member of the IEEE, IEEE CS, and ACM SIGArch.
Pia Sanda is a senior technical staff member with the IBM Systems and Technology Group. Her research interests include reliability- and manufacturability-aware computing. Sanda is one of the founders of PICA (Picosecond Imaging Circuit Analysis) and has implemented PICA for assuring robust microprocessors. She is currently studying the system effects of logic soft errors. She holds patents in PICA, accelerated soft-error testing, design algorithms for phase shift masks, and silicon device processing. Sanda has a PhD in physics and a BS in engineering, both from Cornell University. She is a senior member of the IEEE.