Guest Editors' Introduction: Reliability Challenges in Nano-CMOS Design
November/December 2009 (Vol. 26, No. 6), pp. 6-7. 0740-7475/09/$26.00 © 2009 IEEE. Published by the IEEE Computer Society.
VLSI design in the late CMOS era is driven by an ever-increasing challenge to cope with unreliable components at the device, circuit, and system levels. Reliability challenges include bias-temperature instability (BTI), dielectric breakdown, early-life failure, and soft errors, as well as their interaction with statistical process variation. Design and test solutions at the 32-nm node and beyond need to resolve these reliability issues. The impact of unreliability must be managed at various levels of the design abstraction: at the circuit, logic, microarchitecture, and system levels, depending on the nature and degree of error manifestation starting at the physical level.

This special issue addresses the problem of design for reliability at the 32-nm node and beyond, in the context of the emerging threat of progressively unreliable components used in VLSI chip design. Six articles highlight a few notable R&D ventures across academia and industry that are pursuing leading-edge design solutions and analysis techniques to cope with the problem.

Today's increasingly stringent requirements on both performance and energy efficiency render the traditional approach of simple guard-banding ineffective.
Innovative techniques that recognize the unique properties of emerging reliability mechanisms and adaptively protect the system are therefore increasingly favored in design exploration. Meanwhile, new research on efficient simulation and fault diagnosis techniques is gaining prominence as a way to balance reliability challenges with other performance metrics.

At the device level, reliability issues manifest themselves as the temporal shift of transistor and interconnect parameters. The first two articles present the leading physical mechanisms and their impact in the late silicon era. As gate dielectrics become thinner than 2 nm, BTI in MOSFETs becomes the major reliability challenge, reducing the lifetime of SRAM storage cells or even causing them to fail catastrophically. The first article, by Sang Phill Park et al., quantifies the severity of BTI degradation in logic and on-chip memory arrays. With predictive device models, they identify the SRAM cell as the most vulnerable circuit unit under BTI. At 22 nm, the impact of BTI could exceed that of random process variations.

Besides transistors, scaled interconnects also face rapidly increasing reliability concerns, especially after the introduction of new materials (e.g., low-k dielectrics). These new materials enhance wire performance but degrade thermal and mechanical stability. The situation is compounded by variations in line geometry that increase the failure probability. The second article, by Muhammad Bashir and Linda Milor, addresses one of the emerging wearout issues: low-k dielectric breakdown. Statistical data are collected from a 45-nm test chip to construct an area-scalable model. The results enable designers to consider the failure rate in metal wires. Together, these two articles complete the investigation into the technology aspects of aging and wearout.

Reliability issues that start from the technology level must be propagated into circuit and system levels for design protection.
Along this track, the first step is to develop failure detection techniques that accurately sense errors with minimum cost. The third article, by Yanjing Li et al., reviews three detection techniques: circuits that indicate early-life failures in gate dielectrics, concurrent autonomous chip diagnostics using stored test patterns, and sensing and optimization for self-healing under aging. These techniques are built on the specific properties of the underlying reliability mechanisms. They minimize the cost associated with traditional concurrent error detection across multiple layers of abstraction.

Error detection techniques provide the essential information that allows further diagnosis and control, leading to better system robustness. This information can be integrated with adaptive design techniques, such as dynamic voltage and frequency scaling, in order to adjust the system and reduce the error rate at runtime. The fourth article, "Sensor-Driven Reliability and Wearout Management," by Prashant Singh et al., demonstrates such dynamic reliability management by implementing a set of aging sensors for gate oxide degradation. The sensor acts as a canary circuit and reports the degradation rate under realistic operating conditions. The sensor's compact design supports a large-scale implementation to account for statistical process, voltage, and temperature variations across the chip. The entire approach effectively reduces the reliability margins and improves system performance.

The successful management of reliable functionality requires a seamless integration of device-, circuit-, and system-level knowledge. The goal is to abstract errors to the architecture level by appropriately injecting lower-level errors into the system-level simulator, and thereby to provide the most cost-efficient solution. This task is especially challenging for the design of embedded systems and SoCs, because of the complexity of these contemporary systems.
The fifth article, by Dongwoo Lee and Jongwhoa Na, presents a novel simulation-based fault injection methodology to evaluate the dependability of embedded hardware. Unlike traditional methods, which are routinely applied at the register-transfer (RT) level, the new method is applied at the system description level. It achieves shorter simulation time without compromising convergence.

Our last article in this special issue, by Jude A. Rivers and Prabhakar Kudva, presents an overview of reliability challenges at the component level. Although CMOS technology in the late silicon era brings growing reliability concerns, various levels of protection have been proposed and will be incorporated into future microprocessor chips and associated systems. This article describes techniques and methodologies that are valuable for achieving error detection and correction in a complex, high-performance system. These techniques provide protection against major types of errors while delivering performance and managing power and complexity.

At this early stage of design for reliability (DFR), these six articles illustrate the essential needs and diverse opportunities contributing to future resilient systems. It is our hope that this special issue will promote research activities in this broad area, ranging from technology modeling, circuit hardening, fault analysis and abstraction, and failure prediction to system adaptation. Although diagnosis tools at various levels provide the vital basis for resilient design, DFR's success in delivering reliable yet efficient systems depends on effectively managing the trade-offs inherent in prediction and protection of system functionality.

Yu Cao is an associate professor of electrical engineering at Arizona State University. His research interests include physical modeling of nanoscale technologies, design solutions for variability and reliability, and reliable integration of postsilicon technologies.
He has a PhD in electrical engineering from the University of California, Berkeley.

Jim Tschanz is a research scientist at Intel's Circuit Research Lab in Hillsboro, Oregon. His research interests include circuit techniques for managing static and dynamic variations, and design for reliability. He has an MS in electrical engineering from the University of Illinois at Urbana-Champaign.

Pradip Bose is a research staff member and manager at IBM Thomas J. Watson Research Center in Yorktown Heights, New York. His research interests include power-efficient design of reliable microprocessor architectures and associated presilicon modeling methodologies. He has a PhD in electrical and computer engineering from the University of Illinois at Urbana-Champaign.