Microprocessor Test and Reliability Challenges
Guest Editor's Introduction • Cecilia Metra, University of Bologna • December 2013
Translated by Osvaldo Perez and Tiejun Huang
Microelectronic technology's ongoing scaling according to Moore's law enables us to keep increasing microprocessor performance and complexity, thus paving the way for innovative applications that were unthinkable just a few years ago. However, that same shrinking feature size poses new challenges for testing and reliability of high-performance microprocessors. The December Computing Now theme examines some of these challenges, as well as some approaches to solving them.
Challenges in a New Era
Shrinking feature size increases the likelihood of defects and parametric variations during fabrication, thus posing new challenges for testing. Traditional test techniques such as burn-in are also becoming increasingly difficult because of power and voltage constraints, and might soon become ineffective or infeasible. Burn-in's limited effectiveness in activating faults that are likely to occur in the first years of circuit operation in the field, together with the scaling of transistors' gate insulators, is making aging phenomena more likely to affect circuit operation, possibly compromising correct operation and consequently increasing reliability risks. Negative-bias temperature instability (NBTI), in particular, is becoming a major concern. NBTI is characterized by an increase in the absolute value of the pMOS (p-channel metal-oxide semiconductor) transistor threshold voltage, mainly due to the creation of positively charged interface traps when the transistor is biased in strong inversion. As a result, circuit performance is degraded, with possible consequent incorrect system operation in the field.
Moreover, the reduced feature size, together with reduced power supply voltages and noise margins, makes integrated circuits more vulnerable to environmentally induced faults, such as transient faults due to particle hits (for example, alpha particles and neutrons). When a transient fault affects a sampling element, or propagates to the input of a sampling element and gets latched, an output logic error may be generated; this is generally referred to as a soft error.
In addition, increased system complexity together with decreased transistor-conduction threshold makes power consumption (and consequently, power management in the field) a critical issue. Power supply noise becomes increasingly likely, and it becomes challenging to identify the minimal power supply voltage value that will let the system operate correctly (with no reliability risks) at limited power consumption. Although several approaches have been developed in the past 50 years to guarantee the successful test and reliable operation of integrated circuits for mission-critical applications in areas such as space, military, automotive, medical, and banking, direct application of those approaches is infeasible for mainstream applications, for which cost is an important factor. Innovative, low-cost analysis, modeling, test, and design approaches are therefore needed to face the ongoing test and reliability challenges of high-performance microprocessors.
The six articles in this month's theme provide a comprehensive reference for the theoretical and practical aspects of innovative testing approaches, reliability analysis, and improvement techniques for high-performance microprocessors.
This theme opens with "New Design for Testability Approach for Clock Fault Testing," an article I coauthored with Martin Omaña, T.M. Mak, and Simon Tam. It introduces an approach for identifying faults that can occur during fabrication, affecting clock-distribution network signals in high-performance microprocessors and potentially compromising reliable operation in the field. Other researchers have shown that conventional testing strategies can't guarantee the detection of such faults, but we found that simple modifications to conventional clock buffers in high-performance microprocessors can force clock faults to result in clock stuck-at faults (making the clock signal constantly equal to Vdd or ground), which are easy to identify with any conventional testing strategy. A further proposed clock-buffer modification could also enable calibration after fabrication to compensate for parameter variations introduced during manufacturing. The proposed approach is suitable for both global and local clock buffers, and it implies only a small increase in the area and power consumption required for clock buffers, with no additional test cost or impact on microprocessor performance or in-field operation.
In "Automated Stressmark Generation for Testing Processor Voltage Fluctuations," Youngtaek Kim and colleagues analyze the problem of power management in the field. The article addresses the problem of analyzing voltage fluctuations that occur during a microprocessor's normal operation due to variations in the current that different parts of the code consume. The authors propose an approach to automatically generate proper benchmarks to evaluate recent multicore x86-64 processors' susceptibility to such voltage fluctuations.
Next up, Charles R. Lefurgy and his colleagues propose a solution to the problem of power management in the field. "Active Guardband Management in Power7+ to Save Energy and Maintain Reliability" presents an approach for adjusting processor voltage margins to save energy during low-temperature and low-activity operation periods, thus reducing power consumption while guaranteeing reliable operation (within performance constraints) during high-activity operation periods. To track workload needs, they adjust the voltage margins, which are usually adopted to compensate primarily for temperature and voltage changes resulting from different workloads, as well as variables such as test inaccuracy and aging. The authors have verified their proposed approach on prototype systems with Power7 and Power7+ chips (see the article for references), showing that it can be successfully adopted for energy-efficient operation.
"Statistical Reliability Estimation of Microprocessor-Based Systems" by Alessandro Savino and his colleagues analyzes the problem of estimating microprocessor-based systems' reliability against soft errors. The article proposes a probabilistic approach to evaluate a microprocessor's reliability while running a given workload. The authors begin by characterizing the microprocessor according to its probable success in executing each instruction in its instruction set architecture and then complete a fast analysis to evaluate the probability of successful execution in case of soft errors. They have evaluated this approach on the Intel 8088 and OpenRISC1200 microprocessors.
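The idea of combining per-instruction success probabilities into a workload-level figure can be illustrated with a minimal sketch. This is not the authors' method: it assumes that instruction executions fail independently, and the probability table, trace, and function name below are hypothetical values chosen only for illustration.

```python
import math
from collections import Counter


def workload_reliability(instruction_trace, success_prob):
    """Estimate the probability that every instruction in a trace executes
    correctly, ASSUMING independent per-instruction failures.

    The product of probabilities is accumulated in log space so that long
    traces do not lose precision through repeated multiplication.
    """
    counts = Counter(instruction_trace)
    log_reliability = sum(
        n * math.log(success_prob[opcode]) for opcode, n in counts.items()
    )
    return math.exp(log_reliability)


# Hypothetical per-instruction success probabilities (not measured data).
SUCCESS_PROB = {
    "load": 0.999995,
    "store": 0.999996,
    "add": 0.999999,
    "branch": 0.999998,
}

# Hypothetical workload: a short loop body repeated 1,000 times.
trace = ["load", "add", "add", "store", "branch"] * 1000

reliability = workload_reliability(trace, SUCCESS_PROB)
print(f"Estimated workload reliability: {reliability:.6f}")
```

Under the independence assumption, the estimate depends only on how often each instruction appears in the trace, which is what makes this kind of analysis fast once the processor has been characterized.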
In "Analysis of Error Masking and Restoring Properties of Sequential Circuits," Jinghang Liang, Jie Han, and Fabrizio Lombardi address the problem of possible error masking in complex sequential circuits — that is, the logic-masking effect imposed on the feedback signals by specific combinations of primary inputs, thus potentially eliminating the cumulative effect of soft errors. The authors use state-transition matrices and binary decision diagrams in a finite-state machine model to extensively analyze error masking. They have validated the proposed approach via simulations performed on sequential benchmark circuits, which showed attractive features that, albeit beyond the article's scope, could be exploited to improve the reliable operation of sequential circuits.
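A toy simulation can make the logic-masking effect concrete. The sketch below does not use the state-transition matrices or binary decision diagrams from the article; it simply runs a fault-free and a faulty copy of a hypothetical one-bit finite-state machine side by side and reports when a primary input masks the error on the feedback signal. The machine and function names are assumptions made for illustration.

```python
def next_state(state, x):
    """Hypothetical 1-bit FSM: the feedback signal is ANDed with the
    primary input x, so x == 0 forces the next state to 0."""
    return state & x


def cycles_until_masked(initial_state, inputs):
    """Flip the state bit (a modeled soft error), then apply the same
    inputs to a golden and a faulty copy of the FSM.

    Returns the 1-based cycle in which both copies hold the same state
    again, or None if the error is never masked by the given inputs.
    """
    golden = initial_state
    faulty = initial_state ^ 1  # single-bit upset in the state register
    for cycle, x in enumerate(inputs, start=1):
        golden = next_state(golden, x)
        faulty = next_state(faulty, x)
        if golden == faulty:
            return cycle
    return None


# The error persists while x == 1 and is masked as soon as x == 0.
print(cycles_until_masked(1, [1, 1, 0, 1]))  # masked in cycle 3
print(cycles_until_masked(1, [1, 1, 1]))     # never masked: None
```

In this toy machine, an input of 0 restores the correct state regardless of the corrupted feedback value, so the soft error cannot accumulate past that cycle.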
Finally, Martin Omaña and his colleagues address the problem of monitoring NBTI and keeping it from compromising correct system operation in "Low Cost NBTI Degradation Detection & Masking Approaches." The article proposes two monitoring approaches to detect late transitions (due to NBTI) of signals on timing-critical data paths, as well as two techniques to prevent such late transitions from causing the flip-flops at the end of those data paths to sample erroneous data. In the low area and power (LAP) approach, a monitoring circuit raises an alarm when NBTI causes late signal transitions on timing-critical data paths that could lead the flip-flops at the end of those paths to sample incorrect data. The alarm message activates a clock-frequency adaptation phase, avoiding the generation of incorrect data at the outputs of such critical timing paths. This approach features lower area overhead and lower (or comparable) power consumption than previous alternative approaches, while having the same impact on performance. The other proposed approach, denoted as high performance (HP), consists of a monitoring circuit that can overwrite the incorrect data produced at the critical timing path outputs. This HP approach reduces the impact on system performance compared to previous alternatives, at the cost of some increase in area and power consumption.
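The alarm-then-adapt behavior described for the LAP approach can be sketched at a purely behavioral level. The following is an assumption-laden illustration, not the circuit from the article: it treats a transition as late when it arrives inside a hypothetical guard window before the sampling clock edge, and it models clock-frequency adaptation as lengthening the clock period by a fixed step. All names and numbers are invented for the example, and times are integers in picoseconds.

```python
def run_with_late_transition_monitor(path_delays_ps, period_ps, guard_ps, step_ps):
    """Behavioral model of a late-transition monitor with clock adaptation.

    path_delays_ps: critical-path delay observed at successive points in
                    the circuit's lifetime (growing as NBTI degrades it).
    A transition counts as late when it arrives after
    (period_ps - guard_ps); each alarm lengthens the clock period by
    step_ps, that is, it lowers the clock frequency.

    Returns (number_of_alarms, final_period_ps).
    """
    alarms = 0
    for delay in path_delays_ps:
        # Keep slowing the clock until the transition is no longer late.
        while delay > period_ps - guard_ps:
            alarms += 1
            period_ps += step_ps
    return alarms, period_ps


# Hypothetical aging profile: the critical-path delay drifts upward.
delays = [800, 850, 920, 970]
alarms, final_period = run_with_late_transition_monitor(
    delays, period_ps=1000, guard_ps=100, step_ps=50
)
print(alarms, final_period)  # 2 alarms, final period 1100 ps
```

The sketch shows why such a scheme costs performance only when degradation is actually observed: the clock slows in response to alarms rather than carrying a worst-case margin from the start.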
This month's theme also includes videos in which three industry experts provide deep technical insight into these issues (in alphabetical order):
Rob Aitken, from ARM;
Erik Altman, from IBM; and
Bill Eklow, from Cisco.
We hope that this issue of Computing Now serves as a resource to highlight the major challenges in microprocessor test and reliability and stimulates further research in the field.
C. Metra, "Microprocessor Test and Reliability Challenges," Computing Now, vol. 6, no. 12, Dec. 2013, IEEE Computer Society [online]; http://www.computer.org/web/computingnow/archive/december2013.
Cecilia Metra is editor-in-chief of Computing Now. She is a full professor in electronics at the University of Bologna, Italy, where she has worked since 1991, and where she received a PhD in Electronic Engineering and Computer Science. She is a member of the IEEE Computer Society Board of Governors for 2013-2015, and Vice-President for Technical & Conference Activities of the IEEE Computer Society for 2014. She was associate editor-in-chief of IEEE Transactions on Computers. She is on the editorial boards of several professional journals and has been involved in numerous IEEE-sponsored conferences, symposia, and workshops, serving as general or program chair or co-chair 14 times, as topic or track chair 28 times, and as technical program committee member 74 times. In 2002, she was a visiting faculty consultant for Intel in the US. Her research interests are in the field of design and test of digital systems, reliable and error-resilient systems design, fault tolerance, online testing, fault modeling, diagnosis and debugging, emergent technologies and nanocomputing, secure systems, energy-harvesting systems, and photovoltaic systems. She is an IEEE Fellow (as of January 2014) and a Golden Core member of IEEE Computer Society, from which she has received two Meritorious Service awards and two Certificates of Appreciation. Please contact her by email for possible comments on the monthly theme.