Issue No.04 - July/August (2011 vol.8)
pp: 537-547
John P. Hayes, University of Michigan, Ann Arbor
Sudhakar M. Reddy, University of Iowa, Iowa City
Ilia Polian, Albert-Ludwigs-University of Freiburg, Freiburg
Transient or soft errors caused by various environmental effects are a growing concern in micro and nanoelectronics. We present a general framework for modeling and mitigating the logical effects of such errors in digital circuits. We observe that some errors have time-bounded effects; the system's output is corrupted for a few clock cycles, after which it recovers automatically. Since such erroneous behavior can be tolerated by some applications, i.e., it is noncritical at the system level, we define the critical soft error rate (CSER) as a more realistic alternative to the conventional SER measure. A simplified technology-independent fault model, the single transient fault (STF), is proposed for efficiently estimating the error probabilities associated with individual nodes in both combinational and sequential logic. STFs can be used to compute various other useful metrics for the faults and errors of interest, and the required computations can leverage the large body of existing methods and tools designed for (permanent) stuck-at faults. As an application of the proposed methodology, we introduce a systematic strategy for hardening logic circuits against transient faults. The goal is to achieve a desired level of CSER at minimum cost by selecting a subset of nodes for hardening against STFs. Exact and approximate algorithms to solve the node selection problem are presented. The effectiveness of this approach is demonstrated by experiments with the ISCAS-85 and -89 benchmark suites, as well as some large (multimillion-gate) industrial circuits.
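The node selection problem described above (reach a desired CSER at minimum hardening cost) can be illustrated with a small sketch. This is not the exact or approximate algorithm from the paper, which the abstract does not specify; it is a simple greedy heuristic under assumptions made for illustration only: each node contributes an independent, additive amount to the circuit's CSER, hardening a node removes that contribution entirely, and each node has a known positive hardening cost. The names `select_nodes_greedy`, `contribution`, `cost`, and `target_cser` are hypothetical.

```python
from typing import Dict, List, Tuple


def select_nodes_greedy(
    contribution: Dict[str, float],
    cost: Dict[str, float],
    target_cser: float,
) -> Tuple[List[str], float, float]:
    """Pick nodes to harden until the residual CSER meets the target.

    Assumptions (illustrative, not taken from the paper):
      - contribution[n] is node n's additive share of the circuit CSER,
      - hardening node n removes contribution[n] completely,
      - cost[n] > 0 is the cost of hardening node n.
    Returns (chosen nodes, residual CSER, total hardening cost).
    """
    residual = sum(contribution.values())

    # Rank nodes by CSER reduction per unit cost, best ratio first.
    ranked = sorted(
        contribution,
        key=lambda n: contribution[n] / cost[n],
        reverse=True,
    )

    chosen: List[str] = []
    total_cost = 0.0
    for node in ranked:
        if residual <= target_cser:
            break  # target already met; stop spending
        chosen.append(node)
        residual -= contribution[node]
        total_cost += cost[node]

    return chosen, residual, total_cost


if __name__ == "__main__":
    # Toy per-node values, invented for the example.
    contribution = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
    cost = {"a": 1.0, "b": 1.0, "c": 2.0, "d": 1.0}
    print(select_nodes_greedy(contribution, cost, target_cser=0.25))
```

A ratio-based greedy pass like this is fast but does not guarantee a minimum-cost selection in general, since the underlying problem resembles a knapsack-style covering problem; an exact method would have to search over subsets or use integer programming.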
Soft errors, error tolerance, selective hardening, transient faults.
John P. Hayes, Sudhakar M. Reddy, Ilia Polian, "Modeling and Mitigating Transient Errors in Logic Circuits", IEEE Transactions on Dependable and Secure Computing, vol.8, no. 4, pp. 537-547, July/August 2011, doi:10.1109/TDSC.2010.26