Autonomic Microprocessor Execution via Self-Repairing Arrays
October-December 2005 (vol. 2 no. 4)
pp. 297-310
To achieve high reliability despite hard faults that occur during operation, and high yield despite defects introduced at fabrication, a microprocessor must be able to tolerate hard faults. In this paper, we present a framework for autonomic self-repair of the array structures in microprocessors (e.g., the reorder buffer and instruction window). The framework consists of three aspects: 1) detecting and diagnosing the fault, 2) recovering from the resultant error, and 3) mapping out the faulty portion of the array. For each aspect, we present design options. Based on this framework, we develop two particular schemes for self-repairing array structures (SRAS). Simulation results show that one of our SRAS schemes adds some performance overhead in the fault-free case, but that both of them mask hard faults 1) with less hardware overhead than higher-level redundancy (e.g., IBM mainframes) and 2) without the per-error performance penalty of existing low-cost techniques that combine error detection with pipeline flushes for backward error recovery (BER). When hard faults are present in arrays, whether from operational faults or fabrication defects, SRAS schemes outperform BER because they do not have to flush the pipeline frequently.
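The three aspects of the framework (detect/diagnose, recover, map out) can be illustrated with a small software model. This is only a sketch under stated assumptions, not either of the paper's SRAS schemes: the class name `SelfRepairingArray`, the read-back comparison used as a stand-in for a hardware checker, the per-row error counter with a threshold of 2, and the spare-row remap table are all hypothetical choices made for illustration.

```python
class SelfRepairingArray:
    """Toy model (hypothetical): logical entries backed by physical rows plus spares."""

    def __init__(self, size, spares, fault_threshold=2):
        self.remap = list(range(size))                 # logical index -> physical row
        self.free_spares = list(range(size, size + spares))
        self.rows = [0] * (size + spares)
        self.error_counts = [0] * (size + spares)      # per-row error counter (assumed)
        self.fault_threshold = fault_threshold         # errors before a row is deemed hard-faulty
        self.stuck_at_zero = set()                     # injected faults, simulation only

    def inject_hard_fault(self, physical_row):
        """Simulate a hard fault: the row always stores 0."""
        self.stuck_at_zero.add(physical_row)

    def _raw_write(self, row, value):
        self.rows[row] = 0 if row in self.stuck_at_zero else value

    def write(self, index, value):
        """Write a value; return how many retries (recoveries) were needed."""
        retries = 0
        while True:
            row = self.remap[index]
            self._raw_write(row, value)
            if self.rows[row] == value:                # 1) detect: read-back check (assumed)
                return retries
            retries += 1                               # 2) recover: retry the operation
            self.error_counts[row] += 1
            if self.error_counts[row] >= self.fault_threshold:
                # Diagnose as a hard fault, then 3) map out the faulty row.
                if not self.free_spares:
                    raise RuntimeError("no spare rows left")
                self.remap[index] = self.free_spares.pop(0)

    def read(self, index):
        return self.rows[self.remap[index]]


arr = SelfRepairingArray(size=4, spares=1)
arr.inject_hard_fault(2)
first = arr.write(2, 7)    # pays retries until row 2 is mapped out
second = arr.write(2, 9)   # uses the spare row, no retries
```

In this model the retry cost is paid only until the faulty row is mapped out; later accesses to the same logical entry go straight to the spare. A scheme that only detects and retries would pay the recovery cost on every access to the faulty row, which is the per-error penalty the abstract attributes to BER.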

Index Terms:
Logic design reliability and testing, microprocessors, and microcomputers.
Citation:
Fred A. Bower, Sule Ozev, Daniel J. Sorin, "Autonomic Microprocessor Execution via Self-Repairing Arrays," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 4, pp. 297-310, Oct.-Dec. 2005, doi:10.1109/TDSC.2005.44