Issue No.08 - August (2009 vol.58)
pp: 1063-1079
Onur Mutlu , Microsoft Research, Redmond
Kypros Constantinides , University of Michigan, Ann Arbor
Valeria Bertacco , University of Michigan, Ann Arbor
ABSTRACT
This work proposes a new software-based defect detection and diagnosis technique. We introduce a novel set of instructions, called Access-Control Extensions (ACE), that can access and control the microprocessor's internal state. Special firmware periodically suspends microprocessor execution and uses the ACE instructions to run directed tests on the hardware. When a hardware defect is present, these tests can diagnose and locate it, and then activate system repair through resource reconfiguration. The software nature of our framework makes it flexible: testing techniques can be modified or upgraded in the field to trade off performance against reliability without requiring any change to the hardware. We describe and evaluate different execution models for using the ACE framework. We also describe how the proposed ACE framework can be extended and utilized to improve the quality of post-silicon debugging and manufacturing testing of modern processors. We evaluated our technique on a commercial chip-multiprocessor based on Sun's Niagara and found that it can provide very high coverage, with 99.22 percent of all silicon defects detected. Moreover, our results show that the average performance overhead of software-based testing is only 5.5 percent. Based on a detailed register transfer level (RTL) implementation of our technique, we find its area and power consumption overheads to be modest, with a 5.8 percent increase in total chip area and a 4 percent increase in the chip's overall power consumption.
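The test-diagnose-repair cycle summarized in the abstract can be illustrated with a toy model. The sketch below is not the ACE firmware or its instruction set: the `Unit` class, the test vectors, and the 8-bit adder are hypothetical stand-ins chosen only to show the control flow of one test epoch, in which each enabled unit is checked against a reference and a failing unit is disabled as a stand-in for repair through resource reconfiguration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical model: a hardware unit is a function that should match a
# reference function; a defect makes its output diverge on some inputs.
@dataclass
class Unit:
    name: str
    compute: Callable[[int, int], int]
    enabled: bool = True

def run_directed_tests(unit: Unit,
                       reference: Callable[[int, int], int],
                       vectors: List[Tuple[int, int]]) -> bool:
    """Return True if the unit matches the reference on every test vector."""
    return all(unit.compute(a, b) == reference(a, b) for a, b in vectors)

def periodic_check(units: List[Unit],
                   reference: Callable[[int, int], int],
                   vectors: List[Tuple[int, int]]) -> List[str]:
    """One test epoch: test each enabled unit and disable any that fail."""
    faulty = []
    for unit in units:
        if unit.enabled and not run_directed_tests(unit, reference, vectors):
            unit.enabled = False  # stand-in for repair via reconfiguration
            faulty.append(unit.name)
    return faulty

def good_adder(a: int, b: int) -> int:
    return (a + b) & 0xFF

def stuck_bit_adder(a: int, b: int) -> int:
    # Models a defect: output bit 2 is stuck at 1.
    return ((a + b) & 0xFF) | 0x04

units = [Unit("alu0", good_adder), Unit("alu1", stuck_bit_adder)]
vectors = [(0, 0), (1, 2), (0xFF, 1), (0x55, 0xAA)]
faulty = periodic_check(units, good_adder, vectors)
print(faulty)
```

Because the test routine is ordinary software in this model, swapping in a longer or shorter vector list changes the coverage and the time spent per epoch without touching the modeled hardware, which is the flexibility argument the abstract makes.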
INDEX TERMS
Reliability, hardware defects, online defect detection, testing, online self-test, post-silicon debugging, manufacturing test.
CITATION
Onur Mutlu, Kypros Constantinides, Valeria Bertacco, "A Flexible Software-Based Framework for Online Detection of Hardware Defects", IEEE Transactions on Computers, vol. 58, no. 8, pp. 1063-1079, August 2009, doi:10.1109/TC.2009.52