Issue No.01 - January-March (2010 vol.7)
pp: 94-109
Avi Timor , Technion Intel Corp., Haifa
Avi Mendelson , Technion Intel Corp., Haifa
Yitzhak Birk , Technion Intel Corp., Haifa
Neeraj Suri , TU Darmstadt, Darmstadt
Soft errors (or transient faults) are temporary faults that arise in a circuit from a variety of internal noise sources and external sources such as cosmic particle hits. Though soft errors still occur infrequently, they are rapidly becoming a major impediment to processor reliability, primarily because of processor scaling characteristics. In the past, systems designed to tolerate such faults relied on costly customized solutions that used replicated hardware components to detect and recover from microprocessor faults. As feature sizes keep shrinking and on-die multiprocessors proliferate in all segments of computer-based systems, the capability to detect and recover from faults is also desired for commodity hardware. For such systems, however, performance and power are the main drivers, so the traditional solutions prove inadequate and new approaches are required. We introduce two independent and complementary microarchitecture-level techniques: Double Execution and Double Decoding. Both exploit the typically low average resource utilization of modern processors to enhance processor reliability. Double Execution protects the out-of-order part of the CPU by executing each instruction twice. Double Decoding uses a second, low-performance, low-power instruction decoder to detect soft errors in the decoder logic. These simple-to-implement techniques are shown to improve the processor's reliability with relatively low performance, power, and hardware overheads. Finally, the resulting "excessive" reliability can even be traded back for performance by increasing the clock rate and/or reducing the voltage, thereby improving upon single-execution approaches.
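The Double Execution idea in the abstract (execute each instruction twice and compare the results) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's microarchitecture: the tiny `alu`, the fault injector `make_transient_fault_alu`, the retry policy, and the `max_retries` parameter are all hypothetical names introduced here for illustration. The abstract states that faults are detected and recovered from but does not say how, so the retry-on-mismatch recovery is an assumption.

```python
def alu(op, a, b):
    """Fault-free semantics of a tiny, hypothetical 32-bit ALU."""
    if op == "add":
        return (a + b) & 0xFFFFFFFF
    if op == "sub":
        return (a - b) & 0xFFFFFFFF
    if op == "xor":
        return (a ^ b) & 0xFFFFFFFF
    raise ValueError(f"unknown op: {op}")


def make_transient_fault_alu(faulty_call, bit):
    """Return an ALU that flips one result bit on its n-th call only.

    This models a single transient fault: later calls are unaffected.
    """
    calls = 0

    def execute(op, a, b):
        nonlocal calls
        calls += 1
        result = alu(op, a, b)
        if calls == faulty_call:
            result ^= 1 << bit  # inject a single-bit soft error
        return result

    return execute


def double_execute(execute, op, a, b, max_retries=3):
    """Run the instruction twice and accept the result only if both agree.

    On a mismatch the pair is discarded and re-executed (assumed recovery
    policy). Returns (result, number_of_detected_mismatches).
    Note: two executions corrupted in exactly the same way would go undetected.
    """
    mismatches = 0
    for _ in range(max_retries + 1):
        first = execute(op, a, b)
        second = execute(op, a, b)
        if first == second:
            return first, mismatches
        mismatches += 1
    raise RuntimeError("persistent mismatch: fault does not look transient")


# Usage: the first execution is corrupted, the comparison catches it,
# and the retry produces the correct sum.
execute = make_transient_fault_alu(faulty_call=1, bit=5)
result, mismatches = double_execute(execute, "add", 40, 2)
print(result, mismatches)  # 42 1
```

In this model the cost of protection is the second execution of every instruction. The abstract's argument is that in real hardware this cost can be largely absorbed by execution resources that would otherwise sit idle.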
Transient faults, soft errors, superscalar, fault tolerance, microarchitecture, double execution.
Avi Timor, Avi Mendelson, Yitzhak Birk, Neeraj Suri, "Using Underutilized CPU Resources to Enhance Its Reliability", IEEE Transactions on Dependable and Secure Computing, vol.7, no. 1, pp. 94-109, January-March 2010, doi:10.1109/TDSC.2008.31