ReStore: Symptom-Based Soft Error Detection in Microprocessors
July-September 2006 (vol. 3 no. 3)
pp. 188-201
Device scaling and large-scale integration have led to growing concerns about soft errors in microprocessors. To date, in all but the most demanding applications, implementing parity and ECC for caches and other large, regular SRAM structures has been sufficient to stem the growing soft error tide. This will not be the case for long, and questions remain as to the best way to detect and recover from soft errors in the remainder of the processor, in particular the less structured execution core. In this work, we propose the ReStore architecture, which leverages existing performance-enhancing checkpointing hardware to recover from soft error events in a low-cost fashion. Error detection in the ReStore architecture is novel: symptoms that hint at the presence of soft errors trigger restoration of a previous checkpoint. Example symptoms include exceptions, control flow misspeculations, and cache or translation lookaside buffer misses. Compared to conventional soft error detection via full replication, the ReStore framework incurs little overhead, but sacrifices some amount of error coverage. These attributes make it an ideal means to provide very cost-effective error coverage for processor applications that can tolerate a nonzero, but small, soft error failure rate. Our evaluation of an example ReStore implementation exhibits a 2x increase in MTBF (mean time between failures) over a standard pipeline with minimal hardware and performance overheads. The MTBF increases by 20x if ReStore is coupled with protection for certain particularly vulnerable pipeline structures.
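The symptom-triggered rollback described in the abstract can be illustrated with a toy model. The sketch below is not the ReStore implementation: the accumulator "core", the function names `run` and `implied_coverage`, the checkpoint interval, and the assumption that a bit flip raises a symptom on the very instruction it corrupts are all hypothetical simplifications. The second function shows only the arithmetic linking an MTBF gain to the fraction of failure-causing errors that must be covered, assuming a constant raw error rate and MTBF inversely proportional to the failure rate; the values it returns are not figures reported in the paper.

```python
def run(program, fault_step=None, interval=4):
    """Accumulate the integers in `program` on a toy core with checkpoints.

    fault_step: dynamic instruction count (1-based) at which a one-off
                bit flip corrupts the accumulator; None means no fault.
    interval:   a checkpoint is taken whenever pc is a multiple of this.
    Returns (final accumulator, instructions executed, rollbacks).
    """
    acc, pc = 0, 0
    checkpoint = (pc, acc)   # saved (pc, architectural state)
    executed = 0             # counts replayed instructions too
    rollbacks = 0
    while pc < len(program):
        if pc % interval == 0:
            checkpoint = (pc, acc)
        acc += program[pc]
        pc += 1
        executed += 1
        # Assumption: the transient fault shows up as a symptom
        # (exception, misspeculation, cache/TLB miss) immediately.
        symptom = executed == fault_step
        if symptom:
            acc ^= 1 << 7             # the soft error itself
            pc, acc = checkpoint      # restore the previous checkpoint
            rollbacks += 1            # the fault is transient, so the
                                      # replay proceeds without it
    return acc, executed, rollbacks


def implied_coverage(mtbf_gain):
    """Fraction of failure-causing errors that must be removed for a
    given MTBF gain, assuming MTBF is proportional to 1 / failure rate."""
    return 1.0 - 1.0 / mtbf_gain


if __name__ == "__main__":
    prog = [1, 2, 3, 4, 5, 6]
    print(run(prog))                  # fault-free run
    print(run(prog, fault_step=6))    # fault, symptom, rollback, replay
    print(implied_coverage(2), implied_coverage(20))
```

In this toy run the faulted execution ends with the same accumulator value as the fault-free one, at the cost of replaying the instructions since the last checkpoint. A real design must also cope with symptoms that appear some time after the fault and with symptoms that have nothing to do with a soft error, neither of which this sketch models.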
Index Terms:
Simulation, fault tolerance, fault injection, redundant design.
Citation:
Nicholas J. Wang, Sanjay J. Patel, "ReStore: Symptom-Based Soft Error Detection in Microprocessors," IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 3, pp. 188-201, July-Sept. 2006, doi:10.1109/TDSC.2006.40