PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures
April-June 2009 (vol. 6 no. 2)
pp. 135-148
Alex Shye, Northwestern University, Evanston
Joseph Blomstedt, University of Colorado, Boulder
Tipp Moseley, University of Colorado, Boulder
Vijay Janapa Reddi, Harvard University, Cambridge
Daniel A. Connors, University of Colorado, Boulder
Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point toward multicore designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents process-level redundancy (PLR), a software technique for transient fault tolerance, which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance, which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. A real prototype is presented that is designed to be transparent to the application and can run on general-purpose single-threaded programs without modifications to the program, operating system, or underlying hardware. The system is evaluated for fault coverage and performance on a four-way SMP machine and provides improved performance over existing software transient fault tolerance techniques with a 16.9 percent overhead for fault detection on a set of optimized SPEC2000 binaries.
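The core idea described above (run redundant copies of a process and compare them) can be illustrated with a minimal sketch. This is not the authors' prototype: it assumes a POSIX system with fork, compares only the final result of a function rather than intermediate program behavior, and uses hypothetical names (run_redundant, fault_in). The fault_in parameter is an artificial hook that corrupts one replica's output to simulate a transient fault.

```python
import os
import pickle
from collections import Counter


def run_redundant(func, args=(), replicas=3, fault_in=None):
    """Run func(*args) in `replicas` forked processes and vote on the results.

    Returns (result, disagreed). `fault_in` is a hypothetical testing hook:
    the replica with that index flips one bit of its serialized output.
    """
    children = []
    for index in range(replicas):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: compute, serialize, send to the parent, then exit.
            os.close(read_fd)
            try:
                payload = pickle.dumps(func(*args))
                if index == fault_in:
                    # Simulated transient fault: flip one bit of the output.
                    payload = payload[:-1] + bytes([payload[-1] ^ 1])
                with os.fdopen(write_fd, "wb") as out:
                    out.write(payload)
            finally:
                os._exit(0)
        # Parent: keep only the read end of this replica's pipe.
        os.close(write_fd)
        children.append((pid, read_fd))

    outputs = []
    for pid, read_fd in children:
        with os.fdopen(read_fd, "rb") as inp:
            outputs.append(inp.read())  # empty bytes if the replica crashed
        os.waitpid(pid, 0)

    # Majority vote over the raw serialized outputs.
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes <= replicas // 2 or not winner:
        raise RuntimeError("replicas disagree and no valid majority exists")
    return pickle.loads(winner), votes != replicas


if __name__ == "__main__":
    print(run_redundant(sum, ([1, 2, 3],)))
    print(run_redundant(sum, ([1, 2, 3],), fault_in=1))
```

With three replicas, a single corrupted output is outvoted and the caller still receives the majority result, along with a flag reporting the disagreement. With two replicas, the same mismatch raises an error, which corresponds to detection without recovery. Because the replicas are ordinary processes, the operating system is free to place them on whichever cores are available.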


Index Terms:
Fault tolerance, reliability, transient faults, soft errors, process-level redundancy.
Alex Shye, Joseph Blomstedt, Tipp Moseley, Vijay Janapa Reddi, Daniel A. Connors, "PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures," IEEE Transactions on Dependable and Secure Computing, vol. 6, no. 2, pp. 135-148, April-June 2009, doi:10.1109/TDSC.2008.62