Susceptibility of Commodity Systems and Software to Memory Soft Errors
December 2004 (vol. 53 no. 12)
pp. 1557-1568
It is widely understood that most system downtime is accounted for by programming errors and administration time. However, a growing body of work indicates that an increasing share of downtime may stem from transient errors in computer system hardware caused by external factors, such as cosmic rays. This work indicates that moving to denser semiconductor technologies at lower voltages has the potential to increase these transient errors. In this paper, we investigate the susceptibility of commodity operating systems and applications on commodity PC processors to these soft errors, and we introduce ideas for improving recovery from these transient errors in software. Our results indicate that, for the Linux kernel and a Java virtual machine running sample workloads, many errors are not activated, mostly because the corrupted data is overwritten before it is used. In addition, given current and upcoming microprocessor support, our results indicate that errors that are activated, which would normally lead to a system reboot, need not be fatal to the system if software knowledge is used for simple software recovery. Together, these results indicate the benefits of simple memory soft error recovery handling in commodity processors and software.
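
To make the notion of a non-activated error concrete, the following is a minimal illustrative sketch, not the paper's injection methodology. It assumes a toy model in which a workload is a trace of read and write accesses to byte addresses, and a single-bit flip is classified by the first access to the corrupted address after the flip. All names (`flip_bit`, `classify_fault`, `run_campaign`) are hypothetical.

```python
import random


def flip_bit(memory: bytearray, addr: int, bit: int) -> None:
    """Model a soft error: invert one bit of one byte in place."""
    memory[addr] ^= 1 << bit


def classify_fault(trace, fault_addr, fault_step):
    """Classify a flip that lands just before trace[fault_step].

    trace is a list of (op, addr) pairs with op in {"r", "w"}.
    Assumption of this toy model: a write fully replaces the corrupted
    byte, so the first later access to fault_addr decides the outcome.
    """
    for op, addr in trace[fault_step:]:
        if addr == fault_addr:
            return "activated" if op == "r" else "overwritten"
    return "dormant"  # never touched again during the run


def run_campaign(trace, memory_size, trials, seed=0):
    """Inject `trials` random faults and count the outcomes."""
    rng = random.Random(seed)
    counts = {"activated": 0, "overwritten": 0, "dormant": 0}
    for _ in range(trials):
        addr = rng.randrange(memory_size)
        step = rng.randrange(len(trace) + 1)
        counts[classify_fault(trace, addr, step)] += 1
    return counts


if __name__ == "__main__":
    memory = bytearray(4)
    flip_bit(memory, 0, 3)  # memory[0] now holds a corrupted value
    trace = [("w", 0), ("r", 0), ("w", 1), ("r", 2)]
    print(classify_fault(trace, 0, 0))  # first access is a write
    print(run_campaign(trace, memory_size=4, trials=1000))
```

In this model only the "activated" outcomes would ever be visible to software; the "overwritten" and "dormant" flips disappear without effect. The proportions the sketch produces depend entirely on the made-up trace and say nothing about the measured results.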

Index Terms:
Soft errors, memory errors, commodity, operating systems, Java, recovery.
Alan Messer, Philippe Bernadat, Guangrui Fu, Deqing Chen, Zoran Dimitrijevic, David Lie, Durga Devi Mannaru, Alma Riska, Dejan Milojicic, "Susceptibility of Commodity Systems and Software to Memory Soft Errors," IEEE Transactions on Computers, vol. 53, no. 12, pp. 1557-1568, Dec. 2004, doi:10.1109/TC.2004.119