Concurrent Detection of Software and Hardware Data-Access Faults
April 1997 (vol. 46 no. 4)
pp. 412-424

Abstract: A new approach allows low-cost concurrent detection of two important types of faults, software and hardware data-access faults, using an extension of the existing signature monitoring approach. The proposed approach detects data-access faults using a new type of redundant data structure that contains an embedded signature. Low-cost fault detection is achieved using simple architecture support and compiler support that exploit natural redundancies in the data structures, in the instruction set architecture, and in the data-access mechanism. The software data-access faults that the approach can detect include faults that have been shown to cause a high percentage of system failures. Hardware data-access faults that occur at all levels of the data-memory hierarchy are also detectable, including faults in the register file, the data cache, the data-cache TLB, and the memory address and data buses. Benchmark results for the MIPS R3000 processor executing code scheduled by a modified GNU C Compiler show that the new approach can concurrently check a high percentage of data accesses, while causing little performance overhead and little memory overhead.
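
The abstract does not spell out the encoding, so the following is only a rough software illustration of the general idea of a data structure with an embedded signature, not the paper's mechanism. It assumes each stored word carries a check word equal to the value XORed with a per-object signature; the names `sig_word`, `sig_store`, `sig_load`, `sig_demo`, `SIG_A`, and `SIG_B` are hypothetical. The paper relies on architecture and compiler support to keep costs low, whereas this sketch doubles the storage and checks in software.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-object signatures (assumed values, not from the paper). */
#define SIG_A 0xA5A5A5A5u
#define SIG_B 0x3C3C3C3Cu

/* A stored word plus a redundant check word: check = value XOR signature. */
typedef struct {
    uint32_t value;
    uint32_t check;
} sig_word;

/* Store v at index i, embedding the owning object's signature. */
static void sig_store(sig_word *mem, size_t i, uint32_t sig, uint32_t v)
{
    mem[i].value = v;
    mem[i].check = v ^ sig;
}

/* Load index i, expecting signature sig. Returns 0 on success and
 * -1 when the embedded signature does not match (a detected fault). */
static int sig_load(const sig_word *mem, size_t i, uint32_t sig, uint32_t *out)
{
    if ((mem[i].value ^ mem[i].check) != sig)
        return -1;
    *out = mem[i].value;
    return 0;
}

/* Two objects share one backing array: A in words 0..3, B in words 4..7.
 * Returns the number of faulty accesses that were detected. */
static int sig_demo(void)
{
    sig_word mem[8];
    uint32_t v = 0;
    int detected = 0;

    for (size_t i = 0; i < 4; i++)
        sig_store(mem, i, SIG_A, (uint32_t)i);
    for (size_t i = 4; i < 8; i++)
        sig_store(mem, i, SIG_B, (uint32_t)(100 + i));

    /* A valid access to object A passes the check. */
    assert(sig_load(mem, 2, SIG_A, &v) == 0 && v == 2);

    /* Software fault: an index meant for A overruns into B. */
    if (sig_load(mem, 5, SIG_A, &v) != 0)
        detected++;

    /* Hardware fault: a simulated single-bit flip in a stored word. */
    mem[1].value ^= 1u << 7;
    if (sig_load(mem, 1, SIG_A, &v) != 0)
        detected++;

    return detected;
}
```

Calling `sig_demo()` exercises one valid access, one overrun into a neighboring object, and one corrupted word; both faulty accesses fail the signature comparison in this sketch.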


Index Terms:
Software fault detection, hardware fault detection, redundant data structure, on-line testing, concurrent error detection, signature monitoring, architecture support for fault detection, compiler support for fault detection.
Citation:
Kent D. Wilken, Timothy Kong, "Concurrent Detection of Software and Hardware Data-Access Faults," IEEE Transactions on Computers, vol. 46, no. 4, pp. 412-424, April 1997, doi:10.1109/12.588046