FINE: A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior Under Faults
November 1993 (vol. 19 no. 11)
pp. 1105-1118

The authors present a fault injection and monitoring environment (FINE) as a tool to study fault propagation in the UNIX kernel. FINE injects hardware-induced software errors and software faults into the UNIX kernel and traces the execution flow and key variables of the kernel. FINE consists of a fault injector, a software monitor, a workload generator, a controller, and several analysis utilities. Experiments on SunOS 4.1.2 are conducted by applying FINE to investigate fault propagation and to evaluate the impact of various types of faults. Fault propagation models are built for both hardware and software faults. Transient Markov reward analysis is performed to evaluate the loss of performance due to an injected fault. Experimental results show that memory and software faults usually have a very long latency, while bus and CPU faults tend to crash the system immediately. About half of the detected errors are data faults, which are detected when the system tries to access an unauthorized memory location. Only about 8% of faults propagate to other UNIX subsystems. Markov reward analysis shows that the performance loss incurred by bus faults and CPU faults is much higher than that incurred by software and memory faults. Among software faults, the impact of pointer faults is higher than that of nonpointer faults.
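
To make the idea of transient Markov reward analysis concrete, the sketch below works through a small discrete-time example. It is an illustration only, not the model built in the paper: the three states, the transition probabilities, the reward rates, and all function names are hypothetical. Each state carries a reward rate (relative performance), and the loss is the fraction of fault-free reward that is not accumulated over a finite observation window.

```python
# Illustrative transient Markov reward computation.
# All states, probabilities, and rewards are hypothetical, not taken from the paper.

STATES = ["normal", "error_latent", "crashed"]

# P[i][j]: probability of moving from state i to state j in one time step.
P = [
    [1.00, 0.00, 0.00],  # normal: the fault had no effect or was overwritten
    [0.10, 0.85, 0.05],  # error latent: may vanish, persist, or crash the system
    [0.00, 0.00, 1.00],  # crashed: absorbing within the observation window
]

# Reward rate per step in each state (1.0 = full performance).
REWARD = [1.0, 0.6, 0.0]


def step(dist, matrix):
    """Advance a state distribution by one time step (row vector times matrix)."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def transient_reward(initial, matrix, reward, steps):
    """Return the distribution after `steps` steps and the expected accumulated reward."""
    dist = list(initial)
    accumulated = 0.0
    for _ in range(steps):
        # Expected reward earned during this step, given the current distribution.
        accumulated += sum(p * r for p, r in zip(dist, reward))
        dist = step(dist, matrix)
    return dist, accumulated


def performance_loss(initial, matrix, reward, steps, fault_free_rate=1.0):
    """Fraction of the fault-free reward lost over the observation window."""
    _, accumulated = transient_reward(initial, matrix, reward, steps)
    return 1.0 - accumulated / (steps * fault_free_rate)


if __name__ == "__main__":
    # Start in the latent-error state, as if a fault had just been injected.
    loss = performance_loss([0.0, 1.0, 0.0], P, REWARD, steps=50)
    print(f"expected performance loss over 50 steps: {loss:.3f}")
```

In a model of this shape, a fault class that moves quickly into the crashed state accumulates little reward and so shows a high loss, while a fault class that lingers in a latent state with little performance impact shows a low one.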

Index Terms:
FINE; fault injection and monitoring environment; UNIX system behavior; hardware-induced software errors; software faults; fault injector; software monitor; workload generator; analysis utilities; SunOS 4.1.2; transient Markov reward analysis; bus faults; CPU faults; pointer faults; program testing; software tools; system monitoring; Unix
Citation:
W.-L. Kao, R.K. Iyer, D. Tang, "FINE: A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior Under Faults," IEEE Transactions on Software Engineering, vol. 19, no. 11, pp. 1105-1118, Nov. 1993, doi:10.1109/32.256857