Dependability Analysis of a High-Speed Network Using Software-Implemented Fault Injection and Simulated Fault Injection
January 1998 (vol. 47 no. 1)
pp. 108-119

Abstract: This paper presents a dependability study of high-speed, switched Local Area Networks (LANs) using Myrinet as an example testbed (with theoretical speeds of 2.56 Gbps). The study uses the results of two fault injection methods, simulated fault injection and software-implemented fault injection (SWIFI), to analyze the application-level impact of transient faults injected into the network interface hardware. The observed errors include dropped or corrupt messages, host interface or host resets, and local or remote host interface hangs. The paper presents the study in two parts. First, the results from the SWIFI method in the real system are used as a basis to validate the simulation and to identify the major factors leading to differences between the methods. A comparison between the two injection methods shows that they agree for 83 percent of the fault injections. The results, however, vary greatly depending on the fault type considered. Second, the study analyzes the effects of varying the workload intensity, the host platform, and the interface function targeted by the injection. For example, this analysis shows that the function targeted has a significant impact on the fault activation rate. Finally, the study identifies two mechanisms by which faults may propagate from the interface to other parts of the network: in one example, this propagation caused the interface's host computer to reboot, while in another it caused a remote interface in the network to hang.
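
As a rough illustration of the comparison described in the abstract, the sketch below models a transient fault as a single inverted bit and computes how often two injection methods report the same outcome, both overall and per fault type. It is not the authors' tooling: the function names, the 32-bit word width, the outcome labels, and the fault-type names are all assumptions chosen for the example, and the sample data is invented.

```python
from collections import Counter


def flip_bit(word: int, bit: int, width: int = 32) -> int:
    """Return `word` with one bit inverted (a simple transient-fault model).

    The 32-bit default width is an assumption for illustration only.
    """
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    return (word ^ (1 << bit)) & ((1 << width) - 1)


def agreement_rate(swifi_outcomes, simulated_outcomes):
    """Fraction of injections for which both methods report the same outcome."""
    if len(swifi_outcomes) != len(simulated_outcomes):
        raise ValueError("outcome lists must have the same length")
    if not swifi_outcomes:
        raise ValueError("no injections to compare")
    matches = sum(a == b for a, b in zip(swifi_outcomes, simulated_outcomes))
    return matches / len(swifi_outcomes)


def agreement_by_fault_type(records):
    """Per-fault-type agreement.

    `records` is an iterable of (fault_type, swifi_outcome, simulated_outcome)
    tuples; the labels themselves are hypothetical.
    """
    totals = Counter()
    matches = Counter()
    for fault_type, swifi, simulated in records:
        totals[fault_type] += 1
        if swifi == simulated:
            matches[fault_type] += 1
    return {fault_type: matches[fault_type] / totals[fault_type]
            for fault_type in totals}


# Invented sample data, for illustration only.
sample_records = [
    ("register", "dropped_message", "dropped_message"),
    ("register", "no_effect", "no_effect"),
    ("memory", "interface_hang", "corrupt_message"),
    ("memory", "no_effect", "no_effect"),
]
overall = agreement_rate([r[1] for r in sample_records],
                         [r[2] for r in sample_records])
per_type = agreement_by_fault_type(sample_records)
```

Breaking the agreement down by fault type, as in `agreement_by_fault_type`, is what would expose the kind of variation the abstract mentions, where a single overall figure hides large differences between fault types.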

Index Terms:
Dependability, fault simulation, Myrinet, SWIFI, fault effect, embedded system.
Citation:
David T. Stott, Greg Ries, Mei-Chen Hsueh, Ravishankar K. Iyer, "Dependability Analysis of a High-Speed Network Using Software-Implemented Fault Injection and Simulated Fault Injection," IEEE Transactions on Computers, vol. 47, no. 1, pp. 108-119, Jan. 1998, doi:10.1109/12.656094