Experimental Evaluation of Behavior-Based Failure-Detection Schemes in Real-Time Communication Networks
June 1999 (vol. 10 no. 6)
pp. 613-626

Abstract—Effective detection of failures is essential for reliable communication services. Traditionally, non-real-time computer networks have relied on behavior-based techniques for detecting communication failures. That is, each node uses heartbeats to detect the failure of its neighbors and the end-to-end transport protocol (e.g., TCP) achieves reliable communication by acknowledgment/retransmission. Recently, there has been a growing demand for reliable “real-time” communication, but little research has been done on the failure detection problem. In this paper, we present two behavior-based failure-detection schemes—neighbor detection and end-to-end detection—for reliable real-time communication services and experimentally evaluate their effectiveness. Specifically, we measure and analyze the coverage and latency of these detection schemes through fault-injection experiments. The experimental results have shown that nearly all failures can be detected very quickly by the neighbor detection scheme, while the end-to-end detection scheme uncovers the remaining failures with larger detection latencies.
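As an illustration of the ideas named in the abstract, the following is a minimal sketch of heartbeat-based neighbor detection and of how coverage and mean detection latency could be computed from fault-injection records. It is not the authors' implementation: the class `HeartbeatDetector`, the function `coverage_and_mean_latency`, and the parameters `period` and `max_missed` are hypothetical names chosen for this example, and the timeout rule (a fixed number of missed heartbeat periods) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class HeartbeatDetector:
    """Suspects a neighbor after `max_missed` silent heartbeat periods (assumed rule)."""
    period: float          # expected heartbeat interval in seconds (assumption)
    max_missed: int = 3    # tolerated consecutive missed heartbeats (assumption)
    last_seen: Dict[str, float] = field(default_factory=dict)

    def on_heartbeat(self, neighbor: str, now: float) -> None:
        # Record the arrival time of the most recent heartbeat from this neighbor.
        self.last_seen[neighbor] = now

    def suspected(self, now: float) -> List[str]:
        # A neighbor is suspected once its silence exceeds the timeout window.
        timeout = self.period * self.max_missed
        return sorted(n for n, t in self.last_seen.items() if now - t > timeout)


def coverage_and_mean_latency(
    latencies: List[Optional[float]],
) -> Tuple[float, Optional[float]]:
    """Each entry is the detection latency of one injected fault, or None if undetected."""
    detected = [x for x in latencies if x is not None]
    coverage = len(detected) / len(latencies) if latencies else 0.0
    mean_latency = sum(detected) / len(detected) if detected else None
    return coverage, mean_latency
```

In this sketch, coverage is the fraction of injected faults that were detected at all, and latency is averaged only over the detected ones; an end-to-end scheme could feed its own latencies into the same function for comparison.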

Index Terms:
Real-time communication, network failures, failure detection, fault-injection experiments.
Citation:
Seungjae Han, Kang G. Shin, "Experimental Evaluation of Behavior-Based Failure-Detection Schemes in Real-Time Communication Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 6, pp. 613-626, June 1999, doi:10.1109/71.774910