Hierarchical Error Detection in a Software Implemented Fault Tolerance (SIFT) Environment
March/April 2000 (vol. 12 no. 2)
pp. 203-224

Abstract—This paper proposes a hierarchical error detection framework for a Software Implemented Fault Tolerance (SIFT) layer of a distributed system. A four-level error detection hierarchy is proposed in the context of Chameleon, a software environment for providing adaptive fault-tolerance in an environment of commercial off-the-shelf (COTS) system components and software. The design and implementation of a software-based distributed signature monitoring scheme, which is central to the proposed four-level hierarchy, is described. Both intralevel and interlevel optimizations that minimize the overhead of detection and are capable of adapting to runtime requirements are proposed. The paper presents results from a prototype implementation of two levels of the error detection hierarchy and results of a detailed simulation of the overall environment. The results indicate a substantial increase in availability due to the detection framework and help in understanding the trade-offs between overhead and coverage for different combinations of techniques.

Index Terms:
Software implemented fault tolerance, hierarchical error detection, distributed systems, data and control signatures, speculative execution.
Citation:
Saurabh Bagchi, Balaji Srinivasan, Keith Whisnant, Zbigniew Kalbarczyk, Ravishankar K. Iyer, "Hierarchical Error Detection in a Software Implemented Fault Tolerance (SIFT) Environment," IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 2, pp. 203-224, March-April 2000, doi:10.1109/69.842263