The Effects of an ARMOR-Based SIFT Environment on the Performance and Dependability of User Applications
April 2004 (vol. 30 no. 4)
pp. 257-277
Few distributed software-implemented fault tolerance (SIFT) environments have been experimentally evaluated using substantial applications to show that they protect both themselves and the applications from errors. This paper presents an experimental evaluation of a SIFT environment used to oversee spaceborne applications as part of the Remote Exploration and Experimentation (REE) program at the Jet Propulsion Laboratory. The SIFT environment is built around a set of self-checking ARMOR processes running on different machines that provide error detection and recovery services to themselves and to the REE applications. An evaluation methodology is presented in which over 28,000 errors were injected into both the SIFT processes and two representative REE applications. The experiments were split into three groups of error injections, with each group successively stressing the SIFT error detection and recovery more than the previous group. The results show that the SIFT environment added negligible overhead to the application's execution time during failure-free runs. Correlated failures affecting a SIFT process and application process are possible, but the division of detection and recovery responsibilities in the SIFT environment allows it to recover from these multiple failure scenarios. Only 28 cases were observed in which either the application failed to start or the SIFT environment failed to recognize that the application had completed. Further investigations showed that assertions within the SIFT processes—coupled with object-based incremental checkpointing—were effective in preventing system failures by protecting dynamic data within the SIFT processes.

Index Terms:
Software-implemented fault tolerance, distributed systems, high availability.
Citation:
Keith Whisnant, Ravishankar K. Iyer, Zbigniew T. Kalbarczyk, Phillip H. Jones III, David A. Rennels, Raphael Some, "The Effects of an ARMOR-Based SIFT Environment on the Performance and Dependability of User Applications," IEEE Transactions on Software Engineering, vol. 30, no. 4, pp. 257-277, April 2004, doi:10.1109/TSE.2004.1274045