Chameleon: A Software Infrastructure for Adaptive Fault Tolerance
June 1999 (vol. 10 no. 6)
pp. 560-579

Abstract: This paper presents Chameleon, an adaptive infrastructure that allows different levels of availability requirements to be supported simultaneously in a networked environment. Chameleon provides dependability through special ARMORs (Adaptive, Reconfigurable, and Mobile Objects for Reliability) that control all operations in the Chameleon environment. Three broad classes of ARMORs are defined: 1) Managers oversee other ARMORs and recover from failures in their subordinates. 2) Daemons provide communication gateways to the ARMORs at the host node and make the host's resources available to the Chameleon environment. 3) Common ARMORs implement specific techniques for providing application-required dependability. Using ARMORs, Chameleon makes different fault-tolerant configurations available and adapts at run time to changes in an application's availability requirements. The flexible ARMOR architecture allows the composition of ARMORs to be reconfigured at run time, i.e., the ARMORs can adapt dynamically to changing application requirements. In this paper, we describe the ARMOR architecture, including the ARMOR class hierarchy, basic building blocks, ARMOR composition, and the use of ARMOR factories. We show how ARMORs can be reconfigured and reengineered and demonstrate how the architecture serves our objective of providing an adaptive software infrastructure. To our knowledge, Chameleon is one of the few real implementations that enable multiple fault tolerance strategies to coexist in the same environment and support fault-tolerant execution of substantially off-the-shelf applications through a software infrastructure alone. Chameleon provides fault tolerance from the application's point of view as well as from the software infrastructure's point of view.
To demonstrate Chameleon's capabilities, we have implemented a prototype infrastructure that provides a set of ARMORs to initialize the environment and to support the dual and TMR (triple modular redundancy) application execution modes. Through this testbed environment, we measure the execution overhead and the recovery times from failures in the user application, the Chameleon ARMORs, the hardware, and the operating system.
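
The dual and TMR execution modes named in the abstract can be illustrated with a generic result-comparison sketch. This is not Chameleon's implementation: the function names `run_duplex` and `run_tmr` are hypothetical, each replica is modeled as a plain Python callable rather than a process on a separate node, and results are assumed to be hashable and directly comparable.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_duplex(replicas: Sequence[Callable[[], T]]) -> T:
    """Dual mode (hypothetical helper): accept a result only if both replicas agree."""
    if len(replicas) != 2:
        raise ValueError("dual mode needs exactly two replicas")
    first, second = (replica() for replica in replicas)
    if first != second:
        # A mismatch detects an error but cannot tell which replica is wrong.
        raise RuntimeError("replica results disagree")
    return first


def run_tmr(replicas: Sequence[Callable[[], T]]) -> T:
    """TMR mode (hypothetical helper): return the result produced by at least two of three replicas."""
    if len(replicas) != 3:
        raise ValueError("TMR needs exactly three replicas")
    results = [replica() for replica in replicas]
    # The most common result wins only if at least two replicas produced it.
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority among replica results")
    return value
```

For example, `run_tmr([lambda: 42, lambda: 41, lambda: 42])` returns 42 because the majority masks the single faulty replica, whereas `run_duplex` with two disagreeing replicas can only signal the mismatch and leave recovery to a higher layer.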

Index Terms:
Adaptive fault tolerance, high availability networked computing, software-implemented fault tolerance, COTS, extendible modular architecture.
Zbigniew T. Kalbarczyk, Ravishankar K. Iyer, Saurabh Bagchi, Keith Whisnant, "Chameleon: A Software Infrastructure for Adaptive Fault Tolerance," IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 6, pp. 560-579, June 1999, doi:10.1109/71.774907