Effective Fault Treatment for Improving the Dependability of COTS and Legacy-Based Applications
October-December 2004 (vol. 1 no. 4)
pp. 223-237
This paper proposes a novel methodology and an architectural framework for handling multiple classes of faults in a COTS and legacy-based application: hardware-induced software errors in the application, process and/or host crashes or hangs, and errors in the system's persistent stable storage. The basic idea is to use an evidence-accruing fault tolerance manager that chooses and carries out one of several fault recovery strategies, depending on the perceived severity of the fault. The methodology and the framework have been applied to a case study consisting of a legacy system that uses a COTS DBMS for persistent storage. A thorough performability analysis has also been conducted by combining direct measurements with analytical modeling. Experimental results demonstrate that effective fault treatment, consisting of careful diagnosis and damage assessment, plays a key role in improving the dependability of COTS and legacy-based applications.
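The idea of accruing evidence before committing to a recovery strategy can be illustrated with a minimal Python sketch. It is not the paper's manager: the class name, the decay factor, the two thresholds, and the strategy names are all hypothetical, chosen only to show how a score that grows with each detected error and decays with each error-free observation might map perceived severity to progressively heavier recovery actions.

```python
from dataclasses import dataclass


@dataclass
class EvidenceAccruingManager:
    """Illustrative only: all parameters and strategy names are assumptions."""
    decay: float = 0.5               # hypothetical: score multiplier on an error-free observation
    restart_threshold: float = 1.5   # hypothetical: score at which a process restart is chosen
    failover_threshold: float = 3.0  # hypothetical: score at which the component is replaced
    score: float = 0.0               # accrued evidence that the fault is not transient

    def observe(self, error_detected: bool) -> str:
        """Update the evidence score and return the recovery strategy to apply."""
        if error_detected:
            self.score += 1.0          # each error adds evidence
        else:
            self.score *= self.decay   # evidence fades while the component behaves
            return "none"
        # Heavier recovery is chosen only once enough evidence has accumulated.
        if self.score >= self.failover_threshold:
            return "failover_to_replica"
        if self.score >= self.restart_threshold:
            return "restart_process"
        return "retry_request"


manager = EvidenceAccruingManager()
decisions = [manager.observe(e) for e in (True, True, True, False, True)]
print(decisions)
```

With these assumed values, an isolated error triggers only a retry, while errors that recur before the score has decayed escalate to a restart and then to a failover.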

Index Terms:
Legacy systems and COTS components, fault diagnosis and treatment, fault injection, modeling and evaluation, performability.
Andrea Bondavalli, Silvano Chiaradonna, Domenico Cotroneo, Luigi Romano, "Effective Fault Treatment for Improving the Dependability of COTS and Legacy-Based Applications," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 4, pp. 223-237, Oct.-Dec. 2004, doi:10.1109/TDSC.2004.40