Probabilistic Model-Driven Recovery in Distributed Systems
November/December 2011 (vol. 8 no. 6)
pp. 913-928
Kaustubh R. Joshi, AT&T Labs Research, Florham Park
Matti A. Hiltunen, AT&T Labs Research, Florham Park
William H. Sanders, University of Illinois at Urbana-Champaign, Urbana
Richard D. Schlichting, AT&T Labs Research, Florham Park
Automatic system monitoring and recovery have the potential to provide effective, low-cost ways to improve dependability in distributed software systems. However, automating recovery is challenging in practice because accurate fault diagnosis is hampered by monitoring tools and techniques that often have low fault coverage, poor fault localization, detection delays, and false positives. In this paper, we present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems. To do so, it uses theoretically sound techniques, including Bayesian estimation and Markov decision theory, to provide controllers that choose good, if not optimal, recovery actions according to user-defined optimization criteria. By combining monitoring and recovery, the approach realizes benefits that could not have been obtained by using them in isolation. We experimentally validate our framework by fault injection on realistic e-commerce systems.
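
To make the idea of combining Bayesian estimation with cost-based action selection concrete, the following is a minimal sketch and not the controller described in the paper. It assumes a toy system with three hypothetical fault hypotheses, two hypothetical monitors with imperfect coverage and nonzero false-positive rates, and an invented cost table. It also replaces the paper's Markov decision machinery with a simple one-step expected-cost rule, so it illustrates only the general flavor of the approach.

```python
# Illustrative sketch only: every name and number below is an assumption,
# not taken from the paper.

# Prior belief over hypothetical fault hypotheses.
PRIOR = {"none": 0.90, "web_crash": 0.05, "db_crash": 0.05}

# P(monitor raises an alarm | hypothesis). Values below 1 for real faults
# model imperfect coverage; values above 0 for "none" model false positives.
ALARM_PROB = {
    "ping_web":   {"none": 0.01, "web_crash": 0.95, "db_crash": 0.01},
    "http_probe": {"none": 0.05, "web_crash": 0.90, "db_crash": 0.80},
}

# Assumed cost of taking an action when a given hypothesis is true.
COST = {
    "wait":        {"none": 0,  "web_crash": 100, "db_crash": 100},
    "restart_web": {"none": 10, "web_crash": 10,  "db_crash": 110},
    "restart_db":  {"none": 30, "web_crash": 130, "db_crash": 30},
}


def bayes_update(belief, observations):
    """Return the posterior belief given {monitor_name: alarm_raised}.

    Monitors are assumed conditionally independent given the hypothesis.
    """
    unnormalized = {}
    for hypothesis, prior in belief.items():
        likelihood = 1.0
        for monitor, alarmed in observations.items():
            p_alarm = ALARM_PROB[monitor][hypothesis]
            likelihood *= p_alarm if alarmed else 1.0 - p_alarm
        unnormalized[hypothesis] = prior * likelihood
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}


def best_action(belief):
    """Pick the action with the lowest one-step expected cost (greedy rule)."""
    expected = {
        action: sum(belief[h] * costs[h] for h in belief)
        for action, costs in COST.items()
    }
    return min(expected, key=expected.get), expected


if __name__ == "__main__":
    # The HTTP probe alarms while the ping monitor stays quiet.
    posterior = bayes_update(PRIOR, {"ping_web": False, "http_probe": True})
    action, expected = best_action(posterior)
    print(posterior)
    print(action, expected)
```

In this toy setup, neither monitor localizes the fault on its own, but their joint outputs shift the belief enough for the cost rule to prefer one recovery action over another. A sequential decision model would additionally account for the effect of each action on future beliefs, which this greedy rule ignores.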


Index Terms:
Fault tolerance, monitoring, diagnosis, recovery, distributed systems, adaptive systems, POMDP, Bayesian.
Kaustubh R. Joshi, Matti A. Hiltunen, William H. Sanders, Richard D. Schlichting, "Probabilistic Model-Driven Recovery in Distributed Systems," IEEE Transactions on Dependable and Secure Computing, vol. 8, no. 6, pp. 913-928, Nov.-Dec. 2011, doi:10.1109/TDSC.2010.45