24th IEEE Symposium on Reliable Distributed Systems (SRDS'05)
Automatic Model-Driven Recovery in Distributed Systems
Orlando, Florida
October 26-28, 2005
ISBN: 0-7695-2463-X
Kaustubh R. Joshi, University of Illinois at Urbana-Champaign
William H. Sanders, University of Illinois at Urbana-Champaign
Matti A. Hiltunen, AT&T Labs Research
Richard D. Schlichting, AT&T Labs Research

Automatic system monitoring and recovery has the potential to provide a low-cost solution for high availability. However, automating recovery is difficult in practice because accurate fault diagnosis is hard in the presence of the low coverage, poor localization ability, and false positives inherent in many widely used monitoring techniques. In this paper, we present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems. The approach uses theoretically sound techniques, including Bayesian estimation and Markov decision theory, to provide controllers that choose good, if not optimal, recovery actions according to a user-defined optimization criterion. By combining monitoring and recovery, the approach realizes benefits that could not be obtained by using either in isolation. We present two recovery algorithms with complementary properties and trade-offs, and validate them by fault injection (through simulation) on a realistic e-commerce system.
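To make the two named ingredients concrete, the sketch below shows (1) a Bayesian update of the belief over fault hypotheses given the outputs of imperfect monitors, and (2) a greedy, one-step choice of the recovery action with the lowest expected cost under that belief. This is a simplified illustration, not the authors' algorithms: the fault names, monitor probabilities, and costs are hypothetical, monitors are assumed conditionally independent given the fault, and a controller based on Markov decision theory would weigh sequences of actions rather than a single step.

```python
# Hypothetical model: fault names, monitor probabilities, and costs are invented
# for illustration and are not taken from the paper.
from typing import Dict

# Prior probability of each fault hypothesis ("null" = no fault present).
PRIOR: Dict[str, float] = {"null": 0.90, "db_down": 0.05, "web_hung": 0.05}

# P(monitor fires | fault). Values below 1.0 model imperfect coverage;
# values above 0.0 under "null" model false positives.
MONITORS: Dict[str, Dict[str, float]] = {
    "ping_db":  {"null": 0.02, "db_down": 0.95, "web_hung": 0.05},
    "http_get": {"null": 0.05, "db_down": 0.80, "web_hung": 0.90},
}

# COST[action][fault]: assumed cost (e.g., downtime) of an action when that fault is present.
COST: Dict[str, Dict[str, float]] = {
    "do_nothing":  {"null": 0.0, "db_down": 100.0, "web_hung": 100.0},
    "restart_db":  {"null": 5.0, "db_down": 5.0,   "web_hung": 105.0},
    "restart_web": {"null": 3.0, "db_down": 103.0, "web_hung": 3.0},
}


def bayes_update(prior: Dict[str, float], observations: Dict[str, bool]) -> Dict[str, float]:
    """Posterior over faults given which monitors fired (True) or stayed quiet (False)."""
    unnormalized = {}
    for fault, p in prior.items():
        likelihood = 1.0
        for monitor, fired in observations.items():
            p_fire = MONITORS[monitor][fault]
            likelihood *= p_fire if fired else (1.0 - p_fire)
        unnormalized[fault] = p * likelihood
    total = sum(unnormalized.values())
    return {fault: v / total for fault, v in unnormalized.items()}


def expected_cost(action: str, belief: Dict[str, float]) -> float:
    """Cost of an action averaged over the current fault belief."""
    return sum(belief[f] * COST[action][f] for f in belief)


def choose_action(belief: Dict[str, float]) -> str:
    """Greedy one-step policy: pick the action with minimum expected cost."""
    return min(COST, key=lambda a: expected_cost(a, belief))


if __name__ == "__main__":
    # The HTTP probe fails while the database ping succeeds.
    belief = bayes_update(PRIOR, {"ping_db": False, "http_get": True})
    print({f: round(p, 3) for f, p in belief.items()})
    print(choose_action(belief))
```

With one failed HTTP probe, the posterior splits roughly evenly between "no fault" (a false positive) and a hung web server, and restarting the web tier is cheapest in expectation. With no alarms at all, the same code prefers doing nothing, which illustrates why combining diagnosis with cost-aware action selection helps avoid reacting blindly to every alarm.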

Citation:
Kaustubh R. Joshi, William H. Sanders, Matti A. Hiltunen, Richard D. Schlichting, "Automatic Model-Driven Recovery in Distributed Systems," 24th IEEE Symposium on Reliable Distributed Systems (SRDS'05), pp. 25-38, 2005.