2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)
Application Resilience: Making Progress in Spite of Failure
May 19-May 22
ISBN: 978-0-7695-3156-4
While measures such as raw compute performance and system capacity continue to be important factors for evaluating cluster performance, such issues as system reliability and application resilience have become increasingly important as cluster sizes rapidly grow. Although efforts to directly improve fault-tolerance are important, it is also essential to accept that application failures will inevitably occur and to ensure that progress is made despite these failures. Application monitoring frameworks are central to providing application resilience. As such, the central theme of this paper is to address the impact that application monitoring detection latency has on the overall system performance. We find that immediate fault detection is not necessary in order to obtain substantial improvement in performance. This conclusion is significant because it implies that less complex, highly portable, and predominantly less expensive failure detection schemes would provide adequate application resilience.
Index Terms:
application resilience, application monitoring, cluster computing, fault tolerance, error detection
Citation:
William M. Jones, John T. Daly, Nathan A. DeBardeleben, "Application Resilience: Making Progress in Spite of Failure," 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pp. 789-794, 2008