Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06)
Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing
Singapore
May 16-May 19
ISBN: 0-7695-2585-7
Yawei Li, Illinois Institute of Technology, USA
Zhiling Lan, Illinois Institute of Technology, USA
As the scale of cluster computing grows, it is becoming hard for long-running applications to complete on large-scale clusters without encountering failures. To address this issue, checkpointing/restart is widely used to provide basic fault tolerance, yet it suffers from high overhead and is purely reactive. In this work, we propose FT-Pro, an adaptive fault management mechanism that uses failure prediction to choose among migration, checkpointing, or no action so as to reduce application execution time in the presence of failures. A cost-based evaluation model is presented for making this decision dynamically at run time. Using the actual failure log from a production cluster at NCSA, we demonstrate that even with modest failure prediction accuracy, FT-Pro outperforms the traditional checkpointing/restart strategy by 13%-30% in terms of application execution time despite failures, which is a significant performance improvement for long-running applications.
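The abstract does not give the cost model itself, so the following is only an illustrative sketch of what a cost-based choice among the three actions could look like. It is not FT-Pro's actual model: the `Costs` fields, the expected-cost formulas, and the assumption that migration fully avoids a predicted failure are all hypothetical simplifications introduced here.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical per-interval cost parameters, all in seconds."""
    interval: float      # length of one decision interval
    checkpoint: float    # overhead of writing a checkpoint
    migration: float     # overhead of migrating off a suspect node
    restart: float       # overhead of restarting after a failure
    unsaved_work: float  # work since the last checkpoint, lost on failure


def expected_costs(p_fail: float, c: Costs) -> dict:
    """Expected time to get through the next interval under each action.

    p_fail is the predicted probability of a failure in the next interval.
    """
    # No action: a failure costs a restart plus redoing all unsaved work
    # and the current interval.
    none = c.interval + p_fail * (c.restart + c.unsaved_work + c.interval)
    # Checkpoint first: earlier work is protected, so a failure costs a
    # restart plus redoing only the current interval.
    ckpt = c.checkpoint + c.interval + p_fail * (c.restart + c.interval)
    # Migrate: assumed (for illustration) to avoid the failure entirely.
    migr = c.migration + c.interval
    return {"none": none, "checkpoint": ckpt, "migrate": migr}


def choose_action(p_fail: float, c: Costs) -> str:
    """Pick the action with the lowest expected cost."""
    costs = expected_costs(p_fail, c)
    return min(costs, key=costs.get)


# Made-up example values, not taken from the paper or the NCSA log.
example = Costs(interval=600, checkpoint=60, migration=120,
                restart=100, unsaved_work=1800)
for p in (0.0, 0.05, 0.5):
    print(p, choose_action(p, example))
```

With these made-up numbers the sketch takes no action when no failure is predicted, checkpoints at a low predicted failure probability, and migrates at a high one, which is the kind of adaptive behavior the abstract describes.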
Citation:
Yawei Li, Zhiling Lan, "Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing," Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), pp. 531-538, 2006