2007 International Conference on Parallel Processing (ICPP 2007)
Fault-Driven Re-Scheduling For Improving System-level Fault Resilience
Xi'an, China
September 10-September 14
ISBN: 0-7695-2933-X
Yawei Li, Illinois Institute of Technology, USA
Prashasta Gujrati, Illinois Institute of Technology, USA
Zhiling Lan, Illinois Institute of Technology, USA
Xian-he Sun, Illinois Institute of Technology, USA; Fermi National Accelerator Laboratory, USA
The productivity of HPC systems is determined not only by their performance but also by their reliability. The conventional method of limiting the impact of failures is checkpointing. However, existing research shows that such a reactive fault tolerance approach can improve system productivity only marginally. Leveraging recent progress in failure prediction, we propose fault-driven rescheduling (FARS) to improve system resilience to failures, and we investigate the feasibility and effectiveness of using failure predictions to dynamically adjust the placement of active jobs (e.g., running jobs). In particular, a rescheduling algorithm is designed to enable effective job adjustment by evaluating the performance impact of potential failures and of rescheduling on user jobs. The proposed FARS complements existing research on fault-aware scheduling by allowing user jobs to avoid imminent failures at runtime. We evaluate FARS using actual workloads and failure events collected from production HPC systems. Our preliminary results show the potential of FARS for improving system resilience to failures.
Citation:
Yawei Li, Prashasta Gujrati, Zhiling Lan, Xian-he Sun, "Fault-Driven Re-Scheduling For Improving System-level Fault Resilience," icpp, pp.39, 2007 International Conference on Parallel Processing (ICPP 2007), 2007