2007 International Conference on Parallel Processing (ICPP 2007) (2007)
Sept. 10, 2007 to Sept. 14, 2007
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICPP.2007.42
Yawei Li, Illinois Institute of Technology, USA
Prashasta Gujrati, Illinois Institute of Technology, USA
Zhiling Lan, Illinois Institute of Technology, USA
Xian-he Sun, Illinois Institute of Technology, USA; Fermi National Accelerator Laboratory, USA
The productivity of HPC systems is determined not only by their performance but also by their reliability. The conventional method of limiting the impact of failures is checkpointing. However, existing research shows that such a reactive fault tolerance approach improves system productivity only marginally. Leveraging recent progress in failure prediction, we propose fault-driven rescheduling (FARS) to improve system resilience to failures, and we investigate the feasibility and effectiveness of using failure predictions to dynamically adjust the placement of active jobs (e.g., running jobs). In particular, a rescheduling algorithm is designed to enable effective job adjustment by evaluating the performance impact of both potential failures and rescheduling on user jobs. The proposed FARS complements existing research on fault-aware scheduling by allowing user jobs to avoid imminent failures at runtime. We evaluate FARS using actual workloads and failure events collected from production HPC systems. Our preliminary results show the potential of FARS for improving system resilience to failures.
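The trade-off the abstract describes, weighing the impact of a predicted failure against the cost of moving a job, can be illustrated with a minimal sketch. This is not the FARS algorithm from the paper; it is a simplified expected-cost rule under stated assumptions. The names `Job`, `work_at_risk_s`, `migration_cost_s`, `should_reschedule`, and `plan_rescheduling`, and the idea of a per-node failure probability, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    node: str                 # node the job currently runs on
    work_at_risk_s: float     # assumed: seconds of work lost if the node fails
    migration_cost_s: float   # assumed: seconds of overhead to move the job


def should_reschedule(job: Job, failure_prob: float) -> bool:
    """Assumed rule: move the job only if the expected loss from staying
    exceeds the cost of migrating it."""
    expected_loss = failure_prob * job.work_at_risk_s
    return expected_loss > job.migration_cost_s


def plan_rescheduling(jobs, predicted, spare_nodes):
    """Return a {job name: target node} plan.

    predicted   -- dict mapping node -> predicted failure probability
    spare_nodes -- candidate target nodes; nodes with a predicted failure
                   are excluded as targets
    """
    spares = [n for n in spare_nodes if n not in predicted]
    # Consider the jobs with the largest expected net benefit first.
    candidates = sorted(
        (j for j in jobs if j.node in predicted),
        key=lambda j: predicted[j.node] * j.work_at_risk_s - j.migration_cost_s,
        reverse=True,
    )
    plan = {}
    for job in candidates:
        if not spares:
            break  # no healthy node left to move jobs to
        if should_reschedule(job, predicted[job.node]):
            plan[job.name] = spares.pop(0)
    return plan


jobs = [
    Job("a", "n1", work_at_risk_s=3600, migration_cost_s=300),
    Job("b", "n1", work_at_risk_s=100, migration_cost_s=300),
    Job("c", "n2", work_at_risk_s=1000, migration_cost_s=50),
]
plan = plan_rescheduling(jobs, predicted={"n1": 0.5}, spare_nodes=["n3", "n1"])
```

In this toy setup only job "a" is moved: job "b" has too little work at risk to justify its migration cost, and job "c" runs on a node with no predicted failure. A real scheduler would also have to account for prediction accuracy, queue state, and checkpoint timing, none of which are modeled here.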
X. Sun, Y. Li, P. Gujrati and Z. Lan, "Fault-Driven Re-Scheduling For Improving System-level Fault Resilience," 2007 International Conference on Parallel Processing (ICPP), Xi'an, 2007, pp. 39.