Issue No. 12 - December (2008 vol. 57)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2008.90
Zhiling Lan, Illinois Institute of Technology, Chicago
Yawei Li, Illinois Institute of Technology, Chicago
As the scale of high-performance computing (HPC) grows, application fault resilience becomes increasingly important. In this paper, we propose FT-Pro, an adaptive fault management approach that combines the merits of reactive checkpointing and proactive migration. It enables parallel applications to avoid anticipated failures via preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing. An adaptation manager is designed to make runtime decisions in response to failure prediction. We evaluate FT-Pro through stochastic modeling and case studies with real applications under a wide range of settings. Preliminary results indicate that FT-Pro outperforms periodic checkpointing, in terms of both reducing application completion times and improving resource utilization, by up to 43%.
Fault tolerance, Performance evaluation of algorithms and systems
Y. Li and Z. Lan, "Adaptive Fault Management of Parallel Applications for High-Performance Computing," in IEEE Transactions on Computers, vol. 57, no. 12, pp. 1647-1660, 2008.