Issue No.12 - December (2008 vol.57)
Zhiling Lan, Illinois Institute of Technology, Chicago
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2008.90
As the scale of high-performance computing (HPC) grows, application fault resilience becomes increasingly important. In this paper, we propose FT-Pro, an adaptive fault management approach that combines the merits of reactive checkpointing and proactive migration. It enables parallel applications to avoid anticipated failures via preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing. An adaptation manager is designed to make runtime decisions in response to failure predictions. We evaluate FT-Pro through stochastic modeling and case studies with real applications under a wide range of settings. Preliminary results indicate that FT-Pro outperforms periodic checkpointing, both in reducing application completion times and in improving resource utilization, by up to 43%.
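The runtime decision described above can be pictured with a minimal sketch. It is not FT-Pro's actual algorithm, which the abstract does not specify. The action names, the `choose_action` function, the spare-node check, and the `max_skipped_intervals` bound are all assumptions introduced here for illustration only.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical actions an adaptation manager might pick at a decision point."""
    MIGRATE = "migrate"        # proactive: move processes off a suspect node
    CHECKPOINT = "checkpoint"  # reactive safeguard: save application state
    SKIP = "skip"              # do nothing at this decision point


def choose_action(failure_predicted, spare_nodes,
                  intervals_since_checkpoint, max_skipped_intervals=3):
    """Pick an action for the current decision point (illustrative rule only).

    failure_predicted: True if the failure predictor raised a warning.
    spare_nodes: number of healthy nodes available as migration targets.
    intervals_since_checkpoint: decision points passed since the last checkpoint.
    max_skipped_intervals: assumed bound on consecutive skipped checkpoints.
    """
    if failure_predicted and spare_nodes > 0:
        # An anticipated failure can be avoided by preventive migration.
        return Action.MIGRATE
    if failure_predicted:
        # No migration target is available, so fall back to saving state.
        return Action.CHECKPOINT
    if intervals_since_checkpoint >= max_skipped_intervals:
        # Guard against unforeseen failures by bounding the work at risk.
        return Action.CHECKPOINT
    return Action.SKIP


if __name__ == "__main__":
    print(choose_action(True, 2, 0))
    print(choose_action(True, 0, 0))
    print(choose_action(False, 2, 3))
    print(choose_action(False, 2, 1))
```

In this sketch, checkpointing is selective in the sense that it is skipped when no failure is predicted and a checkpoint was taken recently. A real adaptation manager would presumably also weigh prediction accuracy and operation costs, which are omitted here.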
Fault tolerance, Performance evaluation of algorithms and systems
Zhiling Lan, "Adaptive Fault Management of Parallel Applications for High-Performance Computing", IEEE Transactions on Computers, vol. 57, no. 12, pp. 1647-1660, December 2008, doi:10.1109/TC.2008.90