2008 IEEE International Conference on Cluster Computing (2008)
Tsukuba, Japan
Sept. 29, 2008 to Oct. 1, 2008
ISSN: 1552-5244
ISBN: 978-1-4244-2639-3
pp: 326-329
H. Jitsumoto, Tokyo Inst. of Technol., Tokyo
T. Endo, Tokyo Inst. of Technol., Tokyo
S. Matsuoka, Tokyo Inst. of Technol., Tokyo
Fault tolerance is now essential for HPC systems that run long applications of massive and growing scale. Although checkpointing with rollback recovery is a popular technique, automated checkpointing is becoming troublesome on real systems because of the extremely large size of collective application memory. Automated optimization of the checkpoint interval is therefore essential, but the optimal interval depends on hardware failure rates and I/O bandwidth. Our new model and algorithm, an extension of Vaidya's model, solve the problem by taking these parameters into account. A prototype implementation on our fault-tolerant MPI framework ABARIS showed approximately 5.5% improvement over statically user-determined cases.
optimisation, checkpointing, fault tolerant computing, message passing
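
The dependence of the optimal checkpoint interval on failure rate and I/O bandwidth can be illustrated with a minimal sketch. It is not the model proposed in the paper, which extends Vaidya's model and is not reproduced here; it uses Young's classic first-order approximation, interval = sqrt(2 x checkpoint cost x MTBF), as a stand-in. The function names, the assumption that checkpoint cost equals memory size divided by I/O bandwidth, and the sample figures are all hypothetical.

```python
import math


def checkpoint_cost_s(memory_bytes: float, io_bandwidth_bytes_per_s: float) -> float:
    """Time to write one checkpoint, assuming cost = memory size / I/O bandwidth."""
    if memory_bytes < 0 or io_bandwidth_bytes_per_s <= 0:
        raise ValueError("memory must be >= 0 and bandwidth must be > 0")
    return memory_bytes / io_bandwidth_bytes_per_s


def young_interval_s(cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    This is a stand-in for illustration, not the paper's extended Vaidya model.
    """
    if cost_s < 0 or mtbf_s <= 0:
        raise ValueError("cost must be >= 0 and MTBF must be > 0")
    return math.sqrt(2.0 * cost_s * mtbf_s)


if __name__ == "__main__":
    # Hypothetical environment: 1 TiB of application memory, 24 h system MTBF.
    memory = 1024 ** 4
    mtbf = 24 * 3600
    for bandwidth_gib in (1, 10, 100):
        cost = checkpoint_cost_s(memory, bandwidth_gib * 1024 ** 3)
        interval = young_interval_s(cost, mtbf)
        print(f"{bandwidth_gib:>3} GiB/s: cost {cost:8.1f} s, interval {interval:8.1f} s")
```

Under these assumptions, faster I/O shortens each checkpoint and therefore shortens the recommended interval, while a more reliable machine (larger MTBF) lengthens it. This is why a fixed, user-chosen interval can be suboptimal when either parameter changes.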

H. Jitsumoto, T. Endo and S. Matsuoka, "Environmental-aware optimization of MPI checkpointing intervals," 2008 IEEE International Conference on Cluster Computing (CLUSTER), Tsukuba, Japan, 2008, pp. 326-329.