IEEE International Performance Computing and Communications Conference (2011)
Orlando, FL, USA
Nov. 17, 2011 to Nov. 19, 2011
ISBN: 978-1-4673-0010-0
pp: 1-2
Bran Selic, Electrical & Information Engineering, The University of Sydney, NSW 2006, Australia
Ifeanyi P. Egwutuoha, Electrical & Information Engineering, The University of Sydney, NSW 2006, Australia
David Levy, Electrical & Information Engineering, The University of Sydney, NSW 2006, Australia
ABSTRACT
In recent years, High Performance Computing (HPC) systems have been shifting from expensive, massively parallel custom architectures to clusters of commodity personal computers to take advantage of cost and performance benefits. To avoid having to restart an application from the beginning after a sudden failure, checkpointing/restart fault tolerance mechanisms are commonly implemented. One drawback of checkpointing/restart is the overhead it adds, which increases the execution time of an application. We present a theoretical analysis of our technique, process level redundant (PLR) checkpointing/restart. The results show that PLR checkpointing/restart can significantly improve the overall reliability of an HPC system.
CITATION
Bran Selic, Ifeanyi P. Egwutuoha, David Levy, "Evaluation of process level redundant checkpointing/restart for HPC systems", IEEE International Performance Computing and Communications Conference, pp. 1-2, 2011, doi:10.1109/PCCC.2011.6108098