Parallel and Distributed Systems, International Conference on (2007)
Hsinchu, Taiwan
Dec. 5, 2007 to Dec. 7, 2007
ISBN: 978-1-4244-1889-3
pp: 1-7
R. Badrinath, Hewlett-Packard, USA
R.K. Palanivel Rajan, Hewlett-Packard, USA
R. Krishnakumar, Hewlett-Packard, USA
ABSTRACT
Application checkpoint and restart has been a widely studied problem over the last several decades. Despite an immense volume of theory and several research-project-level implementations, there are very few working solutions for parallel distributed applications (such as MPI programs on a cluster). We describe our experiences in enhancing a job scheduler to leverage the mechanisms of a virtual machine environment to support checkpoint-restart. We also describe the basic coordinated checkpoint-restart framework that we implemented and on which this solution is based.
CITATION
R. Badrinath, R.K. Palanivel Rajan, R. Krishnakumar, "Virtualization aware job schedulers for checkpoint-restart", Parallel and Distributed Systems, International Conference on, vol. 02, pp. 1-7, 2007, doi:10.1109/ICPADS.2007.4447844