2010 IEEE International Conference on Cluster Computing
RDMA-Based Job Migration Framework for MPI over InfiniBand
Heraklion, Greece
September 20-September 24
ISBN: 978-0-7695-4220-1
BibTeX:
@inproceedings{10.1109/CLUSTER.2010.20,
  author = {Xiangyong Ouyang and Sonya Marcarelli and Raghunath Rajachandrasekar and Dhabaleswar K. Panda},
  title = {RDMA-Based Job Migration Framework for MPI over InfiniBand},
  booktitle = {2010 IEEE International Conference on Cluster Computing},
  year = {2010},
  isbn = {978-0-7695-4220-1},
  pages = {116-125},
  doi = {http://doi.ieeecomputersociety.org/10.1109/CLUSTER.2010.20},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
}
Coordinated checkpoint and recovery is a common approach to fault tolerance on large-scale systems. The traditional mechanism dumps the process images of all processes in the parallel job to a local disk or a central storage area. When a failure occurs, the processes are restarted and restored from the latest checkpoint image. However, this approach cannot provide the scalability required by increasingly large jobs: it puts a heavy I/O burden on the storage subsystem, and resubmitting a job during the restart phase incurs a lengthy queuing delay. In this paper, we enhance the fault tolerance of MVAPICH2, an open-source high-performance MPI-2 implementation, with a proactive job migration scheme. Instead of checkpointing all the processes of the job and saving their process images to stable storage, we transfer only the processes running on a node whose health is deteriorating to a healthy spare node, and resume them there. The process image transmission is RDMA-based, designed to take advantage of the high-performance communication in InfiniBand. Experimental results show that, for a 64-process application running on 8 compute nodes, the Job Migration scheme handles a node failure 4.49 times faster than the Checkpoint/Restart scheme. To the best of our knowledge, this is the first such job migration design for InfiniBand-based clusters.
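The difference in data movement between the two schemes can be illustrated with a small sketch. It is not the paper's implementation and performs no RDMA or InfiniBand communication; it only models which process images have to move. The function names (`block_placement`, `migrate`, `moved_ranks`), the node names, and the even block layout of 8 ranks per node are assumptions made for illustration.

```python
from typing import Dict, List


def block_placement(num_procs: int, nodes: List[str]) -> Dict[int, str]:
    """Assign ranks to nodes in equal contiguous blocks (an assumed layout)."""
    if num_procs % len(nodes) != 0:
        raise ValueError("this sketch assumes an even split of ranks across nodes")
    per_node = num_procs // len(nodes)
    return {rank: nodes[rank // per_node] for rank in range(num_procs)}


def migrate(placement: Dict[int, str], failing: str, spare: str) -> Dict[int, str]:
    """Return a new placement with every rank on `failing` moved to `spare`."""
    if spare in placement.values():
        raise ValueError("spare node must not already host ranks")
    return {rank: (spare if node == failing else node)
            for rank, node in placement.items()}


def moved_ranks(before: Dict[int, str], after: Dict[int, str]) -> List[int]:
    """List the ranks whose host node changed between two placements."""
    return sorted(rank for rank in before if before[rank] != after[rank])


# 64 processes on 8 compute nodes, as in the reported experiment.
nodes = [f"node{i}" for i in range(8)]           # hypothetical node names
before = block_placement(64, nodes)
after = migrate(before, failing="node3", spare="spare0")
moved = moved_ranks(before, after)

# Migration moves only the ranks of the failing node; a coordinated
# checkpoint would instead write the images of all ranks in `before`.
print(len(moved), "of", len(before), "process images transferred")
```

Under these assumptions only the ranks hosted on the failing node change location, while a coordinated checkpoint writes an image for every rank in the job. The sketch says nothing about transfer time, which depends on image sizes and the interconnect.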
Index Terms:
Checkpoint, Process-Migration, Proactive Fault Tolerance, MVAPICH2
Citation:
Xiangyong Ouyang, Sonya Marcarelli, Raghunath Rajachandrasekar, Dhabaleswar K. Panda, "RDMA-Based Job Migration Framework for MPI over InfiniBand," 2010 IEEE International Conference on Cluster Computing, pp. 116-125, 2010.
