Cluster Computing and the Grid, IEEE International Symposium on (2011)
Newport Beach, California USA
May 23, 2011 to May 26, 2011
Coordinated Checkpoint/Restart (C/R) is a widely deployed strategy for achieving fault tolerance. However, C/R alone cannot meet the demands of upcoming exascale systems because of its heavy I/O overhead. Process migration has already been proposed in the literature as a proactive fault-tolerance mechanism to complement C/R. Several popular MPI implementations, including MVAPICH2 and Open MPI, support process migration, but these existing solutions do not deliver satisfactory performance. In this paper we conduct extensive profiling of several process migration mechanisms and reveal that inefficient I/O and network transfer are the principal factors responsible for the high overhead. We then propose a new approach, Pipelined Process Migration with RDMA (PPMR), to overcome these overheads. Our new protocol fully pipelines data write, data transfer, and data read operations across the different phases of a migration cycle. PPMR aggregates data writes on the migration source node and transfers the data to the target node via high-throughput RDMA transport. At the target node, it implements an efficient restart mechanism that restarts processes directly from the RDMA data streams. We have implemented this Pipelined Process Migration protocol in MVAPICH2 and studied its performance benefits. Experimental results show that PPMR completes a process migration 10.7X faster than the conventional approach at a moderate (8 MB) memory usage. When three migrations are performed in succession, PPMR reduces the process migration overhead on the application from 38% to 5%.
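The pipelining idea described in the abstract can be illustrated with a minimal sketch. It is not the PPMR implementation: threads and bounded in-memory queues stand in for the source-side writes, the RDMA transfer, and the target-side restart, and all names (`CHUNK_SIZE`, `write_stage`, `transfer_stage`, `restart_stage`, `migrate`) are hypothetical. The only point shown is that the three stages run concurrently on a stream of chunks, so a later chunk can be written while an earlier one is being transferred or read.

```python
import queue
import threading

CHUNK_SIZE = 4    # hypothetical chunk size in bytes, kept tiny for illustration
_DONE = object()  # sentinel marking the end of the chunk stream


def write_stage(image: bytes, out_q: queue.Queue) -> None:
    """Source node: emit the process image chunk by chunk (stand-in for aggregated writes)."""
    for off in range(0, len(image), CHUNK_SIZE):
        out_q.put(image[off:off + CHUNK_SIZE])
    out_q.put(_DONE)


def transfer_stage(in_q: queue.Queue, out_q: queue.Queue) -> None:
    """Stand-in for the network transfer; a real system would use RDMA here."""
    while True:
        chunk = in_q.get()
        out_q.put(chunk)  # forward data chunks and the sentinel alike
        if chunk is _DONE:
            break


def restart_stage(in_q: queue.Queue, result: bytearray) -> None:
    """Target node: rebuild the image from the incoming stream as chunks arrive."""
    while True:
        chunk = in_q.get()
        if chunk is _DONE:
            break
        result.extend(chunk)


def migrate(image: bytes) -> bytes:
    """Run the three stages concurrently; bounded queues provide back-pressure."""
    q1: queue.Queue = queue.Queue(maxsize=2)
    q2: queue.Queue = queue.Queue(maxsize=2)
    result = bytearray()
    threads = [
        threading.Thread(target=write_stage, args=(image, q1)),
        threading.Thread(target=transfer_stage, args=(q1, q2)),
        threading.Thread(target=restart_stage, args=(q2, result)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(result)
```

Calling `migrate(b"0123456789")` returns the same bytes it was given. The bounded queues mean no stage has to buffer the whole image before the next stage starts, which is the property the abstract attributes to pipelining.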
process-migration, fault-tolerance, pipelining, RDMA
X. Besseron, X. Ouyang, D. K. Panda and R. Rajachandrasekar, "High Performance Pipelined Process Migration with RDMA," Cluster Computing and the Grid, IEEE International Symposium on (CCGRID), Newport Beach, California USA, 2011, pp. 314-323.