Parallel and Distributed Processing Symposium, International (2007)
Long Beach, CA, USA
Mar. 26, 2007 to Mar. 30, 2007
Hideyuki Jitsumoto, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8552 JAPAN. firstname.lastname@example.org
Toshio Endo, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8552 JAPAN. email@example.com
Satoshi Matsuoka, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, 152-8552 JAPAN; National Institute of Informatics, 2-1-1 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430 JAPAN. firstname.lastname@example.org
Long-running MPI applications on clusters and grids are prone to node and network failures, which motivates the use of fault-tolerant MPI implementations. However, previous fault-tolerant MPIs do not let the user easily choose fault recovery strategies appropriate to the execution environment, independently of the application code; rather, the user often had to hard-code restoration strategies for diverse sets of fault patterns, which could be numerous. For instance, if the fault is transient and confined to a particular process, we merely have to restart the process on the same computing node; on the other hand, if the fault is due to recurring hardware unreliability, we must migrate the process to a new node during recovery. ABARIS is our new fault/recovery-model-aware component framework for MPI, in which users can customize MPI fault detection and recovery algorithms according to the requirements of their application and execution environment merely by selecting appropriate fault/recovery components, independently of the application code. Currently, the ABARIS framework prototype is implemented on top of MPICH-P4MPD. Preliminary evaluation of the prototype using NPB on our MPI fault simulator demonstrates that the overhead compared to the original MPICH-P4MPD is almost negligible (less than 1%) under normal execution, and that, when faults occur, the appropriate selection and pairing of fault model and recovery method components for the execution environment significantly affects the overall execution time.
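The idea of pairing a fault model component with a recovery method component, without touching application code, can be illustrated with a small sketch. This is not the ABARIS API: every name below (`ThresholdFaultModel`, `RecoveryFramework`, `restart_in_place`, `migrate_to_spare`) is hypothetical, and the rule that a node failing a fixed number of times counts as unreliable hardware is an assumption made only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, Dict, List


class FaultKind(Enum):
    TRANSIENT = auto()           # one-off fault confined to a process
    RECURRING_HARDWARE = auto()  # the node itself appears unreliable


@dataclass
class ThresholdFaultModel:
    """Hypothetical fault model component (an assumption, not from the paper):
    a node that has failed `threshold` times or more is treated as
    unreliable hardware; earlier failures are treated as transient."""
    threshold: int = 2
    failures: Counter = field(default_factory=Counter)

    def classify(self, node: str) -> FaultKind:
        self.failures[node] += 1
        if self.failures[node] >= self.threshold:
            return FaultKind.RECURRING_HARDWARE
        return FaultKind.TRANSIENT


# A recovery component maps (failed node, spare nodes) to the node on
# which the process should be restarted.
RecoveryFn = Callable[[str, List[str]], str]


def restart_in_place(node: str, spares: List[str]) -> str:
    """Restart the process on the same computing node."""
    return node


def migrate_to_spare(node: str, spares: List[str]) -> str:
    """Restart the process on a different node, consuming one spare."""
    if not spares:
        raise RuntimeError("no spare node available for migration")
    return spares.pop(0)


@dataclass
class RecoveryFramework:
    """Pairs one fault model with one recovery strategy per fault kind."""
    model: ThresholdFaultModel
    strategies: Dict[FaultKind, RecoveryFn]

    def on_failure(self, node: str, spares: List[str]) -> str:
        kind = self.model.classify(node)
        return self.strategies[kind](node, spares)


# The "application" never mentions recovery; only this wiring changes.
framework = RecoveryFramework(
    model=ThresholdFaultModel(threshold=2),
    strategies={
        FaultKind.TRANSIENT: restart_in_place,
        FaultKind.RECURRING_HARDWARE: migrate_to_spare,
    },
)
spares = ["node-b"]
first = framework.on_failure("node-a", spares)   # first failure: restart in place
second = framework.on_failure("node-a", spares)  # second failure: migrate
```

In this sketch, switching to a different environment means swapping the model or the strategy table, which mirrors the abstract's point that the choice of components, rather than the application code, determines recovery behavior.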
H. Jitsumoto, T. Endo and S. Matsuoka, "ABARIS: An Adaptable Fault Detection/Recovery Component Framework for MPIs," 2007 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Long Beach, CA, USA, 2007, pp. 413.