High Performance Computing and Grid in Asia Pacific Region, International Conference on (2004)
Omiya Sonic City, Tokyo, Japan
July 20, 2004 to July 22, 2004
Hidemoto Nakada, National Institute of Advanced Industrial, Japan
Satoshi Matsuoka, Tokyo Institute of Technology, Japan
Yoshio Tanaka, National Institute of Advanced Industrial, Japan
Satoshi Sekiguchi, National Institute of Advanced Industrial, Japan
We describe the design and implementation of a fault-tolerant GridRPC system, Ninf-C, designed for easy programming of large-scale master-worker programs that take from a few days to a few months to execute in a Grid environment. Ninf-C employs Condor, developed at the University of Wisconsin, as the underlying middleware; Condor supports remote file transmission and checkpointing, providing system-wide robustness for application users on the Grid. Ninf-C layers all the GridRPC communication and task-parallel programming features on top of Condor in a non-trivial fashion, assuming that the entire program is structured in a master-worker style; in fact, older Ninf master-worker programs can be run directly or trivially ported to Ninf-C. In contrast to the original Ninf, Ninf-C exploits and extends Condor features extensively for robustness and transparency, such as 1) checkpointing and stateful recovery of the master process, 2) communication between the master and workers through (remote) files rather than IP sockets, and 3) automated throttling of parallel GridRPC calls. In contrast to using Condor directly, programmers can set up complex dynamic workflows as well as master-worker parallel structures with almost no learning curve. To evaluate the robustness of the system, we performed an experiment on a heterogeneous cluster consisting of x86 and SPARC CPUs, running a simple but long-running master-worker program while rebooting multiple nodes in stages to simulate serious fault situations. The program finished normally, surviving all the fault scenarios, which demonstrates the robustness of Ninf-C.
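The three mechanisms listed in the abstract (master recovery, file-based master-worker exchange, and throttled parallel calls) can be illustrated with a minimal local sketch. This is not the Ninf-C or Condor API: the names `worker`, `run_master`, and `demo`, the JSON file layout, and the use of a thread pool in place of remote workers are all assumptions made for illustration only.

```python
from __future__ import annotations

import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def worker(in_path: Path, out_path: Path) -> None:
    """Hypothetical worker: read a task from a file, write the result to a file."""
    task = json.loads(in_path.read_text())
    result = {"id": task["id"], "value": task["x"] ** 2}
    # Write under a temporary name, then rename, so the master
    # never observes a partially written result file.
    tmp = out_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(result))
    tmp.replace(out_path)


def run_master(workdir: Path, inputs: list[int], max_parallel: int = 2) -> dict[int, int]:
    """Run one task per input, skipping tasks whose result file already exists."""
    pending = []
    for i, x in enumerate(inputs):
        in_path = workdir / f"task_{i}.in.json"
        out_path = workdir / f"task_{i}.out.json"
        if out_path.exists():
            continue  # already finished before a (simulated) master restart
        in_path.write_text(json.dumps({"id": i, "x": x}))
        pending.append((in_path, out_path))

    # max_workers caps the number of calls in flight: a simple form of throttling.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(worker, a, b) for a, b in pending]
        for future in futures:
            future.result()  # re-raise any worker error in the master

    # Collect every result from its file, whether produced now or earlier.
    results = {}
    for i in range(len(inputs)):
        data = json.loads((workdir / f"task_{i}.out.json").read_text())
        results[data["id"]] = data["value"]
    return results


def demo() -> dict[int, int]:
    """Run the master twice in one directory to mimic a restart after a fault."""
    with tempfile.TemporaryDirectory() as d:
        workdir = Path(d)
        run_master(workdir, [1, 2, 3])             # first run
        return run_master(workdir, [1, 2, 3, 4])   # restarted run reuses three results
```

In this sketch the result files double as a crude record of master progress: a restarted master recomputes only the tasks whose output is missing. The system described in the abstract relies on Condor's checkpointing and remote file transfer for this, which the sketch does not attempt to reproduce.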
Y. Tanaka, H. Nakada, S. Sekiguchi and S. Matsuoka, "The Design and Implementation of a Fault-Tolerant RPC System: Ninf-C," High Performance Computing and Grid in Asia Pacific Region, International Conference on (HPCASIA), Omiya Sonic City, Tokyo, Japan, 2004, pp. 9-18.