Cluster Computing and the Grid, IEEE International Symposium on (2008)
May 19, 2008 to May 22, 2008
One of the key functionalities provided by Grid systems is the remote execution of applications. This paper introduces a research proposal on fault-tolerance mechanisms for the execution of sequential and message-passing parallel applications on the Grid. A service-based architecture called CPPC-G is proposed. The CPPC (Controller/Precompiler for Portable Checkpointing) framework is used to insert checkpointing instrumentation into the application code. CPPC-G services will be in charge of the submission and monitoring of the application execution, management of checkpoint files generated by CPPC-enabled applications, and detection and automatic restart of failed executions. The development of the CPPC-G architecture will involve research in different areas such as storage and management of data files (checkpoint files); automatic selection of suitable computing resources; reliable detection of execution failures and robustness issues to make the architecture fault-tolerant itself.
grid computation, parallel computation, checkpointing, fault tolerance, CPPC, Globus
X. C. Pardo, P. Gonz?lez, D. D?az and M. J. Mart?, "Application-Level Fault-Tolerance Solutions for Grid Computing," 2008 8th International Symposium on Cluster Computing and the Grid (CCGRID '08)(CCGRID), Lyon, 2008, pp. 554-559.