Cluster Computing and the Grid, IEEE International Symposium on (2008)
May 19, 2008 to May 22, 2008
ISBN: 978-0-7695-3156-4
pp: 554-559
ABSTRACT
One of the key functionalities provided by Grid systems is the remote execution of applications. This paper introduces a research proposal on fault-tolerance mechanisms for the execution of sequential and message-passing parallel applications on the Grid. A service-based architecture called CPPC-G is proposed. The CPPC (Controller/Precompiler for Portable Checkpointing) framework is used to insert checkpointing instrumentation into the application code. CPPC-G services will be in charge of submitting and monitoring application executions, managing the checkpoint files generated by CPPC-enabled applications, and detecting and automatically restarting failed executions. The development of the CPPC-G architecture will involve research in several areas: storage and management of data files (checkpoint files); automatic selection of suitable computing resources; reliable detection of execution failures; and robustness issues, so that the architecture is itself fault-tolerant.
INDEX TERMS
grid computation, parallel computation, checkpointing, fault tolerance, CPPC, Globus
CITATION

X. C. Pardo, P. González, D. Díaz and M. J. Martín, "Application-Level Fault-Tolerance Solutions for Grid Computing," 2008 8th International Symposium on Cluster Computing and the Grid (CCGRID '08), Lyon, 2008, pp. 554-559.
doi:10.1109/CCGRID.2008.38