2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)
Fault Tolerance in Cluster Federations with O2P-CF
May 19-May 22
ISBN: 978-0-7695-3156-4
Fault tolerance is one of the key issues for large-scale applications executed on high-performance computing systems. In a cluster federation, clusters are gathered to provide huge computing power. To work efficiently on such systems, network characteristics have to be taken into account: the latency between two nodes of different clusters is much higher than the latency between two nodes of the same cluster. In this paper, we present O2P-CF, a message logging protocol well suited to providing fault tolerance for message passing applications executed on cluster federations. O2P-CF is based on the combination of O2P, an extremely optimistic message logging protocol, with a pessimistic message logging protocol.
Index Terms:
Cluster federation, fault tolerance, message passing application, message logging
Citation:
Thomas Ropars, Christine Morin, "Fault Tolerance in Cluster Federations with O2P-CF," 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pp. 807-812, 2008