Incorporating Fault Tolerance with Replication on Very Large Scale Grids
Eighth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT 2007)
Adelaide, Australia, December 03-December 06
ISBN: 0-7695-3049-4
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/PDCAT.2007.26
Providing fault tolerance for message-passing parallel applications in a distributed environment is the rule rather than the exception. Without fault tolerance, a single node failure can stop the whole computation, which then has to be restarted from the beginning. However, introducing fault tolerance imposes some overhead on the speedup that can be achieved. In this paper, we introduce a new technique called replication with cross-over packets to improve reliability and increase fault tolerance over Very Large Scale Grids (VLSG). This technique has the two-pronged effect of avoiding both a single point of failure and a single link of failure. We incorporate this new technique into the L-BSP model and show the possible speedup of a parallel process. We also derive the achievable speedup for some fundamental parallel algorithms using this technique.
Citation:
Elankovan Sundararajan, Aaron Harwood, Ramamohanarao Kotagiri, "Incorporating Fault Tolerance with Replication on Very Large Scale Grids," pdcat, pp.319-328, Eighth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT 2007), 2007