2012 41st International Conference on Parallel Processing (2012)
Pittsburgh, PA, USA
Sept. 10, 2012 to Sept. 13, 2012
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICPP.2012.18
A pool of distributed volunteer PCs presents an extremely hostile environment for the execution of communicating parallel codes because of system and network heterogeneity, varying availability, and frequent failures. Well-known methods for fault tolerance, specifically replication and checkpointing, are challenging to deploy and are not sufficient individually to provide continuous forward application progress. Because the failure of a single logical process leads to application failure, the degree of redundancy needed for long-running applications is too large to be practical. Checkpointing and rollback do not protect against slow and variable-speed nodes and are impractical when the system-wide MTBF is minutes or less, which is common for a moderate-size volunteer computing pool. The approach taken in this research is to exploit both techniques, which presents formidable challenges: efficient checkpointing of distributed replicated processes, dynamic management of redundancy, quick restart in a distributed environment, and others. The proposed solution also leverages node selection based on availability prediction. The integrated runtime system is shown to effectively execute moderate-size, coarse-grain, communicating codes in a worldwide distributed volunteer environment, a new milestone in volunteer computing. The results provide new insight into how multiple techniques interact and contribute to robustness. The programming model is based on one-sided Put/Get calls to an abstract global shared space that works seamlessly with replicated processes. A Replica Exchange Molecular Dynamics code is employed to drive the evaluation. The execution environment includes hosts on a university campus as well as hosts distributed around the world.
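The abstract's claim that replication alone needs an impractical degree of redundancy can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes that node failures are independent, that each replica fails with the same probability `p_fail` during the interval of interest, and that no failed replica is restarted. The function names and parameters are hypothetical.

```python
def survival_probability(num_processes: int, replicas: int, p_fail: float) -> float:
    """Probability that every logical process keeps at least one live replica.

    Assumes independent failures with identical probability p_fail per
    replica and no restart of failed replicas (illustrative assumptions).
    """
    # A logical process is lost only if all of its replicas fail.
    per_process_alive = 1.0 - p_fail ** replicas
    # The application survives only if no logical process is lost.
    return per_process_alive ** num_processes


def replicas_needed(num_processes: int, p_fail: float, target: float,
                    max_replicas: int = 64):
    """Smallest replication degree meeting the target survival probability.

    Returns None if the target is not reached within max_replicas.
    """
    for r in range(1, max_replicas + 1):
        if survival_probability(num_processes, r, p_fail) >= target:
            return r
    return None


if __name__ == "__main__":
    # Hypothetical numbers: 64 logical processes, 30% per-replica failure
    # probability over the interval, 99% desired survival probability.
    print(replicas_needed(64, 0.3, 0.99))
```

Under these assumptions the survival probability is (1 - p^r)^N for N logical processes with r replicas each, so the required r grows as the failure probability rises or the interval lengthens. This is consistent with the abstract's argument for combining replication with checkpointing rather than relying on either alone.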
Checkpointing, Servers, Availability, Redundancy, Programming, Fault tolerant systems, PC grids, Volunteer computing, distributed computing, fault tolerance, BOINC
Hien Nguyen, Eshwar Pedamallu, Jaspal Subhlok, Edgar Gabriel, Qian Wang, Margaret S. Cheung, David Anderson, "An Execution Environment for Robust Parallel Computing on Volunteer PC Grids", 2012 41st International Conference on Parallel Processing, pp. 158-167, 2012, doi:10.1109/ICPP.2012.18