2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing
Assessing HPC Failure Detectors for MPI Jobs
Munich, Germany
February 15-February 17
ISBN: 978-0-7695-4633-9
Authors: Kishor Kharbas, Donghoon Kim, Torsten Hoefler, Frank Mueller
Pages: 81-88
ISSN: 1066-6192
Publisher: IEEE Computer Society, Los Alamitos, CA, USA
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/PDP.2012.11
Reliability is one of the challenges faced by exascale computing. Given current mean time between failures (MTBF) projections, components are likely to fail during large-scale executions. To cope with failures, resilience methods have been proposed as either explicit or transparent techniques. For the transparent techniques, this paper studies the challenge of fault detection. This work contributes a study of generic fault detection capabilities at the MPI level and beyond. The objective is to assess different detectors, which ultimately may or may not be implemented within the application's runtime layer. A first approach uses a periodic liveness check, while a second method performs sporadic checks upon communication activities. The contributions of this paper are two-fold: (a) We provide generic interposing of MPI applications for fault detection. (b) We experimentally compare periodic and sporadic methods for liveness checking. We show that the sporadic approach, even though it imposes lower bandwidth requirements and checks less frequently, results in equal or worse application performance than a periodic liveness test for larger numbers of nodes. We further show that performing liveness checks separately from the MPI application results in lower overhead than interposition, as demonstrated by our prototypes. Hence, we promote separate periodic fault detection as the superior approach for fault detection.
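To make the distinction between the two detector styles concrete, the following is a toy discrete-time simulation in Python. It is an illustrative sketch only, not the authors' prototype: the class names (`PeriodicDetector`, `SporadicDetector`), the `simulate` helper, the probe model, and all parameter values are assumptions introduced here. It only counts probe messages and detection time; it does not model the application-performance effects that the paper measures.

```python
from dataclasses import dataclass


@dataclass
class Node:
    rank: int
    alive: bool = True


class Detector:
    """Shared bookkeeping: count probes and record the first detection time."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.probes = 0
        self.detected_at = None

    def _probe(self, rank, now):
        # One probe is one liveness message (hypothetical cost model).
        self.probes += 1
        if not self.nodes[rank].alive and self.detected_at is None:
            self.detected_at = now

    def on_tick(self, now):
        pass

    def on_send(self, now, dest):
        pass


class PeriodicDetector(Detector):
    """Probes every node at a fixed interval, independent of application traffic."""

    def __init__(self, nodes, interval):
        super().__init__(nodes)
        self.interval = interval

    def on_tick(self, now):
        if now % self.interval == 0:
            for node in self.nodes:
                self._probe(node.rank, now)


class SporadicDetector(Detector):
    """Probes a peer only when the application is about to communicate with it."""

    def on_send(self, now, dest):
        self._probe(dest, now)


def simulate(detector, sends, fail_rank, fail_time, steps):
    """Run `steps` time units; `sends` maps a time step to a destination rank."""
    for now in range(steps):
        if now == fail_time:
            detector.nodes[fail_rank].alive = False
        detector.on_tick(now)
        if now in sends:
            detector.on_send(now, sends[now])
    return detector.probes, detector.detected_at


def make_nodes(n):
    return [Node(rank) for rank in range(n)]


# Assumed workload: three sends, node 2 fails at time 7.
SENDS = {3: 1, 12: 2, 18: 0}
periodic = simulate(PeriodicDetector(make_nodes(4), interval=5),
                    SENDS, fail_rank=2, fail_time=7, steps=20)
sporadic = simulate(SporadicDetector(make_nodes(4)),
                    SENDS, fail_rank=2, fail_time=7, steps=20)
print(periodic)  # (16, 10): 16 probes, failure detected at time 10
print(sporadic)  # (3, 12): 3 probes, failure detected at time 12
```

In this made-up workload the sporadic detector sends far fewer probes but notices the failure only when the application next communicates with the failed node, which matches the bandwidth and frequency trade-off described in the abstract.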
Citation:
Kishor Kharbas, Donghoon Kim, Torsten Hoefler, Frank Mueller, "Assessing HPC Failure Detectors for MPI Jobs," 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pp. 81-88, 2012.
