2015 IEEE 22nd International Conference on High Performance Computing (HiPC) (2015)

Bengaluru, India

Dec. 16, 2015 to Dec. 19, 2015

ISBN: 978-1-4673-8487-2

pp: 2-11

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/HiPC.2015.26

ABSTRACT

Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with its own cost and recall (the fraction of all errors that it actually detects). The main contribution of this paper is to characterize the optimal computational pattern for an application: which detector(s) to use, how many detectors of each type to use, and the length of the work segment that precedes each of them. We conduct a comprehensive complexity analysis of this optimization problem, showing NP-completeness and designing an FPTAS (Fully Polynomial-Time Approximation Scheme). On the practical side, we provide a greedy algorithm whose performance is shown to be close to optimal for a realistic set of evaluation scenarios.
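
To make the cost/recall trade-off concrete, here is a minimal sketch of how chaining several imperfect detectors might raise the overall chance of catching an error. It is not the paper's model or its greedy algorithm: the function name `combined_recall`, the example costs and recalls, and the assumption that detectors miss errors independently are all hypothetical and chosen only for illustration.

```python
from math import prod


def combined_recall(recalls):
    """Probability that at least one detector catches a given error.

    Assumption (not from the paper): each detector misses an error
    independently of the others, so the miss probabilities multiply.
    """
    return 1.0 - prod(1.0 - r for r in recalls)


# Hypothetical detectors as (cost, recall) pairs; the numbers are made up.
cheap = (1.0, 0.5)
expensive = (10.0, 0.95)

# Three cheap detectors: total cost 3.0, combined recall 1 - 0.5**3.
three_cheap_cost = 3 * cheap[0]
three_cheap_recall = combined_recall([cheap[1]] * 3)

print(three_cheap_cost, three_cheap_recall)
print(expensive[0], expensive[1])
```

Under these made-up numbers, three cheap detectors cost less than one expensive detector but also catch fewer errors, which is the kind of tension the optimization problem in the abstract has to resolve.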

INDEX TERMS

Detectors, Protocols, Checkpointing, Greedy algorithms, Interpolation, Time series analysis, Redundancy

CITATION

"Which Verification for Soft Error Detection?,"

*2015 IEEE 22nd International Conference on High Performance Computing (HiPC)*, Bengaluru, India, 2016, pp. 2-11.

doi:10.1109/HiPC.2015.26