2015 IEEE 22nd International Conference on High Performance Computing (HiPC) (2015)
Bengaluru, India
Dec. 16, 2015 to Dec. 19, 2015
ISBN: 978-1-4673-8487-2
pp: 2-11
ABSTRACT
Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with a given cost and recall (fraction of all errors that are actually detected). The main contribution of this paper is to characterize the optimal computational pattern for an application: which detector(s) to use, how many detectors of each type to use, together with the length of the work segment that precedes each of them. We conduct a comprehensive complexity analysis of this optimization problem, showing NP-completeness and designing an FPTAS (Fully Polynomial-Time Approximation Scheme). On the practical side, we provide a greedy algorithm whose performance is shown to be close to the optimal for a realistic set of evaluation scenarios.
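As a rough illustration of the cost/recall trade-off described in the abstract, the sketch below computes the total cost and the combined recall of a set of detectors. It is not the paper's model or its greedy algorithm: it assumes, for illustration only, that detectors miss errors independently and that the error is already present when every detector runs. The names `combined_recall` and `pattern_summary`, and the example detector values, are hypothetical.

```python
from math import prod


def combined_recall(recalls):
    """Chance that at least one detector catches an existing error.

    Assumption (not from the source): detectors miss errors independently,
    so the overall miss probability is the product of the individual ones.
    """
    return 1.0 - prod(1.0 - r for r in recalls)


def pattern_summary(detectors, counts):
    """Return (total_cost, combined_recall) for a hypothetical pattern.

    detectors: dict mapping a detector name to a (cost, recall) pair
    counts:    dict mapping a detector name to how many times it is used
    """
    total_cost = sum(detectors[name][0] * k for name, k in counts.items())
    recalls = [detectors[name][1] for name, k in counts.items() for _ in range(k)]
    return total_cost, combined_recall(recalls)


if __name__ == "__main__":
    # Hypothetical values: a cheap low-recall detector and a costly high-recall one.
    detectors = {"cheap": (2.0, 0.5), "guaranteed": (10.0, 0.9)}
    cost, recall = pattern_summary(detectors, {"cheap": 2, "guaranteed": 1})
    print(f"cost={cost:.1f} recall={recall:.3f}")
```

Under these assumptions, two uses of the cheap detector reach a combined recall of 0.75 at a cost of 4.0, which hints at why mixing detector types and counts is a genuine optimization problem.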
INDEX TERMS
Detectors, Protocols, Checkpointing, Greedy algorithms, Interpolation, Time series analysis, Redundancy
CITATION

"Which Verification for Soft Error Detection?," 2015 IEEE 22nd International Conference on High Performance Computing (HiPC), Bengaluru, India, 2016, pp. 2-11.
doi:10.1109/HiPC.2015.26