2015 IEEE 22nd International Conference on High Performance Computing (HiPC) (2015)
Dec. 16, 2015 to Dec. 19, 2015
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/HiPC.2015.26
Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with a given cost and recall (the fraction of all errors that it actually detects). The main contribution of this paper is to characterize the optimal computational pattern for an application: which detector(s) to use, how many detectors of each type to use, and the length of the work segment that precedes each of them. We conduct a comprehensive complexity analysis of this optimization problem, showing NP-completeness and designing an FPTAS (Fully Polynomial-Time Approximation Scheme). On the practical side, we provide a greedy algorithm whose performance is shown to be close to optimal for a realistic set of evaluation scenarios.
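To make the cost/recall trade-off concrete, here is a minimal sketch of how several imperfect detectors could combine. It is not the paper's model or its greedy algorithm: it assumes, purely for illustration, that all detectors check for the same error and miss it independently of one another. The function names `combined_recall` and `total_cost` are hypothetical.

```python
from math import prod


def combined_recall(recalls):
    """Probability that at least one detector catches a given error.

    ASSUMPTION (not from the paper): every detector examines the same
    error, and the detectors miss it independently of one another.
    """
    # The error escapes only if every detector misses it.
    miss_probability = prod(1.0 - r for r in recalls)
    return 1.0 - miss_probability


def total_cost(costs):
    """Total time spent running the detectors, in the same unit as the inputs."""
    return sum(costs)


if __name__ == "__main__":
    # Two hypothetical detectors, each catching half of all errors.
    print(combined_recall([0.5, 0.5]))
    print(total_cost([1.0, 1.0]))
```

Under these assumptions, stacking cheap low-recall detectors raises the overall detection probability while the costs add up, which is the tension the optimization problem described above has to resolve.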
Detectors, Protocols, Checkpointing, Greedy algorithms, Interpolation, Time series analysis, Redundancy
L. Bautista-Gomez, A. Benoit, A. Cavelan, S. K. Raina, Y. Robert and H. Sun, "Which Verification for Soft Error Detection?," 2015 IEEE 22nd International Conference on High Performance Computing (HiPC), Bengaluru, India, 2016, pp. 2-11.