2015 IEEE 22nd International Conference on High Performance Computing (HiPC) (2015)
Dec. 16, 2015 to Dec. 19, 2015
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/HiPC.2015.26
Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with its own cost and recall (the fraction of all errors that it actually detects). The main contribution of this paper is to characterize the optimal computational pattern for an application: which detector(s) to use, how many detectors of each type to use, and the length of the work segment that precedes each of them. We conduct a comprehensive complexity analysis of this optimization problem, showing NP-completeness and designing an FPTAS (Fully Polynomial-Time Approximation Scheme). On the practical side, we provide a greedy algorithm whose performance is shown to be close to optimal for a realistic set of evaluation scenarios.
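To make the cost/recall trade-off concrete, here is a minimal Python sketch. It is not the paper's model or its greedy algorithm. It assumes that an error is already present before every detector in the sequence runs and that detectors miss errors independently of each other. The names `Detector`, `combined_recall`, and `total_cost`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Detector:
    """A hypothetical silent-error detector."""
    name: str
    cost: float    # time to run the detector once (illustrative units)
    recall: float  # fraction of errors it detects, in [0, 1]


def combined_recall(detectors: Sequence[Detector]) -> float:
    """Probability that at least one detector catches an error.

    Assumption (not from the paper): the error exists before all
    detectors run, and each detector misses independently.
    """
    miss = 1.0
    for d in detectors:
        miss *= 1.0 - d.recall
    return 1.0 - miss


def total_cost(detectors: Sequence[Detector]) -> float:
    """Total time spent running every detector in the sequence once."""
    return sum(d.cost for d in detectors)


# Illustrative values only: one cheap, low-recall detector used twice
# versus one expensive, higher-recall detector used once.
cheap = Detector("cheap", cost=1.0, recall=0.5)
strong = Detector("strong", cost=3.0, recall=0.8)

two_cheap = [cheap, cheap]
one_strong = [strong]

print(combined_recall(two_cheap), total_cost(two_cheap))
print(combined_recall(one_strong), total_cost(one_strong))
```

Under these assumptions, two runs of the cheap detector give a combined recall of 0.75 at cost 2.0, while the strong detector gives 0.8 at cost 3.0. Weighing such alternatives, along with the length of the work segments between detectors, is the kind of choice the optimization problem formalizes.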
Detectors, Protocols, Checkpointing, Greedy algorithms, Interpolation, Time series analysis, Redundancy
"Which Verification for Soft Error Detection?," 2015 IEEE 22nd International Conference on High Performance Computing (HiPC), Bengaluru, India, 2016, pp. 2-11.