2017 13th European Dependable Computing Conference (EDCC)
Sept. 4, 2017 to Sept. 8, 2017
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/EDCC.2017.23
As the number of components in high-performance computing (HPC) systems continues to grow, the number of vehicles for soft errors will rise with it. Petascale research has shown that soft errors on supercomputers can occur as often as several times per day, and this rate will only increase in the exascale era. Because of this frequency, the resilience community has taken an interest in algorithmic resilience as a means of reliable computing in faulty environments. Probabilistic algorithms in particular have generated interest because of their imprecise nature and their ability to handle incorrect guesses. In this paper, we analyze the intrinsic resilience of a probabilistic Top K selection algorithm to silent data corruption caused by a single event upset. We introduce a new paradigm of analytically quantifying an algorithm's resilience as a function of its inputs, which permits a precise comparison of the resilience of competing algorithms. In addition, we discuss the implications of our findings for the resilience of probabilistic algorithms as a whole in comparison to their deterministic counterparts.
parallel processing, probability, software fault tolerance
R. Slechta, L. Monroe, N. DeBardeleben, Q. Guan, J. Wendelberger and S. Michalak, "Resilience Analysis of Top K Selection Algorithms," 2017 13th European Dependable Computing Conference (EDCC), Geneva, Switzerland, 2017, pp. 42-49.