2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W) (2016)
June 28, 2016 to July 1, 2016
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/DSN-W.2016.46
Silent data corruptions (SDCs) are one of the most critical issues in modern HPC systems: they are "silent" by definition and give users and application developers no warning that a calculation has been corrupted. Significant effort has gone into characterizing, detecting, and tolerating SDCs. However, current approaches do not share a common understanding of what an SDC is, so it is difficult not only to evaluate their effectiveness but also to compare them with each other. This position paper argues that SDCs should be discussed at each layer of the system and confined to the goal of the approach in question. We provide a preliminary result that differentiates data corruptions across system layers, and show that application-specific correctness checks can tolerate about 50% of the errors that appear in the application output.
Hardware, Fault tolerance, Fault tolerant systems, Operating systems, Registers, Computer crashes, Electronic mail
B. Fang et al., "SDC is in the Eye of the Beholder: A Survey and Preliminary Study," 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W), Toulouse, France, 2016, pp. 72-76.