SC16: International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2016)
Salt Lake City, Utah, USA
Nov. 13, 2016 to Nov. 18, 2016
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/SC.2016.41
Supercomputing platforms are expected to have higher failure rates in the future because of scaling and power concerns. The memory and performance impact of these failures may vary with error types and failure modes. Therefore, localized recovery schemes will be important for scientific computations, including failure modes where application intervention is suitable for recovery. We present a resiliency methodology for applications using structured adaptive mesh refinement, where failure modes map to granularities within the application for detection and correction. This approach also enables parameterization of cost for differentiated recovery. The cost model is built with tuning parameters that can be used to customize the strategy for different failure rates in different computing environments. We also show that this approach can make recovery cost proportional to the failure rate.
Arrays, Libraries, Bars, Contamination, Resilience, Mathematical model, Transient analysis
A. Dubey, H. Fujita, D. T. Graves, A. Chien and D. Tiwari, "Granularity and the Cost of Error Recovery in Resilient AMR Scientific Applications," SC16: International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Salt Lake City, UT, USA, 2016, pp. 492-501.