SC16: International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2016)
Salt Lake City, Utah, USA
Nov. 13, 2016 to Nov. 18, 2016
ISSN: 2167-4337
ISBN: 978-1-4673-8815-3
pp: 492-501
ABSTRACT
Supercomputing platforms are expected to have larger failure rates in the future because of scaling and power concerns. The memory and performance impact may vary with error types and failure modes. Therefore, localized recovery schemes will be important for scientific computations, including failure modes where application intervention is suitable for recovery. We present a resiliency methodology for applications using structured adaptive mesh refinement, where failure modes map to granularities within the application for detection and correction. This approach also enables parameterization of cost for differentiated recovery. The cost model is built with tuning parameters that can be used to customize the strategy for different failure rates in different computing environments. We also show that this approach can make recovery cost proportional to the failure rate.
INDEX TERMS
Arrays, Libraries, Bars, Contamination, Resilience, Mathematical model, Transient analysis, High performance computing, Scientific computing
CITATION
Anshu Dubey, Hajime Fujita, Daniel T. Graves, Andrew Chien, Devesh Tiwari, "Granularity and the Cost of Error Recovery in Resilient AMR Scientific Applications", SC16: International Conference for High Performance Computing, Networking, Storage and Analysis (SC), vol. 00, pp. 492-501, 2016, doi:10.1109/SC.2016.41