Parallel and Distributed Processing Symposium, International (2009)
May 23, 2009 to May 29, 2009
Greg Bronevetsky, Lawrence Livermore National Lab, USA
Daniel Marques, Ballista Securities, USA
Keshav Pingali, The University of Texas at Austin, USA
Sally McKee, Chalmers University of Technology, USA
Radu Rugina, VMWare, USA
As modern supercomputing systems reach the petaflop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures: applications periodically save their state and can restart computation from the saved state after a failure. Although many automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains more popular due to its superior performance. This paper improves the performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis, which works with both sequential and OpenMP applications, reduces checkpoint sizes by as much as 80% and enables asynchronous checkpointing.
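To make the idea of incremental checkpointing concrete, the following is a minimal sketch in Python. It is not the paper's method: the paper identifies modified state through a compiler analysis, whereas this sketch assumes a simple runtime scheme in which every update goes through a hypothetical `write` method that marks the variable as dirty. The class name `IncrementalCheckpointer` and the variable names `grid`, `coeffs`, and `step` are illustrative assumptions only.

```python
import copy


class IncrementalCheckpointer:
    """Hypothetical helper: saves only variables written since the last checkpoint."""

    def __init__(self, state):
        self.state = state        # name -> value (live application state)
        self.dirty = set(state)   # nothing has been saved yet, so all names are dirty
        self.checkpoints = []     # saved increments, oldest first

    def write(self, name, value):
        # All updates must go through this method so they are tracked.
        self.state[name] = value
        self.dirty.add(name)

    def checkpoint(self):
        # Copy only the variables modified since the previous checkpoint.
        increment = {name: copy.deepcopy(self.state[name]) for name in self.dirty}
        self.checkpoints.append(increment)
        self.dirty.clear()
        return len(increment)     # number of variables saved this time

    def restore(self):
        # Replay increments in order; later values overwrite earlier ones.
        recovered = {}
        for increment in self.checkpoints:
            recovered.update(copy.deepcopy(increment))
        return recovered


ckpt = IncrementalCheckpointer({"grid": [0.0] * 4, "coeffs": [1.0, 2.0], "step": 0})
first_size = ckpt.checkpoint()    # first checkpoint saves all three variables
ckpt.write("grid", [0.5] * 4)
ckpt.write("step", 1)
second_size = ckpt.checkpoint()   # "coeffs" is unchanged, so it is not saved again
recovered = ckpt.restore()
```

In this sketch the second checkpoint is smaller because the read-only `coeffs` array is skipped. A static analysis of the kind the abstract describes would aim to establish such facts at compile time rather than by tracking writes at run time.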
R. Rugina, G. Bronevetsky, K. Pingali, S. McKee and D. Marques, "Compiler-enhanced incremental checkpointing for OpenMP applications," 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Rome, 2009, pp. 1-12.