24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007)
Modeling the Impact of Checkpoints on Next-Generation Systems
San Diego, California, USA
September 24-27, 2007
ISBN: 0-7695-3025-7
Ron A. Oldfield, Sandia National Laboratories
Sarala Arunagiri, The University of Texas at El Paso, USA
Patricia J. Teller, The University of Texas at El Paso, USA
Seetharami Seelam, IBM TJ Watson Research Center, USA
Maria Ruiz Varela, The University of Texas at El Paso, USA
Rolf Riesen, Sandia National Laboratories
Philip C. Roth, Oak Ridge National Laboratory
The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands of processors. For application-driven, periodic checkpoint operations, the state of the art does not provide a solution that scales to next-generation systems. We demonstrate this by using mathematical modeling to compute a lower bound on the impact of these approaches on the performance of applications executed on three massive-scale, in-production DOE systems and a theoretical petaflop system. We also adapt the model to investigate a proposed optimization that makes use of "lightweight" storage architectures and overlay networks to overcome the storage system bottleneck. Our results indicate that (1) as we approach the scale of next-generation systems, traditional checkpoint/restart approaches will increasingly impact application performance, accounting for over 50% of total application execution time; (2) although our alternative approach improves performance, it has limitations of its own; and (3) there is a critical need for new approaches to fault tolerance that allow continuous computing with minimal impact on application scalability.
Citation:
Ron A. Oldfield, Sarala Arunagiri, Patricia J. Teller, Seetharami Seelam, Maria Ruiz Varela, Rolf Riesen, Philip C. Roth, "Modeling the Impact of Checkpoints on Next-Generation Systems," msst, pp. 30-46, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), 2007