From the September 2014 Issue
Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System
Kathryn Mohror, Adam Moody, Greg Bronevetsky, and Bronis R. de Supinski
High-performance computing (HPC) systems are growing more powerful by utilizing more components. As the system mean time before failure correspondingly drops, applications must checkpoint frequently to make progress. However, at scale, the cost of checkpointing becomes prohibitive. A solution to this problem is multilevel checkpointing, which employs multiple types of checkpoints in a single run. Lightweight checkpoints can handle the most common failure modes, while more expensive checkpoints can handle severe failures. We designed a multilevel checkpointing library, the Scalable Checkpoint/Restart (SCR) library, that writes lightweight checkpoints to node-local storage in addition to the parallel file system. We present probabilistic Markov models of SCR's performance. We show that on future large-scale systems, SCR can lead to a gain in machine efficiency of up to 35 percent, and reduce the load on the parallel file system by a factor of two. Additionally, we predict that checkpoint scavenging, or only writing checkpoints to the parallel file system on application termination, can reduce the load on the parallel file system by 20x on today's systems and still maintain high application efficiency.
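As a rough illustration of the multilevel idea described in the abstract, the minimal Python sketch below takes a cheap node-local checkpoint at every checkpoint step and sends only every Nth one to the parallel file system. The level names, the 1-in-10 ratio, and the function names are assumptions chosen for illustration; they are not SCR's API and do not come from the paper's Markov models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    every: int  # this level is due on every `every`-th checkpoint


# Hypothetical two-level configuration (illustrative values only):
# a lightweight node-local checkpoint each time, and every 10th
# checkpoint written to the parallel file system instead.
LEVELS = (
    Level("node-local", 1),
    Level("parallel-file-system", 10),
)


def checkpoint_level(index, levels=LEVELS):
    """Return the name of the most expensive level due at 1-based `index`."""
    due = [lvl for lvl in levels if index % lvl.every == 0]
    # The level with the largest period is treated as the most expensive one.
    return max(due, key=lambda lvl: lvl.every).name


def schedule(n_checkpoints, levels=LEVELS):
    """List the level chosen for checkpoints 1..n_checkpoints."""
    return [checkpoint_level(i, levels) for i in range(1, n_checkpoints + 1)]


if __name__ == "__main__":
    plan = schedule(100)
    pfs_writes = plan.count("parallel-file-system")
    print(f"{pfs_writes} of {len(plan)} checkpoints reach the parallel file system")
```

In this toy schedule only one checkpoint in ten touches the parallel file system, which is the kind of load reduction the abstract refers to; the actual ratios reported in the paper depend on failure rates and checkpoint costs that this sketch does not model.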
Editorials and Announcements
- According to Thomson Reuters' 2013 Journal Citation Report, TPDS has an impact factor of 2.173.
- TPDS celebrates its 25th Anniversary. Editor-in-Chief David A. Bader says, "Congratulations to TPDS on its Silver Jubilee! For 25 years, TPDS has been the parallel and distributed computing community's flagship journal for research breakthroughs!"
- Get Your Journals as eBooks for Free
- Print on Demand is Now Available for OnlinePlus Titles
- eBooks of issues of TPDS can now be downloaded from the Computer Society Digital Library
- State of the Journal (Jan 2014)
- Editor's Note: EIC Farewell and New EIC Introduction (Dec 2013)
- Editor's Note (Nov 2013)
- Editor's Note (Jan 2013)
- Editor's Note (April 2012)
- Editor's Note (January 2012)
- Editorial: Media Center (November 2011)
- Editor's Note: How to Write Research Articles in Computing and Engineering Disciplines by Ivan Stojmenovic
- Full Supplemental PDF of Editor's Note: How to Write Research Articles in Computing and Engineering Disciplines by Ivan Stojmenovic and Veljko Milutinovic (PDF)
- Special Issue on Trust, Security, and Privacy in Parallel and Distributed Systems (Feb 2014)
- Special Issue on Cloud Computing (June 2013)
- Special Issue on Cyber-Physical Systems (CPS) (Sept 2012)
- Special Section on Many-Task Computing (June 2011)
Listen to the OnlinePlus Podcast: Computer Society Publishing—two more titles migrate to OnlinePlus™ in 2012.
In this podcast, VP of Publications, David Alan Grier talks about Transactions on Mobile Computing and Transactions on Parallel and Distributed Systems migrating to OnlinePlus™.
TPDS is indexed in ISI
IEEE Transactions on Parallel and Distributed Systems (TPDS) is a scholarly archival journal published monthly. Parallel and distributed computing are foundational areas of research and technology for rapidly advancing computer systems and their applications.