2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (2014)
Chicago, IL, USA
May 26, 2014 to May 29, 2014
Checkpoint/Restart is an indispensable fault tolerance technique commonly used by high-performance computing applications that run continuously for hours or days at a time. However, even with state-of-the-art checkpoint/restart techniques, high failure rates at large scale will limit application efficiency. To alleviate the problem, we consider using burst buffers. Burst buffers are dedicated storage resources positioned between the compute nodes and the parallel file system, and this new tier within the storage hierarchy fills the performance gap between node-local storage and parallel file systems. With burst buffers, an application can quickly store checkpoints with increased reliability. In this work, we explore how burst buffers can improve efficiency compared to using only node-local storage. To fully exploit the bandwidth of burst buffers, we develop a user-level InfiniBand-based file system (IBIO). We also develop performance models for coordinated and uncoordinated checkpoint/restart strategies, and we apply those models to investigate the best checkpoint strategy using burst buffers on future large-scale systems.
Checkpointing, Buffer storage, Servers, Bandwidth, Reliability, Instruction sets, Computational modeling
K. Sato et al., "A User-Level InfiniBand-Based File System and Checkpoint Strategy for Burst Buffers," 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Chicago, IL, USA, 2014, pp. 21-30.