2008 IEEE Fourth International Conference on eScience (2008)
Dec. 7, 2008 to Dec. 12, 2008
There exists a class of scientific applications for which utilizing distributed resources is critical for reducing the time-to-solution. In this paper, we discuss a specific class of applications - Replica-Exchange simulations - where the orchestration of many distributed jobs in a dynamic and inherently unreliable distributed environment is essential for successful completion. We describe the design, development and deployment of a unique framework for constructing fault-tolerant distributed simulations. The framework consists of two primary components - SAGA and Migol. SAGA is a high-level programmatic abstraction layer that provides a standardised interface for the primary distributed functionality required for application development. We present details of newly developed functionality in SAGA - the Checkpoint and Recovery (CPR) API. Migol is an adaptive middleware that supports fault tolerance in distributed applications by transparently recovering them from checkpoint files. In addition to describing the integration of SAGA-CPR with the Migol infrastructure, we outline our experiences with running a large-scale, general-purpose Replica-Exchange application in a production distributed environment.
checkpointing, data structures, middleware, software fault tolerance
A. Luckow, S. Jha, J. Kim, A. Merzky and B. Schnor, "Distributed Replica-Exchange Simulations on Production Environments Using SAGA and Migol," 2008 IEEE Fourth International Conference on eScience (E-SCIENCE), Indianapolis, IN, 2008, pp. 253-260.