Proceedings 16th Workshop on Parallel and Distributed Simulation (2002)
May 12, 2002 to May 15, 2002
Francesco Quaglia, Università di Roma "La Sapienza"
Andrea Santoro, Università di Roma "La Sapienza"
Bruno Ciciani, Università di Roma "La Sapienza"
Recently, a Checkpointing and Communication Library (CCL) supporting optimistic parallel simulation on Myrinet-based clusters has been presented. Beyond classical low-latency message delivery, this library offers CPU-offloaded checkpointing based on the data transfer capabilities of a programmable DMA engine on board Myrinet network cards. A re-synchronization functionality is also supported for both logical (i.e., data consistency) and practical (i.e., hardware contention) reasons. It is implemented according to the following semantic: at any re-synchronization point, the simulation application is momentarily frozen until the last activated DMA-based checkpoint operation is completed. When long freezing periods are experienced, the checkpointing functionalities offered by CCL might not be fully effective in reducing the real checkpointing overhead at the simulation application level. To tackle this drawback, we present an alternative semantic for re-synchronization, namely conditional checkpoint abort, which freezes the application only if at least a threshold fraction of the state vector currently being checkpointed has already been transferred into the checkpoint buffer. Otherwise, the checkpoint operation is aborted and the simulation application is immediately allowed to proceed, thus avoiding excessive checkpointing overhead (due to freezing) at the simulation application level. We also report the results of an evaluation, carried out using classical parameterized synthetic benchmarks, which show that the execution speed of the simulation application can be significantly increased by the alternative semantic we propose.
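The decision rule described in the abstract can be illustrated with a small sketch. This is not the CCL API: the names `CheckpointTransfer`, `resync_original`, `resync_conditional_abort`, and `Action` are hypothetical, and the DMA transfer is modeled only as a byte counter. The sketch assumes the rule is exactly as stated above: freeze when at least a threshold fraction of the state vector has been transferred, otherwise abort.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    PROCEED = "proceed"  # no checkpoint in flight, or it already completed
    FREEZE = "freeze"    # wait until the in-flight checkpoint completes
    ABORT = "abort"      # discard the in-flight checkpoint and proceed


@dataclass
class CheckpointTransfer:
    """Hypothetical model of a DMA-based checkpoint in progress."""
    state_bytes: int        # size of the state vector being checkpointed
    transferred_bytes: int  # bytes already copied into the checkpoint buffer

    @property
    def fraction_done(self) -> float:
        if self.state_bytes == 0:
            return 1.0  # an empty state vector is trivially complete
        return self.transferred_bytes / self.state_bytes


def resync_original(transfer: Optional[CheckpointTransfer]) -> Action:
    """Original semantic: always wait for the last activated checkpoint."""
    if transfer is None or transfer.fraction_done >= 1.0:
        return Action.PROCEED
    return Action.FREEZE


def resync_conditional_abort(transfer: Optional[CheckpointTransfer],
                             threshold: float) -> Action:
    """Conditional checkpoint abort: freeze only if at least `threshold`
    of the state vector has already been transferred; otherwise abort."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be a fraction in [0, 1]")
    if transfer is None or transfer.fraction_done >= 1.0:
        return Action.PROCEED
    if transfer.fraction_done >= threshold:
        return Action.FREEZE
    return Action.ABORT
```

With a threshold of 0.5, a checkpoint that has moved 100 of 1000 bytes would be aborted under this sketch, while one that has moved 500 bytes would cause a freeze until completion. A threshold of 0 reproduces the original always-freeze behavior.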
Optimistic Simulation, Rollback Based Synchronization, Checkpointing, Performance Optimization
A. Santoro, B. Ciciani and F. Quaglia, "Conditional Checkpoint Abort: An Alternative Semantic for Re-synchronization in CCL," Proceedings 16th Workshop on Parallel and Distributed Simulation (PADS), Washington, D.C., 2002, pp. 143.