2010 IEEE Workshop on Principles of Advanced and Distributed Simulation (2010)
May 17, 2010 to May 19, 2010
Zengxiang Li, Sch. of Comput. Eng., Nanyang Technol. Univ., Singapore, Singapore
Wentong Cai, Sch. of Comput. Eng., Nanyang Technol. Univ., Singapore, Singapore
Stephen John Turner, Sch. of Comput. Eng., Nanyang Technol. Univ., Singapore, Singapore
Ke Pan, Sch. of Comput. Eng., Nanyang Technol. Univ., Singapore, Singapore
A large-scale HLA-based simulation (federation) is composed of a large number of simulation components (federates), which may be developed by different participants and executed at different locations. These federates can fail for various reasons, and the risk of federation failure increases with the number of federates in the federation. This paper proposes a fault tolerance mechanism that tolerates crash-stop failures of federates. By exploiting the decoupled federate architecture, federate failures can be masked from the federation, and recovery can take place without interrupting the execution of other federates. A basic state recovery protocol is first proposed, which recovers the state of the failed federate from the checkpoint and message log taken before the failure. An optimized protocol is then developed to accelerate state recovery. Experiments verify that the proposed mechanism provides correct failure recovery, and the results indicate that the optimized protocol can considerably outperform the basic one.
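The abstract does not spell out the recovery protocols themselves. As a rough illustration of the general idea behind checkpoint-plus-message-log recovery, the following minimal sketch restores a saved state and then replays the messages logged since that checkpoint. The `Federate` class, its integer state, and the `recover` function are hypothetical names chosen for this example; they do not come from the paper or from any HLA runtime API.

```python
from dataclasses import dataclass, field


@dataclass
class Federate:
    """Toy federate whose state is a running total (hypothetical, for illustration)."""
    total: int = 0
    checkpoint: int = 0                      # state saved at the last checkpoint
    log: list = field(default_factory=list)  # messages received since that checkpoint

    def receive(self, value: int) -> None:
        self.log.append(value)  # log first, so the message survives a crash
        self.total += value

    def take_checkpoint(self) -> None:
        self.checkpoint = self.total
        self.log.clear()        # earlier messages are now covered by the checkpoint


def recover(checkpoint: int, log: list) -> Federate:
    """Rebuild a crashed federate: restore the checkpoint, then replay the log."""
    restored = Federate(total=checkpoint, checkpoint=checkpoint)
    for value in log:
        restored.receive(value)
    return restored


if __name__ == "__main__":
    f = Federate()
    f.receive(3)
    f.take_checkpoint()
    f.receive(4)
    f.receive(5)
    # The saved checkpoint and a copy of the log stand in for stable storage.
    recovered = recover(f.checkpoint, list(f.log))
    assert recovered.total == f.total
```

In this sketch the replay cost grows with the length of the log, which is one reason a recovery procedure might be worth accelerating; how the paper's optimized protocol achieves this is not described in the abstract.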
message logging, federate fault tolerance, HLA-based simulation, fault tolerance mechanism, crash-stop failures
Ke Pan, Zengxiang Li, S. J. Turner and Wentong Cai, "Federate Fault Tolerance in HLA-Based Simulation," 2010 IEEE Workshop on Principles of Advanced and Distributed Simulation (PADS), Atlanta, GA, 2010, pp. 1-10.