This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Active Replication of Multithreaded Applications
May 2006 (vol. 17 no. 5)
pp. 448-465

Abstract—Software-based active replication is expensive in terms of performance overhead. Multithreading can help improve performance; however, thread scheduling is a source of nondeterminism in replica behavior. To achieve strong replica consistency in multithreaded environments, this paper proposes intercepting mutex lock/unlock operations performed by threads on accessing the shared data and contributes with two algorithmic solutions: 1) a Loose Synchronization Algorithm (LSA), which captures the natural concurrency in a leader replica and projects it on follower replicas through interreplica communication, and 2) a Preemptive Deterministic Scheduler (PDS) algorithm, which removes the need for interreplica communication through the notion of round and by suspending threads when it is unable (yet) to schedule them deterministically. Failure behavior and performance of LSA and PDS implementations are evaluated in a triplicated system and compared with existing solutions. A performance evaluation indicates that LSA and PDS outperform existing solutions, with PDS offering lower throughput than LSA. A fault-injection campaign shows that PDS is more robust to errors due to the absence of interreplica communication. Hence, LSA and PDS represent a trade-off between performance and dependability. Finally, LSA and PDS are demonstrated in replicating the Apache Web server, a substantial real-world application.

[1] M. Cukier et al., “AQuA: An Adaptive Architecture that Provides Dependable Distributed Objects,” Proc. IEEE Symp. Reliable Distributed Systems, pp. 245-253, 1998.
[2] A. Borg et al., “Fault Tolerance under UNIX,” ACM Trans. Computer Systems, vol. 7, no. 1, pp. 1-24, 1989.
[3] T.C. Bressoud and F.B. Schneider, “Hypervisor-Based Fault Tolerance,” ACM Trans. Computer Systems, vol. 14, no. 1, pp. 80-107, 1996.
[4] P.A. Barrett et al., “The Delta-4 Extra Performance Architecture (XPA),” Proc. Int'l Symp. Fault-Tolerant Computing, pp. 481-488, 1990.
[5] L.E. Moser, P.M. Melliar-Smith, and P. Narasimhan, “Consistent Object Replication in the Eternal System,” Theory and Practice of Object Systems, vol. 4, no. 2, pp. 81-92, 1998.
[6] R. Jimenez-Peris, M. Patino-Martinez, and S. Arevalo, “Deterministic Scheduling for Transactional Multithreaded Replicas,” Proc. IEEE Symp. Reliable Distributed Systems, pp. 164-173, 2000.
[7] C. Basile, K. Whisnant, Z. Kalbarczyk, and R. Iyer, “Loose Synchronization of Multithreaded Replicas,” Proc. IEEE Symp. Reliable Distributed Systems, pp. 250-255, 2002.
[8] C. Basile, Z. Kalbarczyk, and R. Iyer, “Preemptive Deterministic Scheduling Algorithm for Multithreaded Replicas,” Proc. Int'l Conf. Dependable Systems and Networks, pp. 149-158, 2003.
[9] M. Hayden, “The Ensemble System,” PhD dissertation, Dept. of Computer Science, Cornell Univ. 1997.
[10] M.K. Reiter, “The Rampart Toolkit for Building High-Integrity Services,” Lecture Notes in Computer Science, vol. 938, pp. 99-110, 1994.
[11] G. Chockler, I. Keidar, and R. Vitenberg, “Group Communication Specifications: A Comprehensive Study,” ACM Computing Surveys, vol. 33, no. 4, pp. 427-469, 2001.
[12] R. Jimenez-Peris, M. Patino-Martinez, and G. Alonso, “Non-Intrusive, Parallel Recovery of Replicated Data,” Proc. IEEE Symp. Reliable Distributed Systems, pp. 150-159, 2002.
[13] C. Basile, Z. Kalbarczyk, and R. Iyer, “Active Replication of Multithreaded Replicas: Appendix,” http://www.computer.org/tpdsarchives.htm , 2005.
[14] L. Lamport, “Time, Clocks and the Ordering of Events in Distributed Systems,” Comm. ACM, vol. 21, no. 7, pp. 558-564, 1978.
[15] F.B. Schneider, “Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial,” ACM Computing Surveys, vol. 22, no. 4, pp. 299-319, 1990.
[16] C. Basile, Z. Kalbarczyk, K. Whisnant, and R. Iyer, “Active Replication of Multithreaded Applications,” Technical Report CRHC-02-01, Univ. of Illinois at Urbana-Champaign, 2002.
[17] D. Stott, B. Floering, Z. Kalbarczyk, and R. Iyer, “Dependability Assessment in Distributed Systems with Lightweight Fault Injectors in NFTAPE,” Proc. Int'l Computer Performance and Dependability Symp., 2000.
[18] E. Fuchs, “Validating the Fail-Silence Assumption of the MARS Architecture,” Proc. Dependable Computing for Critical Applications Conf., pp. 225-247, 1998.
[19] H. Madeira and J.G. Silva, “Experimental Evaluation of the Fail-Silent Behavior in Computers without Error Masking,” Proc. Int'l Symp. Fault-Tolerant Computing, pp. 350-359, 1994.
[20] M. Rimen, J. Ohlsson, and J. Torin, “On Microprocessor Error Behavior Modeling,” Proc. Int'l Symp. Fault-Tolerant Computing, pp. 76-85, 1994.
[21] C. Basile, L. Wang, Z. Kalbarczyk, and R. Iyer, “Group Communication Protocols under Errors,” Proc. IEEE Symp. Reliable Distributed Systems, pp. 35-44, 2003.
[22] S. Pleisch and A. Schiper, “FATOMAS— A Fault-Tolerant Mobile Agent System Based on the Agent-Dependent Approach,” Proc. Int'l Conf. Dependable Systems and Networks, pp. 215-224, 2001.
[23] G.D. Parrington, S.K. Shrivastava, S.M. Wheater, and M.C. Little, “The Design and Implementation of Arjuna,” Computing Systems, vol. 8, no. 2, pp. 255-308, 1995.
[24] R. Guerraoui, P. Felber, B. Garbinato, and K.R. Mazouni, “System Support for Object Groups,” Proc. ACM Conf. Object-Oriented Programming Systems, Languages, and Applications, pp. 244-258, 1998.

Index Terms:
Fault tolerance, replication, multithreading, nondeterminism, fault injection.
Citation:
Claudio Basile, Zbigniew Kalbarczyk, Ravishankar K. Iyer, "Active Replication of Multithreaded Applications," IEEE Transactions on Parallel and Distributed Systems, vol. 17, no. 5, pp. 448-465, May 2006, doi:10.1109/TPDS.2006.56
Usage of this product signifies your acceptance of the Terms of Use.