A Timeout-Based Message Ordering Protocol for a Lightweight Software Implementation of TMR Systems
January 2004 (vol. 15 no. 1)
pp. 53-65

Abstract—Replicated processing with majority voting is a well-known method for achieving reliability and availability. Triple Modular Redundant (TMR) processing is the most commonly used version of that method. Replicated processing requires that the replicas reach agreement on the order in which input requests are to be processed. Almost all synchronous and deterministic ordering protocols published in the literature are time-based in the sense that they require replicas' clocks to be kept synchronized within some known bound. We present a protocol for TMR systems that is based on timeouts and does not require clocks to be kept in bounded synchronism. Our design efforts focus on keeping the ordering delays small, without an unnecessary increase in message overhead. Consequently, we are able to show that no symmetric protocol that works only with unsynchronized clocks can provide a smaller worst-case delay. We also demonstrate through analysis and experiments that our protocol is faster than a time-based one of identical message complexity in certain situations which can prevail in many application settings.
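
As a minimal illustration of the majority voting that the abstract refers to, the sketch below votes on the outputs of three replicas. It is not the ordering protocol presented in the paper; the function name `majority_vote` and the use of `None` to signal "no majority" are assumptions made for this example. The vote is only meaningful when all correct replicas have processed the same requests in the same order, which is the agreement problem the paper addresses.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas.

    Hypothetical helper for illustration: returns None when no value
    is reported by more than half of the replicas.
    """
    if not outputs:
        return None
    # most_common(1) yields the single (value, count) pair with the highest count.
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority means more than half of all replicas agree.
    return value if count > len(outputs) // 2 else None


if __name__ == "__main__":
    # Three replicas, one of which returns a faulty result.
    print(majority_vote(["commit", "commit", "abort"]))
    # Three replicas that all disagree: no majority exists.
    print(majority_vote(["a", "b", "c"]))
```

With three replicas, a single faulty output is outvoted by the two correct ones, provided the correct replicas are deterministic and were given identically ordered inputs.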

Index Terms:
Byzantine failures, fault tolerance, Triple Modular Redundancy (TMR), process replication, agreement, message ordering, physical and logical clocks.
Citation:
Paul D. Ezhilchelvan, Francisco V. Brasileiro, Neil A. Speirs, "A Timeout-Based Message Ordering Protocol for a Lightweight Software Implementation of TMR Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 1, pp. 53-65, Jan. 2004, doi:10.1109/TPDS.2004.1264786