Multistep Interactive Convergence: An Efficient Approach to the Fault-Tolerant Clock Synchronization of Large Multicomputers
December 1998 (vol. 9 no. 12)
pp. 1195-1212

Abstract—We present a new approach for fault-tolerant internal clock synchronization in multicomputer systems employing not-completely connected networks (NCCNs). The approach is referred to as multistep interactive convergence and is locally implemented in each multicomputer node by a time server process (TSP). We describe a specific algorithm that uses multistep interactive convergence and bases its operation on a logical mapping of the system's TSPs into an m-dimensional array. A TSP executes m steps per round of synchronization, with each step including a call to an interactive convergence procedure. For any TSP, clock readings in step i are gathered only from TSPs with which it shares a row along dimension i of the array. Hence, a TSP reads clocks only from a small subset of the TSPs in the system, which reduces the number of messages by orders of magnitude over a conventional interactive convergence algorithm in which reliable all-to-all broadcast of clock values is done. The algorithm can be used in systems of arbitrary topology and provides the added benefit of increased locality of communication in regular NCCNs such as hypercubes and tori. These advantages can be combined with a variety of message staggering mechanisms to maintain network contention at a minimum. We present expressions for the maximum clock skew, maximum clock drift, maximum clock discontinuity, and number of messages produced by the algorithm, and show that it tolerates arbitrary faults. A comparison with other algorithms that elucidates the advantages of multistep interactive convergence is also provided.
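The row-wise structure described in the abstract can be illustrated with a small fault-free simulation. This is only a sketch under stated assumptions, not the authors' algorithm: the function names (`coords`, `row_peers`, `egocentric_mean`, `sync_round`, `reads_per_round`) and the threshold parameter `delta` are hypothetical, the convergence function is the classic egocentric average from interactive convergence rather than the paper's exact procedure, clock readings are idealized as instantaneous and exact, and the counts below are clock reads per round rather than the paper's message-count expressions.

```python
def coords(index, dims):
    """Map a flat TSP index to coordinates in an array with the given dimension sizes."""
    out = []
    for size in reversed(dims):
        out.append(index % size)
        index //= size
    return tuple(reversed(out))


def row_peers(me, dims, axis):
    """TSPs sharing a row with `me` along `axis`: every coordinate equal except `axis`."""
    return [me[:axis] + (k,) + me[axis + 1:]
            for k in range(dims[axis]) if k != me[axis]]


def egocentric_mean(own, readings, delta):
    """Assumed convergence function: readings farther than `delta` from the
    local clock are replaced by the local value before averaging."""
    vals = [own] + [r if abs(r - own) <= delta else own for r in readings]
    return sum(vals) / len(vals)


def sync_round(clocks, dims, delta):
    """One round = m steps; step i gathers readings along dimension i only."""
    for axis in range(len(dims)):
        snapshot = dict(clocks)  # all TSPs read values from the start of the step
        for me in clocks:
            readings = [snapshot[p] for p in row_peers(me, dims, axis)]
            clocks[me] = egocentric_mean(snapshot[me], readings, delta)
    return clocks


def reads_per_round(dims):
    """Clock reads per round: row-wise multistep scheme vs. all-to-all reading."""
    n = 1
    for d in dims:
        n *= d
    multistep = n * sum(d - 1 for d in dims)
    all_to_all = n * (n - 1)
    return multistep, all_to_all


if __name__ == "__main__":
    dims = (4, 4, 4)  # 64 TSPs mapped into a 3-dimensional array
    print(reads_per_round(dims))
    clocks = {coords(i, dims): float(i) for i in range(64)}
    sync_round(clocks, dims, delta=1000.0)
    print(max(clocks.values()) - min(clocks.values()))
```

In this idealized setting each TSP reads only 9 clocks per round in a 4 x 4 x 4 array instead of 63, and with no faults and a generous `delta` the m row-wise averaging steps bring every clock to the same value. Actual message counts in an NCCN also depend on routing and on the reliable broadcast used, which this sketch does not model.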

Index Terms:
Clock synchronization, fault tolerance, interactive convergence, multicomputers.
Citation:
Marcelo Moraes de Azevedo, Douglas M. Blough, "Multistep Interactive Convergence: An Efficient Approach to the Fault-Tolerant Clock Synchronization of Large Multicomputers," IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 12, pp. 1195-1212, Dec. 1998, doi:10.1109/71.737696