The Community for Technology Leaders
RSS Icon
Issue No.03 - March (2013 vol.24)
pp: 493-505
Jichiang Tsai , National Chung Hsing University, Taichung
Most existing global-snapshot algorithms in distributed systems use control messages to coordinate the construction of a global snapshot among all processes. Since these algorithms typically assume the underlying logical overlay topology is fully connected, the number of control messages exchanged among the whole processes is proportional to the square of number of processes, resulting in higher possibility of network congestion. Hence, such algorithms are neither efficient nor scalable for a large-scale distributed system composed of a huge number of processes. Recently, some efforts have been presented to significantly reduce the number of control messages, but doing so incurs higher response time instead. In this paper, we propose an efficient global-snapshot algorithm able to let every process finish its local snapshot in a given number of rounds. Particularly, such an algorithm allows a tradeoff between the response time and the message complexity. Moreover, our global-snapshot algorithm is symmetrical in the sense that identical steps are executed by every process. This means that our algorithm is able to achieve better workload balance and less network congestion. Most importantly, based on our framework, we demonstrate that the minimum number of control messages required by a symmetrical global-snapshot algorithm is \Omega (N\log N), where N is the number of processes. Finally, we also assume non-FIFO channels.
Process control, Program processors, Time factors, Vectors, Algorithm design and analysis, Hypercubes, Complexity theory, checkpointing, Distributed systems, global snapshots, process symmetry, message passing
Jichiang Tsai, "Flexible Symmetrical Global-Snapshot Algorithms for Large-Scale Distributed Systems", IEEE Transactions on Parallel & Distributed Systems, vol.24, no. 3, pp. 493-505, March 2013, doi:10.1109/TPDS.2012.139
[1] K.M. Chandy and L. Lamport, "Distributed Snapshots: Determining Global States of Distributed Systems," ACM Trans. Computer Systems, vol. 3, no. 1, pp. 63-75, Feb. 1985.
[2] R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," IEEE Trans. Software Eng., vol. SE-13, no. 1, pp. 23-31, Jan. 1987.
[3] F. Mattern, "Algorithms for Distributed Termination Detection," Distributed Computing, vol. 2, pp. 161-75, 1987.
[4] K.M. Chandy, J. Misra, and L. Haas, "Distributed Deadlock Detection," ACM Trans. Computer Systems, vol. 1, no. 2, pp. 144-156, 1983.
[5] A.D. Kshemkalyani and M. Singhal, "Efficient Detection and Resolution of Generalized Distributed Deadlocks," IEEE Trans. Software Eng., vol. 20, no. 1, pp. 43-54, Jan. 1994.
[6] A.D. Kshemkalyani and B. Wu, "Detecting Arbitrary Stable Properties Using Efficient Snapshots," IEEE Trans. Software Eng., vol. 5, no. 33, pp. 330-346, May 2007.
[7] S. Agarwal, R. Garg, M. Gupta, and J. Moreira, "Adaptive Incremental Checkpointing for Massively Parallel Systems," Proc. Int'l Conf. Supercomputing, pp. 277-286, 2004.
[8] E.N. Elnozahy, L. Alvisi, Y.M. Wang, and D.B. Johnson, "A Survey of Rollback-Recovery Protocols in Message-Passing Systems," ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, Sept. 2002.
[9] K. Taylor, "The Role of Inhibition in Asynchronous Consistent-Cut Protocols," Proc. Third Int'l Workshop Distributed Algorithms, pp. 280-291, 1989.
[10] C. Critchlow and K. Taylor, "The Inhibition Spectrum and the Achievement of Causal Consistency," Proc. Ninth ACM Symp. Principles of Distributed Computing, pp. 31-42, 1990.
[11] J. Helary, "Observing Global States of Asynchronous Distributed Applications," Proc. Third Int'l Workshop Distributed Algorithms, pp. 45-56, 1989.
[12] T.-H. Lai and T. Yang, "On Distributed Snapshots," Information Processing Letters, vol. 25, no. 3, pp. 153-158, 1987.
[13] F. Mattern, "Efficient Algorithms for Distributed Snapshots and Global Virtual Time Approximation," J. Parallel and Distributed Computing, vol. 18, no. 4, pp. 423-434, Aug. 1993.
[14] A.D. Kshemkalyani, M. Raynal, and M. Singhal, "An Introduction to Snapshot Algorithms in Distributed Computing," Distributed Systems Eng., vol. 2, no. 4, pp. 224-233, Dec. 1995.
[15] R. Garg, V. Garg, and Y. Sabharwal, "Efficient Algorithms for Global Snapshots in Large Distributed Systems," IEEE Trans. Parallel and Distributed Systems, vol. 21, no. 5, pp. 620-630, May 2010.
[16] A.D. Kshemkalyani, "Fast and Message-Efficient Global Snapshot Algorithms for Large-Scale Distributed Systems," IEEE Trans. Parallel and Distributed Systems, vol. 21, no. 9, pp. 1281-1209, Sept. 2010.
[17] L. Lamport, "Time, Clocks and the Ordering of Events in a Distributed System," Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[18] B. Janssens and W.K. Fuchs, "Experimental Evaluation of Multiprocessor Cache-Based Error Recovery," Proc. Int'l Conf. Parallel Processing, pp. 505-508, 1991.
[19] D. Manivannan and M. Singhal, "Quasi-Synchronous Checkpointing: Models, Characterization, and Classification," IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 7, pp. 703-713, July 1999.
[20] J.M. Helary, A. Mostefaoui, and M. Raynal, "Communication-Induced Determination of Consistent Snapshots," IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 9, pp. 865-877, Sept. 1999.
[21] Q. Jiang, Y. Luo, and D. Manivannan, "An Optimistic Checkpointing and Message Logging Approach for Consistent Global Checkpoint Collection in Distributed Systems," J. Parallel and Distributed Computing, vol. 68, no. 12, pp. 1575-1589, 2008.
[22] V.T. Chakaravarthy, A.R. Choudhury, and Y. Sabharwal, "Improved Algorithms for the Distributed Trigger Counting Problem," Proc. IEEE Int'l Symp. Parallel and Distributed Processing, pp. 515-523, May 2011.
[23] V.K. Garg, Concurrent and Distributed Computing in Java. Wiley & Sons, 2004.
[24] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, second ed. Addison-Wesley, 2003.
30 ms
(Ver 2.0)

Marketing Automation Platform Marketing Automation Tool