Lazy Garbage Collection of Recovery State for Fault-Tolerant Distributed Shared Memory
October 2002 (vol. 13 no. 10)
pp. 1085-1098

Abstract: In this paper, we address the problem of garbage collection in a single-failure fault-tolerant home-based lazy release consistency (HLRC) distributed shared-memory (DSM) system based on independent checkpointing and logging. Our solution uses laziness in garbage collection and exploits consistency constraints of the HLRC memory model for low overhead and scalability. We prove safe bounds on the state that must be retained in the system to guarantee correct recovery after a failure. We devise two algorithms for garbage collection of checkpoints and logs: checkpoint garbage collection (CGC) and lazy log trimming (LLT). The proposed approach targets large-scale distributed shared-memory computing on local-area clusters of computers. In such systems, using global synchronization or extra communication for garbage collection is inefficient or simply impractical due to system scale and temporary disconnections in communication. The challenge lies in controlling the size of the logs and the number of checkpoints without global synchronization while tolerating transient disruptions in communication. Our garbage collection scheme is completely distributed, does not force processes to synchronize, does not add extra messages to the base DSM protocol, and uses only the available DSM protocol information. Evaluation results for real applications show that it effectively bounds the number of past checkpoints to be retained and the size of the logs in stable storage.

Index Terms:
Fault tolerance, distributed shared memory, checkpointing, log-based rollback recovery, garbage collection.
Citation:
Florin Sultan, Thu D. Nguyen, Liviu Iftode, "Lazy Garbage Collection of Recovery State for Fault-Tolerant Distributed Shared Memory," IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 10, pp. 1085-1098, Oct. 2002, doi:10.1109/TPDS.2002.1041885