A Survey of Recoverable Distributed Shared Virtual Memory Systems
September 1997 (vol. 8 no. 9)
pp. 959-969

Abstract: Distributed Shared Virtual Memory (DSVM) systems provide a shared memory abstraction on distributed memory architectures. Such systems ease parallel application programming because the shared-memory programming model is often more natural than the message-passing paradigm. However, the probability of failure of a DSVM increases with the number of sites. Thus, fault tolerance mechanisms must be implemented in order to allow processes to continue their execution in the event of a failure. This paper gives an overview of recoverable DSVMs (RDSVMs) that provide a checkpointing mechanism to restart parallel computations in the event of a site failure.

Index Terms:
Distributed systems, distributed shared virtual memory, availability, backward error recovery, consistent states.
Citation:
Christine Morin, Isabelle Puaut, "A Survey of Recoverable Distributed Shared Virtual Memory Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 9, pp. 959-969, Sept. 1997, doi:10.1109/71.615441