A Survey of Recoverable Distributed Shared Virtual Memory Systems
September 1997 (vol. 8 no. 9)
pp. 959-969

Abstract: Distributed Shared Virtual Memory (DSVM) systems provide a shared-memory abstraction on distributed-memory architectures. Such systems ease parallel application programming because the shared-memory programming model is often more natural than the message-passing paradigm. However, the probability that a DSVM suffers a failure increases with the number of sites. Fault tolerance mechanisms must therefore be implemented to allow processes to continue their execution in the event of a failure. This paper gives an overview of recoverable DSVMs (RDSVMs), which provide a checkpointing mechanism to restart parallel computations in the event of a site failure.

Index Terms:
Distributed systems, distributed shared virtual memory, availability, backward error recovery, consistent states.
Christine Morin, Isabelle Puaut, "A Survey of Recoverable Distributed Shared Virtual Memory Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 9, pp. 959-969, Sept. 1997, doi:10.1109/71.615441