Fault-Tolerant Distributed Shared Memory on a Broadcast-Based Architecture
December 2004 (vol. 15 no. 12)
pp. 1082-1092

Abstract: Advances in fiber optics and VLSI technology are making feasible interconnection networks that allow multiple simultaneous broadcasts. Distributed-shared-memory (DSM) implementations on such networks promise high performance even for applications with small granularity. This paper presents the architecture of one such implementation, the Simultaneous Optical Multiprocessor Exchange Bus, and examines the performance of augmented DSM protocols that exploit the natural duplication of data to maintain a recovery memory in each processing node and provide basic fault tolerance. Simulation results show that the additional data duplication necessary to create fault-tolerant DSM causes no reduction in system performance during normal operation and eliminates most of the overhead at checkpoint creation. Under certain conditions, data blocks that are duplicated to maintain the recovery memory are also used by the underlying DSM protocol, which reduces network traffic and significantly increases processor utilization.

Index Terms:
Multiprocessors, distributed-shared-memory, fault tolerance.
Constantine Katsinis, Diana Hecht, "Fault-Tolerant Distributed Shared Memory on a Broadcast-Based Architecture," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 12, pp. 1082-1092, Dec. 2004, doi:10.1109/TPDS.2004.83