Hardware Support for Flexible Distributed Shared Memory
October 1998 (vol. 47 no. 10)
pp. 1056-1072

Abstract—Workstation-based parallel systems are attractive due to their low cost and competitive uniprocessor performance. However, supporting a cache-coherent global address space on these systems involves significant overheads. We examine two approaches to coping with these overheads. First, DSM-specific hardware can be added to the off-the-shelf component base to reduce overheads. Second, application-specific coherence protocols can avoid some overheads by exploiting programmer (or compiler) knowledge of an application's communication patterns. To explore the interaction between these approaches, we simulated four designs that add DSM acceleration hardware to a collection of off-the-shelf workstation nodes. Three of the designs support user-level software coherence protocols, enabling application-specific protocol optimizations. To verify the feasibility of our hardware approach, we constructed a prototype of the simplest design. Measured speedups from the prototype match simulation results closely. We find that, even with aggressive DSM hardware support, custom protocols can provide significant speedups for some applications. In addition, the custom protocols are generally effective at reducing the impact of other overheads, including those due to less aggressive hardware support and larger network latencies. However, for three of our benchmarks, the additional hardware acceleration provided by our most aggressive design avoids the need to develop more efficient custom protocols.

Index Terms:
Parallel systems, distributed shared memory, cache coherence protocols, fine-grain cache coherence, coherence protocol optimization, workstation clusters.
Citation:
Steven K. Reinhardt, Robert W. Pfile, David A. Wood, "Hardware Support for Flexible Distributed Shared Memory," IEEE Transactions on Computers, vol. 47, no. 10, pp. 1056-1072, Oct. 1998, doi:10.1109/12.729790