Latency, Occupancy, and Bandwidth in DSM Multiprocessors: A Performance Evaluation
July 2003 (vol. 52 no. 7)
pp. 862-880

Abstract: Using commodity parts in the communication architecture of a DSM multiprocessor offers advantages in cost and design time, but the impact on application performance is unclear. We study this performance impact through detailed simulation, analytical modeling, and experiments on a flexible DSM prototype, using a range of parallel applications. We adapt the logP model to characterize the communication architectures of DSM machines. The l (network latency) and o (controller occupancy) parameters are the keys to performance in these machines, with the g (node-to-network bandwidth) parameter becoming important only for the fastest controllers. We show that, of all the logP parameters, controller occupancy has the greatest impact on application performance. Occupancy degrades performance in two ways: through the latency it adds and through the contention it induces. It is the contention component that governs performance regardless of network latency, showing a quadratic dependence on o. As expected, techniques to reduce the impact of latency make controller occupancy a greater bottleneck. Surprisingly, the performance impact of occupancy is substantial, even for highly tuned applications and even in the absence of latency-hiding techniques. Scaling the problem size is often used as a technique to overcome limitations in communication latency and bandwidth. Through experiments on a DSM prototype, we show that there are important classes of applications for which the performance lost by using higher-occupancy controllers cannot be regained easily, if at all, by scaling the problem size.
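The two ways occupancy costs time, as distinguished in the abstract, can be illustrated with a textbook queueing sketch. This is not the paper's analytical model. It assumes a single controller modeled as an M/D/1 queue (Poisson arrivals, fixed service time equal to the occupancy o), and the function names, the arrival rate, and the cycle counts are hypothetical values chosen only for illustration. Under those assumptions the mean queueing delay is rate * o^2 / (2 * (1 - rate * o)), which grows roughly quadratically in o at low utilization, while the directly added latency grows only linearly.

```python
def md1_wait(arrival_rate: float, occupancy: float) -> float:
    """Mean queueing delay of an M/D/1 queue (Pollaczek-Khinchine formula).

    arrival_rate: requests per cycle reaching the controller (assumed Poisson).
    occupancy:    cycles the controller is busy per request (assumed constant).
    """
    utilization = arrival_rate * occupancy
    if not 0 <= utilization < 1:
        raise ValueError("controller saturated: utilization must be in [0, 1)")
    return arrival_rate * occupancy ** 2 / (2 * (1 - utilization))


def request_time(latency: float, occupancy: float, arrival_rate: float) -> float:
    """Network latency + occupancy on the critical path + contention delay."""
    return latency + occupancy + md1_wait(arrival_rate, occupancy)


if __name__ == "__main__":
    rate = 0.01  # hypothetical: one request every 100 cycles on average
    for o in (10, 20, 40):
        print(f"o={o:3d}  added latency={o:3d}  contention={md1_wait(rate, o):6.2f}")
```

In this sketch, doubling o from 10 to 20 cycles doubles the directly added latency but increases the contention term by more than a factor of four, because the utilization term in the denominator also shrinks.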

Index Terms:
Occupancy, distributed shared memory multiprocessors, communication controller, latency, bandwidth, queuing model, flexible node controller.
Mainak Chaudhuri, Mark Heinrich, Chris Holt, Jaswinder Pal Singh, Edward Rothberg, John Hennessy, "Latency, Occupancy, and Bandwidth in DSM Multiprocessors: A Performance Evaluation," IEEE Transactions on Computers, vol. 52, no. 7, pp. 862-880, July 2003, doi:10.1109/TC.2003.1214336