The Impact of Negative Acknowledgments in Shared Memory Scientific Applications
February 2004 (vol. 15 no. 2)
pp. 134-150
Abstract: Negative ACKnowledgments (NACKs) and subsequent retries, used to resolve races and to enforce a total order among shared memory accesses in distributed shared memory (DSM) multiprocessors, not only introduce extra network traffic and contention, but also increase node controller occupancy, especially at the home. In this paper, we present possible protocol optimizations to minimize these retries and offer a thorough study of the performance effects of these messages on six scalable scientific applications running on 64-node systems and larger. To eliminate NACKs, we present a mechanism to queue pending requests at the main memory of the home node and augment it with a novel technique of combining pending read requests, thereby accelerating the parallel execution for 64 nodes by as much as 41 percent (a speedup of 1.41) compared to a modified version of the SGI Origin 2000 protocol. We further design and evaluate a protocol by combining this mechanism with a technique that we call write string forwarding, used in the AlphaServer GS320 and Piranha systems. We find that without careful design considerations, especially regarding atomic read-modify-write operations, this aggressive write forwarding can hurt performance. We identify and evaluate the necessary micro-architectural support to solve this problem. We compare the performance of these novel NACK-free protocols with a base bitvector protocol, a modified version of the SGI Origin 2000 protocol, and a NACK-free protocol that uses dirty sharing and write string forwarding as in the Piranha system. To understand the effects of network speed and topology, the evaluation is carried out on three network configurations.
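The queueing idea in the abstract can be illustrated with a toy model. The sketch below is not the protocol from the paper; it is a minimal illustration under stated assumptions: a single cache line, a home node that is either idle or busy with one transaction, and a FIFO of pending requests in which only reads at the head of the queue are combined into one reply. The names `HomeLine`, `request`, `complete`, and `use_queue` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HomeLine:
    """Toy model of one cache line's state at its home node (hypothetical)."""
    busy: bool = False
    pending: deque = field(default_factory=deque)  # queued (kind, node) pairs
    nacks_sent: int = 0

    def request(self, kind, node, use_queue=True):
        """Handle a 'read' or 'write' from `node` and report what the home does."""
        if not self.busy:
            self.busy = True  # line stays busy until complete() is called
            return ("serve", kind, [node])
        if use_queue:
            # NACK-free variant: remember the request instead of bouncing it.
            self.pending.append((kind, node))
            return ("queued", kind, [node])
        # Baseline variant: reject the request; the requester must retry.
        self.nacks_sent += 1
        return ("nack", kind, [node])

    def complete(self):
        """The in-flight transaction finished; pick the next work item, if any."""
        if not self.pending:
            self.busy = False
            return None
        kind, node = self.pending.popleft()
        nodes = [node]
        if kind == "read":
            # Assumption: combine only the reads at the head of the queue,
            # so queued writes keep their position in the order.
            while self.pending and self.pending[0][0] == "read":
                nodes.append(self.pending.popleft()[1])
        return ("serve", kind, nodes)
```

In this sketch, two reads queued behind an in-flight write are answered together by one `complete()` call, while a later write still waits its turn; with `use_queue=False` the same reads would each cost a NACK and a retry.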
Index Terms:
Distributed shared memory, cache coherence protocol, negative acknowledgment, node controller occupancy.
Citation:
Mainak Chaudhuri, Mark Heinrich, "The Impact of Negative Acknowledgments in Shared Memory Scientific Applications," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 2, pp. 134-150, Feb. 2004, doi:10.1109/TPDS.2004.1264797