Switch MSHR: A Technique to Reduce Remote Read Memory Access Time in CC-NUMA Multiprocessors
May 2003 (vol. 52 no. 5)
pp. 617-632

Abstract: Remote memory access poses a severe problem for the design of CC-NUMA multiprocessors because it takes an order of magnitude longer than a local memory access. The large latency arises partly from the increased distance between the processor and remote memory over the interconnection network. In this paper, we develop a new switch architecture, called Switch MSHR (SMSHR), which provides the cache block to requesting processors without their requests having to travel to the home memory. The SMSHR idea is based on providing a few miss status holding registers (MSHRs) in each switch to keep track of read requests to memory. The SMSHR blocks secondary requests to the same memory block at the switch and provides them with a copy of the block when the reply to the primary request returns. The SMSHR design is then extended to include a switch cache, which can temporarily save a copy of the data block for later use. We provide basic block designs for the SMSHR and SMSHR+cache architectures. We explore the design space by modeling the new switch architectures in a detailed execution-driven simulator and analyzing their performance benefits. Our simulation results show that applications with a high degree of data sharing benefit substantially from the SMSHR and SMSHR+cache techniques.
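
The request-merging behavior described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' design: the class name `SwitchMSHR`, its methods, the `num_entries` parameter, and the policy of simply forwarding requests when no entry is free are all assumptions made for illustration. Coherence actions and the switch-cache extension are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class SwitchMSHR:
    """Toy model of one switch with a small MSHR file (hypothetical names)."""

    num_entries: int = 4  # assumed size; the abstract only says "a few"
    # block address -> processors whose secondary requests are held here
    pending: dict = field(default_factory=dict)

    def on_read_request(self, block_addr, requester):
        """Return True if the request must be forwarded toward home memory."""
        if block_addr in self.pending:
            # Secondary request: block it at the switch instead of forwarding.
            self.pending[block_addr].append(requester)
            return False
        if len(self.pending) < self.num_entries:
            # Primary request: allocate an entry, then let the request go on.
            self.pending[block_addr] = []
        # Primary request, or no free entry (assumed policy): forward it.
        return True

    def on_read_reply(self, block_addr, data):
        """Free the entry and return (processor, data) copies for the waiters.

        The reply itself continues to the primary requester by normal routing.
        """
        waiters = self.pending.pop(block_addr, [])
        return [(proc, data) for proc in waiters]


switch = SwitchMSHR(num_entries=2)
forwarded = [switch.on_read_request(0x40, p) for p in ("P0", "P1", "P2")]
copies = switch.on_read_reply(0x40, "block-data")
print(forwarded)  # only the first (primary) request leaves the switch
print(copies)     # P1 and P2 are served from the returning reply
```

In this sketch only the primary request travels to the home memory; the two secondary requests are answered at the switch when the reply passes through, which is the latency saving the abstract describes.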

Index Terms:
CC-NUMA multiprocessor, memory latency problem, interconnection network, miss status holding register, execution-driven simulation.
Laxmi Narayan Bhuyan, Hujun Wang, "Switch MSHR: A Technique to Reduce Remote Read Memory Access Time in CC-NUMA Multiprocessors," IEEE Transactions on Computers, vol. 52, no. 5, pp. 617-632, May 2003, doi:10.1109/TC.2003.1197128