Design and Evaluation of a Switch Cache Architecture for CC-NUMA Multiprocessors
August 2000 (vol. 49 no. 8)
pp. 779-797

Abstract: Cache coherent nonuniform memory access (CC-NUMA) multiprocessors provide a scalable design for shared memory. However, they continue to suffer from large remote memory access latencies due to comparatively slow memory technology and large data transfer latencies in the interconnection network. In this paper, we propose a novel hardware caching technique, called switch cache, to improve the remote memory access performance of CC-NUMA multiprocessors. The main idea is to implement small, fast caches in the crossbar switches of the interconnect medium to capture and store shared data as they flow from the memory module to the requesting processor. The stored data can then serve subsequent requests, reducing the number of remote memory accesses. The implementation of a cache in a crossbar switch needs to be efficient and robust, yet flexible enough to accommodate changes in the caching protocol. The design and implementation details of a CAche Embedded Switch ARchitecture, CAESAR, using wormhole routing with virtual channels are presented. We explore the design space of switch caches by modeling CAESAR in a detailed execution-driven simulator and analyze the performance benefits. Our results show that the CAESAR switch cache reduces remote memory accesses by up to 45 percent for some applications. By serving remote read requests at various stages in the interconnect, we observe improvements in execution time as high as 20 percent for these applications. We conclude that switch caches provide a cost-effective solution for designing high performance CC-NUMA multiprocessors.
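The capture-and-serve idea in the abstract can be illustrated with a minimal Python sketch. This is not the CAESAR design: the names `SwitchCache` and `remote_read`, the LRU replacement policy, the hop counting, and the `invalidate` hook are all assumptions made for illustration. The abstract states only that switches capture shared data flowing from memory to the requester and serve later read requests.

```python
from collections import OrderedDict


class SwitchCache:
    """Toy model of a small cache inside one switch (hypothetical, LRU assumed)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block address -> data, oldest first

    def capture(self, addr, data):
        # Store a block seen in a reply passing through this switch.
        self.blocks[addr] = data
        self.blocks.move_to_end(addr)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used

    def contains(self, addr):
        return addr in self.blocks

    def read(self, addr):
        self.blocks.move_to_end(addr)  # refresh LRU position on a hit
        return self.blocks[addr]

    def invalidate(self, addr):
        # Assumed coherence hook: drop the copy when the block is written.
        self.blocks.pop(addr, None)


def remote_read(addr, path, memory):
    """Send a read along `path` (switches ordered from requester to home memory).

    Returns (data, hops), where hops counts the switches visited plus one
    extra hop if the request had to reach the memory module.
    """
    for hops, switch in enumerate(path, start=1):
        if switch.contains(addr):
            return switch.read(addr), hops  # served inside the interconnect
    data = memory[addr]
    for switch in path:  # the reply flows back and is captured on the way
        switch.capture(addr, data)
    return data, len(path) + 1


# Two processors reach the same memory through one shared second-stage switch.
memory = {0x40: "block-data"}
stage1_a, stage1_b, shared = SwitchCache(4), SwitchCache(4), SwitchCache(4)
first = remote_read(0x40, [stage1_a, shared], memory)   # goes to memory
second = remote_read(0x40, [stage1_b, shared], memory)  # hits in `shared`
```

In this sketch the second processor's request stops at the shared switch instead of travelling to the memory module, which is the effect the abstract attributes to serving remote read requests at various stages in the interconnect.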


Index Terms:
Crossbar switches, cache architectures, scalable interconnects, wormhole routing, shared memory multiprocessors, execution-driven simulation.
Ravishankar R. Iyer, Laxmi N. Bhuyan, "Design and Evaluation of a Switch Cache Architecture for CC-NUMA Multiprocessors," IEEE Transactions on Computers, vol. 49, no. 8, pp. 779-797, Aug. 2000, doi:10.1109/12.868025