An Architecture for High-Performance Scalable Shared-Memory Multiprocessors Exploiting On-Chip Integration
August 2004 (vol. 15 no. 8)
pp. 755-768
José González, IEEE Computer Society
José Duato, IEEE

Abstract: Recent technology improvements allow multiprocessor designers to put some key components inside the processor chip, such as the memory controller, the coherence hardware, and the network interface/router. In this paper, we exploit this scale of integration, presenting a novel node architecture aimed at reducing the long L2 miss latencies and the directory memory overhead that characterize cc-NUMA machines and limit their scalability. Our proposal replaces the traditional directory with a novel three-level directory architecture and adds a small shared data cache to each node of a multiprocessor system. Due to their small size, the first-level directory and the shared data cache are integrated into the processor chip in every node, which enhances performance by saving accesses to the slower main memory. Scalability is guaranteed by keeping the second- and third-level directories off the processor chip and by using compressed data structures. A taxonomy of L2 misses, according to the actions the directory performs to satisfy them, is also presented. Using execution-driven simulations, we show that the proposed node architecture yields significant latency reductions, which translate into reductions in application execution time of more than 30 percent in several cases.
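As a rough illustration of why an on-chip first-level directory can shorten L2 miss handling, the following minimal sketch models a directory lookup that falls through three levels. It is not the authors' design: the class name, the serial lookup order, and the latency values are all assumptions made for illustration, and the abstract does not specify the organization or the compression scheme of each level.

```python
from dataclasses import dataclass, field

# Hypothetical per-level access costs in arbitrary cycles.
# These are illustrative assumptions, not figures from the paper.
L1_LATENCY = 1    # first-level directory, assumed on the processor chip
L2_LATENCY = 10   # second-level directory, assumed off-chip
L3_LATENCY = 50   # third-level directory, assumed off-chip and complete


@dataclass
class ThreeLevelDirectory:
    """Toy model: each level maps a block address to its set of sharers."""
    l1: dict = field(default_factory=dict)
    l2: dict = field(default_factory=dict)
    l3: dict = field(default_factory=dict)

    def lookup(self, block):
        """Return (sharers, latency), probing the levels in order (assumed serial)."""
        if block in self.l1:
            return self.l1[block], L1_LATENCY
        if block in self.l2:
            return self.l2[block], L1_LATENCY + L2_LATENCY
        # The last level is treated as complete: an absent block has no sharers.
        sharers = self.l3.get(block, frozenset())
        return sharers, L1_LATENCY + L2_LATENCY + L3_LATENCY


directory = ThreeLevelDirectory()
directory.l1[0x10] = frozenset({1})
directory.l3[0x20] = frozenset({2, 3})
fast = directory.lookup(0x10)   # served by the on-chip level
slow = directory.lookup(0x20)   # falls through to the third level
```

In this toy model, a lookup served by the first level pays only the on-chip cost, while one that falls through to the third level pays the sum of all three, which is the kind of gap the proposed integration aims to exploit.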

Index Terms:
cc-NUMA multiprocessor, directory memory overhead, L2 miss latency, three-level directory, shared data cache, on-processor-chip integration.
Citation:
Manuel E. Acacio, José González, José M. García, José Duato, "An Architecture for High-Performance Scalable Shared-Memory Multiprocessors Exploiting On-Chip Integration," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 8, pp. 755-768, Aug. 2004, doi:10.1109/TPDS.2004.27