Architectural Support for Uniprocessor and Multiprocessor Active Memory Systems
March 2004 (vol. 53 no. 3)
pp. 288-307

Abstract—We introduce an architectural approach to improve memory system performance in both uniprocessor and multiprocessor systems. The architectural innovation is a flexible active memory controller backed by specialized cache coherence protocols that permit the transparent use of address remapping techniques. The resulting system shows significant performance improvement across a spectrum of machine configurations, from uniprocessors through single-node multiprocessors (SMPs) to distributed shared memory clusters (DSMs). Address remapping techniques exploit the data access patterns of applications to enhance their cache performance. However, they create coherence problems since the processor is allowed to refer to the same data via more than one address. While most active memory implementations require cache flushes, we present a new approach to solve the coherence problem. We leverage and extend the cache coherence protocol so that our techniques work transparently to efficiently support uniprocessor, SMP and DSM active memory systems. We detail the coherence protocol extensions to support our active memory techniques and present simulation results that show uniprocessor speedup from 1.3 to 7.6 on a range of applications and microbenchmarks. We also show remarkable performance improvement on small to medium-scale SMP and DSM multiprocessors, allowing some parallel applications to continue to scale long after their performance levels off on normal systems.
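
The coherence problem described in the abstract can be illustrated with a toy model. The sketch below is not the paper's controller or protocol: it assumes a hypothetical "shadow" index space that presents a transposed view of a small row-major matrix, and a deliberately naive cache keyed by whichever address the processor used. The names `shadow_to_normal`, `load`, and `store` are invented for this illustration.

```python
N = 4  # matrix dimension (illustrative assumption)

def shadow_to_normal(s, n=N):
    """Map a shadow (transposed-view) index to the normal row-major index."""
    row, col = divmod(s, n)   # position within the transposed view
    return col * n + row      # same element in the original layout

# A[i][j] lives at i*N + j; each cell holds its own index for easy checking.
memory = list(range(N * N))

# Column 1 of A: stride-N through normal addresses, stride-1 through shadow ones.
column_normal = [memory[i * N + 1] for i in range(N)]
column_shadow = [memory[shadow_to_normal(1 * N + i)] for i in range(N)]

# Naive cache keyed by (address space, address), with no coherence between
# the two names of the same datum.
cache = {}

def load(space, addr):
    key = (space, addr)
    if key not in cache:
        phys = shadow_to_normal(addr) if space == "shadow" else addr
        cache[key] = memory[phys]
    return cache[key]

def store(space, addr, value):
    phys = shadow_to_normal(addr) if space == "shadow" else addr
    cache[(space, addr)] = value
    memory[phys] = value      # write-through, to keep the sketch short

before = load("shadow", 1 * N + 0)              # A[0][1] via the shadow view
store("normal", 0 * N + 1, 99)                  # update A[0][1] via its normal address
stale = load("shadow", 1 * N + 0)               # shadow copy was never invalidated
fresh = memory[shadow_to_normal(1 * N + 0)]     # what memory actually holds
```

In this model the shadow view turns a strided column walk into a contiguous one, which is the cache-performance benefit the abstract attributes to address remapping. The final two reads differ because nothing ties the two cached names together, which is the gap that cache flushes or coherence protocol extensions must close.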

Index Terms:
Active memory systems, address remapping, cache coherence protocol, distributed shared memory, flexible memory controller architecture.
Daehyun Kim, Mainak Chaudhuri, Mark Heinrich, Evan Speight, "Architectural Support for Uniprocessor and Multiprocessor Active Memory Systems," IEEE Transactions on Computers, vol. 53, no. 3, pp. 288-307, March 2004, doi:10.1109/TC.2004.1261836