The Impulse Memory Controller
November 2001 (vol. 50 no. 11)
pp. 1117-1132

Abstract: Impulse is a memory system architecture that adds an optional level of address indirection at the memory controller. Applications can use this indirection to remap their data structures in memory, controlling how their data is accessed and cached, which can improve cache and bus utilization. Because all of the functionality resides at the memory controller, the Impulse design requires no modification to processor, cache, or bus designs and can be adopted in conventional systems without major system changes. We describe the design of the Impulse architecture and how an Impulse memory system can be used in a variety of ways to improve the performance of memory-bound applications. Impulse can be used to dynamically create superpages cheaply, to dynamically recolor physical pages, to perform strided fetches, and to perform gathers and scatters through indirection vectors. Our performance results demonstrate the effectiveness of these optimizations in a variety of scenarios: Impulse speeds up a range of applications by between 20 percent and more than a factor of 5. The OS can also use Impulse for dynamic superpage creation; the best Impulse-based policy for creating superpages outperforms previously known superpage creation policies.
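
As a rough illustration of the access patterns named in the abstract, the sketch below models in plain Python what a strided fetch and a gather or scatter through an indirection vector compute. It is a software analogy only, not the Impulse interface: the function names (`strided_view`, `gather_view`, `scatter`) and the flat-list model of memory are assumptions made for this example. The idea it tries to convey is that the application sees a dense sequence while the underlying elements are spread out in memory.

```python
# Software model of remapped access patterns (illustrative only; these
# function names are hypothetical and are not part of any Impulse API).

def strided_view(memory, base, stride, count):
    """Return the dense sequence seen through a strided remapping:
    element i comes from memory[base + i * stride]."""
    return [memory[base + i * stride] for i in range(count)]

def gather_view(memory, base, indices):
    """Return the dense sequence seen through an indirection vector:
    element i comes from memory[base + indices[i]]."""
    return [memory[base + j] for j in indices]

def scatter(memory, base, indices, values):
    """Write values back through the indirection vector (in place)."""
    for j, v in zip(indices, values):
        memory[base + j] = v

# A 4x4 row-major matrix stored flat; its diagonal has stride 5.
matrix = list(range(16))
diagonal = strided_view(matrix, base=0, stride=5, count=4)

# Gather three scattered elements, double them, and scatter them back.
idx = [3, 9, 14]
picked = gather_view(matrix, 0, idx)
scatter(matrix, 0, idx, [2 * v for v in picked])
```

In this model the dense result is built by copying. The abstract's point is that performing the remapping at the memory controller lets the processor's cache and bus carry only the elements that are actually used.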

Index Terms:
Computer architecture, memory systems.
Citation:
Lixin Zhang, Zhen Fang, Mike Parker, Binu K. Mathew, Lambert Schaelicke, John B. Carter, Wilson C. Hsieh, Sally A. McKee, "The Impulse Memory Controller," IEEE Transactions on Computers, vol. 50, no. 11, pp. 1117-1132, Nov. 2001, doi:10.1109/12.966490