Moving Address Translation Closer to Memory in Distributed Shared-Memory Multiprocessors
July 2005 (vol. 16 no. 7)
pp. 612-623

Abstract: To support a global virtual memory space, an architecture must translate virtual addresses dynamically. In current processors, the translation is done in a TLB (Translation Lookaside Buffer), before or in parallel with the first-level cache access. As processor technology improves rapidly and the working sets of new applications keep growing, the latency and bandwidth demands on the TLB are difficult to meet, especially in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We describe and compare five options for virtual address translation in the context of Distributed Shared Memory (DSM) multiprocessors, including CC-NUMAs (Cache-Coherent Non-Uniform Memory Access Architectures) and COMAs (Cache-Only Memory Architectures). In CC-NUMAs, moving the TLB to shared memory is a bad idea because page placement, migration, and replication are all constrained by the virtual page address, which greatly affects processor node access locality. In COMAs, the allocation of pages to processor nodes is not as critical because memory blocks can migrate and replicate freely among nodes. As address translation is done deeper in the memory hierarchy, the frequency of translations drops because of the filtering effect of the caches above it. We also observe that the TLB is very effective when it is merged with the shared memory, because of sharing and prefetching effects and because there is no need to maintain TLB consistency. Although a TLB merged with the shared memory is very effective, we also show that the TLB can be removed altogether in a system with address translation done in memory, because the frequency of translations is very low.
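
The filtering effect mentioned in the abstract can be illustrated with a toy model. The sketch below is not the simulator used in the paper. It assumes a hypothetical direct-mapped, virtually tagged cache and a synthetic trace that sweeps repeatedly over a small working set. The names `count_translations` and `make_trace`, and all sizes, are illustrative assumptions. It counts how many translations are needed when translation happens before the cache (one per access) and when it happens below the cache (one per cache miss).

```python
def count_translations(trace, cache_lines=64, block_size=32):
    """Return (translations_before_cache, translations_below_cache).

    Toy model (an assumption, not the paper's setup): a direct-mapped
    cache indexed and tagged with virtual block numbers.
    """
    tags = [None] * cache_lines
    before = 0  # translate on every access (TLB in front of the cache)
    below = 0   # translate only on a miss (TLB below the cache)
    for vaddr in trace:
        before += 1
        block = vaddr // block_size
        index = block % cache_lines
        if tags[index] != block:
            # Miss: the request leaves the cache, so it must be translated.
            tags[index] = block
            below += 1
    return before, below


def make_trace(working_set_bytes=1024, passes=10, stride=4):
    """Synthetic trace: repeated sequential sweeps over one working set."""
    return [addr
            for _ in range(passes)
            for addr in range(0, working_set_bytes, stride)]


if __name__ == "__main__":
    before, below = count_translations(make_trace())
    print(f"translations before cache: {before}")
    print(f"translations below cache:  {below}")
```

With these assumed parameters the working set fits in the cache, so only the cold misses of the first sweep need a translation below the cache. Real workloads have conflict and coherence misses, so the reduction would be smaller than in this idealized case.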

Index Terms:
Multiprocessors, distributed shared memory, virtual memory, simulations, dynamic address translation, virtual-address caches.
Citation:
Xiaogang Qiu, Michel Dubois, "Moving Address Translation Closer to Memory in Distributed Shared-Memory Multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 7, pp. 612-623, July 2005, doi:10.1109/TPDS.2005.84