Issue No.03 - March (2010 vol.59)
pp: 332-344
Yuho Jin, University of Southern California, Los Angeles
Eun Jung Kim, Texas A&M University, College Station
Ki Hwan Yum, University of Texas at San Antonio, San Antonio
ABSTRACT
Switched networks have been adopted for on-chip communication because of their scalability and efficient resource sharing. However, using a general-purpose network for a specific domain may result in unnecessarily high cost and low performance when the interconnects are not optimized for that domain. Designing an optimal network for a specific domain is challenging because it requires in-depth knowledge of both the interconnects and the application domain. Recently proposed Nonuniform Cache Architectures (NUCAs) use wormhole-routed 2D mesh networks in L2 caches. We observe that in NUCAs, network resources are underutilized despite their considerable area cost (41 percent of the cache area), and that the network delay is large (63 percent of the cache access time). Motivated by these observations, we investigate both router architecture and network topology with respect to the communication behavior of large-scale cache systems. We present Fast-LRU replacement, in which cache replacement overlaps with data request delivery. Next, we propose a deadlock-free XYX routing algorithm for a mesh network and present a new halo network topology that reduces the number of required links. Finally, we introduce a single-cycle multicast router that requires only small modifications to the unicast router design. Simulation results show that our design improves the average IPC by 38 percent over the mesh design with Multicast Promotion replacement and uses 12 percent of the interconnection area of the mesh network.
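For background only: the abstract does not describe how XYX routing works, so the sketch below shows conventional XY dimension-order routing on a 2D mesh, a standard deadlock-free scheme for meshes, rather than the authors' algorithm. The function names and the (x, y) coordinate convention are assumptions made for illustration and do not come from the paper.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst on a 2D mesh.

    Illustrative XY dimension-order routing (NOT the paper's XYX scheme):
    the packet first travels along the X dimension until its column
    matches the destination, then travels along the Y dimension.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Phase 1: correct the X coordinate one hop at a time.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Phase 2: correct the Y coordinate one hop at a time.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


def hop_count(src, dst):
    """Number of links traversed, which equals the Manhattan distance."""
    return len(xy_route(src, dst)) - 1
```

Because every packet finishes its X hops before taking any Y hop, no packet ever turns from Y back to X, which is what removes the cyclic channel dependencies that cause deadlock in wormhole-routed meshes.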
INDEX TERMS
On-chip interconnection networks, nonuniform cache architecture, domain-specific design.
CITATION
Yuho Jin, Eun Jung Kim, Ki Hwan Yum, "Design and Analysis of On-Chip Networks for Large-Scale Cache Systems," IEEE Transactions on Computers, vol. 59, no. 3, pp. 332-344, March 2010, doi:10.1109/TC.2009.130