Issue No.06 - June (2012 vol.23)
pp: 1038-1046
Ahmed Abousamra, University of Pittsburgh, Pittsburgh
Alex K. Jones, University of Pittsburgh, Pittsburgh
Rami Melhem, University of Pittsburgh, Pittsburgh
ABSTRACT
Reducing data access latency is vital to achieving performance improvements in computing. For chip multiprocessors (CMPs), data access latency depends on the organization of the memory hierarchy, the on-chip interconnect, and the running workload. Several network-on-chip (NoC) designs exploit communication locality to reduce communication latency by configuring special fast paths or circuits on which communication is faster than on the rest of the NoC. However, communication patterns are directly affected by the cache organization, and many cache organizations are designed in isolation from the underlying NoC or assume a simple NoC design, and may therefore miss optimization opportunities. In this work, we codesign the NoC and the cache organization. First, we propose a hybrid circuit/packet-switched NoC that exploits communication locality through periodic configuration of the most beneficial circuits. Second, we design a Unique Private (UP) caching scheme targeting the class of interconnects that exploit communication locality to reduce communication latency. The Unique Private cache stores data accessed mostly by a given processor core in that core's locally accessible cache bank, while leveraging dedicated high-speed circuits in the interconnect to provide remote cores with fast access to shared data. Simulations of a suite of scientific and commercial workloads show that our proposed design achieves speedups of 15.2 percent on a 16-core CMP and 14 percent on a 64-core CMP over the state-of-the-art NoC-cache codesigned system that also exploits communication locality in multithreaded applications.
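The idea of periodically configuring the most beneficial circuits can be illustrated with a minimal sketch. It is not the authors' mechanism: the abstract does not say how benefit is measured, so the sketch assumes that the busiest source/destination pairs in an epoch are the most beneficial ones. The names `select_circuits`, `route`, and `max_circuits` are hypothetical.

```python
from collections import Counter


def select_circuits(traffic, max_circuits):
    """Return the busiest (src, dst) pairs observed in one epoch.

    Assumption (not from the paper): "most beneficial" is approximated
    by message count per source/destination pair.
    """
    counts = Counter(traffic)
    # Sort by descending count, then by pair, so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {pair for pair, _ in ranked[:max_circuits]}


def route(pair, circuits):
    """Pick the switching mode for one message in the next epoch."""
    return "circuit" if pair in circuits else "packet"


# One epoch of hypothetical (source core, destination core) messages.
epoch = [(0, 5), (0, 5), (0, 5), (2, 7), (2, 7), (1, 3)]
circuits = select_circuits(epoch, max_circuits=2)
print(sorted(circuits))         # [(0, 5), (2, 7)]
print(route((0, 5), circuits))  # circuit
print(route((1, 3), circuits))  # packet
```

In this sketch, pairs that communicate often get a fast circuit for the next epoch, and all other traffic falls back to ordinary packet switching.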
INDEX TERMS
Multicore/single-chip multiprocessors, circuit-switching networks, cache memories.
CITATION
Ahmed Abousamra, Alex K. Jones, Rami Melhem, "Codesign of NoC and Cache Organization for Reducing Access Latency in Chip Multiprocessors", IEEE Transactions on Parallel & Distributed Systems, vol. 23, no. 6, pp. 1038-1046, June 2012, doi:10.1109/TPDS.2011.238