Balancing Performance and Cost in CMP Interconnection Networks
March 2012 (vol. 23 no. 3)
pp. 452-459
Pablo Abad, Universidad de Cantabria, Santander
Valentin Puente, Universidad de Cantabria, Santander
José Angel Gregorio, Universidad de Cantabria, Santander
This paper presents an innovative router design, called Rotary Router, which successfully addresses CMP cost/performance constraints. The router structure is based on two independent rings, which force packets to circulate either clockwise or counterclockwise, traveling through every port of the router. These two rings constitute a completely decentralized arbitration scheme that enables a simple, but efficient way to connect every input port to every output port. The proposed router is able to avoid network deadlock, livelock, and starvation without requiring data-path modifications. The organization of the router permits the inclusion of throughput enhancement techniques without significantly penalizing the implementation cost. In particular, the router performs adaptive routing, eliminates HOL blocking, and carries out implicit congestion control using simple arbitration and buffering strategies. Additionally, the proposal is capable of avoiding end-to-end deadlock at coherence protocol level with no physical or virtual resource replication, while guaranteeing in-order packet delivery. This facilitates router management and improves storage utilization. Using a comprehensive evaluation framework that includes full-system simulation and hardware description, the proposal is compared with two representative router counterparts. The results obtained demonstrate the Rotary Router's substantial performance and efficiency advantages.

Index Terms:
Rotary Router, router architecture, interconnection networks, chip multiprocessors, coherence protocol, routing deadlock, coherence protocol deadlock.
Pablo Abad, Valentin Puente, José Angel Gregorio, "Balancing Performance and Cost in CMP Interconnection Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 3, pp. 452-459, March 2012, doi:10.1109/TPDS.2011.173