Heterogeneous Interconnects for Energy-Efficient Message Management in CMPs
January 2010 (vol. 59 no. 1)
pp. 16-28
Antonio Flores, University of Murcia, Murcia
Juan L. Aragón, University of Murcia, Murcia
Manuel E. Acacio, University of Murcia, Murcia
Continuous improvements in integration scale have led major microprocessor vendors to move to designs that integrate several processing cores on the same chip. Chip multiprocessors (CMPs) constitute a good alternative to traditional monolithic designs for several reasons, including better performance, scalability, and performance/energy ratio. On the other hand, higher clock frequencies and increasing transistor density have revealed power dissipation and temperature as critical design issues in current and future architectures. Previous studies have shown that the interconnection network of a CMP has a significant impact on both overall performance and energy consumption. Moreover, the wires used in such an interconnect can be designed with varying latency, bandwidth, and power characteristics. In this work, we show how messages can be efficiently managed, from the point of view of both performance and energy, in tiled CMPs using a heterogeneous interconnect. Our proposal consists of two approaches. The first is Reply Partitioning, a technique that splits replies with data into a short Partial Reply message, which carries a subblock of the cache line that includes the word requested by the processor, plus an Ordinary Reply with the full cache line. This technique allows all messages used to ensure coherence between the L1 caches of a CMP to be classified into two groups: critical and short, and noncritical and long. The second approach is the use of a heterogeneous interconnection network composed of low-latency wires for critical messages and low-energy wires for noncritical ones. Detailed simulations of 8- and 16-core CMPs show that our proposal obtains average savings of 7 percent in execution time and 70 percent in the Energy-Delay squared Product (ED^2P) metric of the interconnect over previous work (from 24 to 30 percent average ED^2P improvement for the full CMP). Additionally, the sensitivity analysis shows that although execution time is minimized for subblocks of 16 bytes, the best choice from the point of view of the ED^2P metric is the 4-byte subblock configuration, which gives an additional improvement of 2 percent over the 16-byte one in the ED^2P metric of the full CMP.
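As an illustration only, the Reply Partitioning idea and the ED^2P metric described above can be sketched in a few lines of Python. This is not the authors' implementation: the 64-byte cache line, the `Message` type, the function names, and the wire-class labels are assumptions made for the example. Only the split into a short Partial Reply plus a full Ordinary Reply, the 4- and 16-byte subblock sizes, and the mapping of critical messages to low-latency wires and noncritical ones to low-energy wires come from the abstract.

```python
from dataclasses import dataclass

LINE_BYTES = 64      # assumed cache-line size (not stated in the abstract)
SUBBLOCK_BYTES = 16  # one of the subblock sizes the abstract evaluates


@dataclass(frozen=True)
class Message:
    """Hypothetical coherence message used only for this sketch."""
    kind: str       # "PartialReply" or "OrdinaryReply"
    addr: int       # address of the first byte carried
    payload: bytes  # data carried by the message
    wires: str      # "low-latency" (critical) or "low-energy" (noncritical)


def partition_reply(line_addr, line, requested_addr,
                    subblock_bytes=SUBBLOCK_BYTES):
    """Split a data reply into a short critical part and a long noncritical part."""
    if not line_addr <= requested_addr < line_addr + len(line):
        raise ValueError("requested address is outside the cache line")
    # Align the requested offset down to the start of its subblock.
    offset = requested_addr - line_addr
    start = (offset // subblock_bytes) * subblock_bytes
    partial = Message("PartialReply", line_addr + start,
                      line[start:start + subblock_bytes], "low-latency")
    ordinary = Message("OrdinaryReply", line_addr, line, "low-energy")
    return partial, ordinary


def ed2p(energy, delay):
    """Energy-Delay squared Product: energy multiplied by delay squared."""
    return energy * delay ** 2


if __name__ == "__main__":
    line = bytes(range(LINE_BYTES))
    partial, ordinary = partition_reply(0x1000, line, 0x1000 + 37)
    print(partial.kind, hex(partial.addr), len(partial.payload), partial.wires)
    print(ordinary.kind, hex(ordinary.addr), len(ordinary.payload), ordinary.wires)
```

Because delay enters ED^2P squared, a smaller subblock can win on this metric even when a larger one gives slightly lower execution time, which is consistent with the 4-byte versus 16-byte trade-off reported in the abstract.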


Index Terms:
Tiled chip multiprocessor, energy-efficient architectures, cache coherence protocol, heterogeneous on-chip interconnection network, parallel scientific applications.
Citation:
Antonio Flores, Juan L. Aragón, Manuel E. Acacio, "Heterogeneous Interconnects for Energy-Efficient Message Management in CMPs," IEEE Transactions on Computers, vol. 59, no. 1, pp. 16-28, Jan. 2010, doi:10.1109/TC.2009.129