Issue No.03 - March (2014 vol.63)
pp: 543-556
Hyungjun Kim , Texas A&M University, College Station
Boris Grot , EPFL, Lausanne
Paul V. Gratz , Texas A&M University, College Station
Daniel A. Jimenez , The University of Texas at San Antonio, San Antonio
As processor chips become increasingly parallel, an efficient communication substrate is critical for meeting performance and energy targets. In this work, we target the root cause of network energy consumption through techniques that reduce link and router-level switching activity. We specifically focus on memory subsystem traffic, as it comprises the bulk of NoC load in a CMP. By transmitting only the flits that contain words predicted useful using a novel spatial locality predictor, our scheme seeks to reduce network activity. We aim to further lower NoC energy through microarchitectural mechanisms that inhibit datapath switching activity for unused words in individual flits. Using simulation-based performance studies and detailed energy models based on synthesized router designs and different link wire types, we show that 1) the prediction mechanism achieves very high accuracy, with an average rate of false-unused prediction of just 2.5 percent; 2) the combined NoC energy savings enabled by the predictor and microarchitectural support is 36 percent, on average, and up to 57 percent in the best case; and 3) there is no system performance penalty as a result of this technique.
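The flit-filtering idea in the abstract can be illustrated with a minimal sketch. It assumes a cache line of 16 words carried in flits of 4 words each, with the predictor's output given as a bit vector in which bit i is set when word i is predicted useful. These sizes, the bit-vector encoding, and the function names `flits_to_send` and `word_enables` are illustrative assumptions, not details taken from the paper.

```python
from typing import List


def flits_to_send(used_mask: int,
                  words_per_line: int = 16,
                  words_per_flit: int = 4) -> List[int]:
    """Return the indices of flits holding at least one word predicted useful.

    used_mask is a hypothetical predictor output: bit i set means word i of
    the cache line is predicted to be used. The default sizes are assumptions.
    """
    if words_per_line % words_per_flit != 0:
        raise ValueError("words_per_line must be a multiple of words_per_flit")
    flit_bits = (1 << words_per_flit) - 1
    selected = []
    for flit in range(words_per_line // words_per_flit):
        # Extract this flit's slice of the prediction vector.
        if (used_mask >> (flit * words_per_flit)) & flit_bits:
            selected.append(flit)
    return selected


def word_enables(used_mask: int, flit: int,
                 words_per_flit: int = 4) -> List[bool]:
    """Per-word enable flags inside one flit.

    A False entry marks a word predicted unused, for which datapath
    switching could be inhibited.
    """
    base = flit * words_per_flit
    return [bool((used_mask >> (base + w)) & 1) for w in range(words_per_flit)]


if __name__ == "__main__":
    mask = 0b0000000000110001  # words 0, 4 and 5 predicted useful
    print(flits_to_send(mask))    # flits that would be transmitted
    print(word_enables(mask, 1))  # enable flags for words 4..7
```

In this sketch only two of the four flits would be sent for the example mask, and within the second flit two of the four words would be marked as candidates for gating. How a mispredicted (false-unused) word is recovered is outside the scope of the sketch.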
Radiation detectors, Encoding, Wires, Energy consumption, Switches, Vectors, Memory management, Interconnections, Power management, Cache memory, Spatial locality
Hyungjun Kim, Boris Grot, Paul V. Gratz, Daniel A. Jimenez, "Spatial Locality Speculation to Reduce Energy in Chip-Multiprocessor Networks-on-Chip," IEEE Transactions on Computers, vol. 63, no. 3, pp. 543-556, March 2014, doi:10.1109/TC.2012.238