Issue No.01 - January (2008 vol.57)
pp: 69-81
ABSTRACT
Performance and power act as opposing constraints for optimal pipeline depth of a processor. While increasing the pipeline depth may enable performance improvement, the higher clock speed associated with a deeper pipeline also increases the power dissipation. As simultaneous multi-threading (SMT) becomes increasingly important for modern high-end processors, there is a need to quantify the optimal power-performance pipeline depth for SMT. While previous work has shown that SMT retains the performance-optimal pipeline depth in near-future technologies, this result does not take power into account. The intricate interplay between the relative impacts of changing pipeline depth on power and performance makes it difficult to predict the scaling trends for optimal SMT pipeline depths considering both power and performance. Using simulations, we quantify the optimal SMT pipeline depths based on the well-known power-performance metric PD³ (power × delay³). Our analysis is novel and provides the following key results about the scaling trends for SMT pipelines considering both power and performance: (1) SMT has a deeper PD³-optimal pipeline as compared to superscalar. (2) The PD³-optimal SMT pipeline depth increases with an increase in the number of programs. (3) The PD³-optimal SMT pipeline becomes shallower with technology for a given number of programs.
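As a rough illustration of how a power × delay³ metric selects a pipeline depth, the sketch below scores a few candidate depths and picks the one with the lowest value. This is not the paper's simulation methodology: the function names and every number in `measurements` are hypothetical, chosen only to show that the power-aware optimum can differ from the depth that simply minimizes delay.

```python
def pd3(power_watts, delay_seconds):
    """Return power x delay^3; a lower value means better power-performance."""
    return power_watts * delay_seconds ** 3


def pd3_optimal_depth(candidates):
    """Return the depth whose (power, delay) pair gives the smallest PD^3.

    `candidates` maps a pipeline depth (number of stages) to a
    (power, delay) tuple for a fixed workload.
    """
    return min(candidates, key=lambda depth: pd3(*candidates[depth]))


def performance_optimal_depth(candidates):
    """Return the depth with the smallest delay, ignoring power."""
    return min(candidates, key=lambda depth: candidates[depth][1])


# Hypothetical data, NOT from the paper: deeper pipelines run faster
# (lower delay) with diminishing returns, while power keeps rising.
measurements = {
    10: (40.0, 1.00),
    14: (52.0, 0.88),
    18: (66.0, 0.82),
    22: (84.0, 0.80),
}

best_pd3 = pd3_optimal_depth(measurements)
best_perf = performance_optimal_depth(measurements)
print("PD^3-optimal depth:", best_pd3)
print("Performance-optimal depth:", best_perf)
```

With these made-up values the delay-only optimum is the deepest pipeline, while the PD³ optimum is shallower, because the extra power of the deepest design outweighs its small delay gain.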
INDEX TERMS
Multithreaded processors, Power Management, Performance of Systems
CITATION
Zeshan Chishti, "Optimal Power/Performance Pipeline Depth for SMT in Scaled Technologies," IEEE Transactions on Computers, vol. 57, no. 1, pp. 69-81, January 2008, doi:10.1109/TC.2007.70771