Issue No.04 - April (2013 vol.62)
pp: 813-826
A. Morari, Pacific Northwest Nat. Lab., Richland, WA, USA
C. Boneti, Schlumberger Brazil Res. & Geoengineering Center (BRGC), Houston, TX, USA
F. J. Cazorla, Barcelona Supercomput. Center, Barcelona, Spain
R. Gioiosa, Barcelona Supercomput. Center, Barcelona, Spain
Chen-Yong Cher, Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
A. Buyuktosunoglu, Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
P. Bose, Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
M. Valero, Barcelona Supercomput. Center, Barcelona, Spain
ABSTRACT
While several hardware mechanisms have been proposed to control the interaction between hardware threads in an SMT processor, few have addressed the issue of software-controllable SMT performance. The IBM POWER5 and POWER6 are the first high-performance processors to implement a software-controllable hardware-thread prioritization mechanism, which controls the rate at which each hardware thread decodes instructions. This paper shows the potential of this basic mechanism to improve several target metrics for various applications on POWER5 and POWER6 processors. Our results show that, although the software interface is exactly the same, the software-controlled priority mechanism has a different effect on POWER5 and POWER6. For instance, hardware threads in POWER6 are less sensitive to priorities than in POWER5 because of the POWER6 in-order design. We study SMT thread malleability to enable user-level optimizations that leverage software-controlled thread priorities. We also show how to achieve various system objectives, such as load balancing of parallel applications to reduce execution time. Finally, we characterize user-level transparent execution on POWER5 and POWER6, and identify the workload mix that benefits most from it.
INDEX TERMS
resource allocation, microprocessor chips, multi-threading, simultaneous multithreading, SMT malleability, IBM POWER5 processor, IBM POWER6 processor, hardware mechanism, hardware thread, software-controllable SMT performance, SMT processor, software-controllable hardware-thread prioritization mechanism, software interface, software-controlled priority mechanism, user-level optimization, parallel application load balancing, execution time, user-level transparent execution, Instruction sets, Hardware, Benchmark testing, Kernel, Linux, IBM POWER6, Malleability, hardware-thread priorities, IBM POWER5
CITATION
A. Morari, C. Boneti, F. J. Cazorla, R. Gioiosa, Chen-Yong Cher, A. Buyuktosunoglu, P. Bose, M. Valero, "SMT Malleability in IBM POWER5 and POWER6 Processors", IEEE Transactions on Computers, vol. 62, no. 4, pp. 813-826, April 2013, doi:10.1109/TC.2012.34