Issue No. 10 - October 2008 (vol. 19)
pp. 1396-1410
ABSTRACT
Computing has recently reached an inflection point with the introduction of multi-core processors: on-chip thread-level parallelism is doubling approximately every other year. Concurrency naturally lets a program trade performance for power savings by regulating the number of active cores; in several domains, however, users are unwilling to sacrifice performance to save power. We present a prediction model for identifying energy-efficient operating points of concurrency in well-tuned multithreaded scientific applications, and a runtime system that uses live program analysis to optimize applications dynamically. Our dynamic, phase-aware performance prediction model combines multivariate regression techniques with runtime analysis of data collected from hardware event counters to locate optimal operating points of concurrency. Using this model, we develop a prediction-driven, phase-aware runtime optimization scheme that throttles concurrency so that power consumption is reduced while performance is held at the knee of the scalability curve of each program phase. Prediction reduces the overhead of searching the optimization space while achieving near-optimal performance and power savings. A thorough evaluation of our approach shows a 10.8% reduction in power consumption together with a 17.9% improvement in performance, for combined energy savings of 26.7%.
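The approach summarized above lends itself to a compact illustration. The sketch below is a minimal, hypothetical rendering of the prediction step, not the authors' implementation: it fits one least-squares multivariate regression per candidate thread count, mapping hardware-event-counter rates sampled at a base concurrency to the speedup observed at each candidate concurrency, and then throttles a phase to the thread count that minimizes a crude energy-delay proxy. The feature set, the training data, and the assumption that power scales with the number of active threads are all illustrative placeholders.

    import numpy as np

    # Hypothetical per-phase training data: event-counter rates sampled while
    # running each phase at a base concurrency (say, 4 threads). The three
    # features stand in for counters of the kind the paper draws on, e.g.
    # instructions per cycle, L2 miss rate, and bus transaction rate.
    X_train = np.array([
        [1.20, 0.020, 0.10],
        [0.80, 0.060, 0.25],
        [1.50, 0.010, 0.05],
        [0.60, 0.090, 0.40],
    ])

    # Measured speedup of each training phase at each candidate thread count,
    # relative to the base configuration (one column per candidate).
    THREAD_COUNTS = [2, 4, 8, 16]
    y_train = np.array([
        [0.6, 1.0, 1.7, 2.4],
        [0.7, 1.0, 1.1, 0.9],
        [0.6, 1.0, 1.9, 3.2],
        [0.8, 1.0, 0.9, 0.7],
    ])

    def fit_models(X, y):
        # One least-squares linear model per candidate thread count,
        # with an intercept column appended to the features.
        A = np.hstack([X, np.ones((X.shape[0], 1))])
        coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
        return coeffs  # shape: (num_features + 1, num_candidates)

    def best_thread_count(coeffs, sample):
        # Predict the speedup of a new phase at every candidate concurrency,
        # then pick the count minimizing an energy-delay proxy in which power
        # is assumed (for illustration only) proportional to active threads:
        # energy * delay ~ (power * time) * time ~ n / speedup**2.
        preds = np.append(sample, 1.0) @ coeffs
        scores = [(n / max(s, 1e-6) ** 2, n)
                  for s, n in zip(preds, THREAD_COUNTS)]
        return min(scores)[1]

    coeffs = fit_models(X_train, y_train)
    # A new phase profiled at the base concurrency; throttle it to the
    # predicted-best thread count before its next execution.
    print(best_thread_count(coeffs, np.array([1.1, 0.03, 0.12])))

In the actual system the features come from the platform's hardware event counters and the models are trained offline per architecture; the sketch only mirrors the decision structure, predicting across candidate concurrency levels and then throttling to the best one.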
INDEX TERMS
Energy-aware systems, Modeling and prediction, Application-aware adaptation
CITATION
Filip Blagojevic, Matthew Curtis-Maury, Dimitrios S. Nikolopoulos, "Prediction-Based Power-Performance Adaptation of Multithreaded Scientific Codes", IEEE Transactions on Parallel & Distributed Systems, vol. 19, no. 10, pp. 1396-1410, October 2008, doi:10.1109/TPDS.2007.70804
REFERENCES
[1] T. Anderson, B. Bershad, E. Lazowska, and H. Levy, “Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism,” ACM Trans. Computer Systems, vol. 10, no. 1, pp. 53-79, Feb. 1992.
[2] U. Andersson and P. Mucci, “Analysis and Optimization of Yee Bench Using Hardware Performance Counters,” Proc. Int'l Conf. Parallel Computing (ParCo), 2005.
[3] R. Balasubramonian, S. Dwarkadas, and D.H. Albonesi, “Dynamically Managing the Communication-Parallelism Trade-Off in Future Clustered Processors,” Proc. 30th Int'l Symp. Computer Architecture (ISCA), 2003.
[4] S.Y. Borkar, “Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation,” IEEE Micro, vol. 25, no. 6, pp. 10-16, Nov./Dec. 2005.
[5] C. Cascaval, E. Duesterwald, P. Sweeney, and R. Wisniewski, “Multiple Page Size Modeling and Optimization,” Proc. 14th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT), 2005.
[6] M. Curtis-Maury, J. Dzierwa, C. Antonopoulos, and D. Nikolopoulos, “Online Power-Performance Adaptation of Multithreaded Programs Using Hardware Event-Based Prediction,” Proc. 20th ACM Int'l Conf. Supercomputing (ICS), 2006.
[7] M. Curtis-Maury, J. Dzierwa, C. Antonopoulos, and D. Nikolopoulos, “Online Strategies for High-Performance Power-Aware Thread Execution on Emerging Multiprocessors,” Proc. Second Workshop High-Performance Power-Aware Computing (HP-PAC), 2006.
[8] M. Curtis-Maury, T. Wang, C. Antonopoulos, and D. Nikolopoulos, “Integrating Multiple Forms of Multithreaded Execution on a Multi-SMT System: A Study with Scientific Workloads,” Proc. Second IEEE Int'l Conf. Quantitative Evaluation of Systems (QEST), 2005.
[9] L. Eeckhout and K. De Bosschere, “Statistical Simulation of Superscalar Architectures Using Commercial Workloads,” Proc. Fourth Workshop Computer Architecture Evaluation Using Commercial Workloads (CAECW), 2001.
[10] L. Eeckhout, S. Nussbaum, J. Smith, and K. De Bosschere, “Statistical Simulation: Adding Efficiency to the Computer Designer's Toolbox,” IEEE Micro, vol. 23, no. 5, Sept. 2003.
[11] K. Asanovic et al., “The Landscape of Parallel Computing Research: A View from Berkeley,” Technical Report UCB/EECS-2006-183, Electrical Eng. and Computer Science Dept., Univ. of California, Berkeley, Dec. 2006.
[12] N. Adiga et al., “An Overview of the BlueGene/L Supercomputer,” Proc. IEEE/ACM Supercomputing: Int'l Conf. High-Performance Networking and Computing (SC), 2002.
[13] W. Feng and C. Hsu, “The Origin and Evolution of Green Destiny,” Proc. Seventh IEEE Int'l Symp. Low-Power and High-Speed Chips (COOL Chips), 2004.
[14] V. Freeh, F. Pan, D. Lowenthal, N. Kappiah, R. Springer, B. Rountree, and M. Femal, “Analyzing the Energy-Time Tradeoff in High-Performance Computing Applications,” IEEE Trans. Parallel and Distributed Systems, vol. 18, no. 6, pp. 835-848, June 2007.
[15] R. Ge, X. Feng, and K. Cameron, “Improvement of Power-Performance Efficiency for High-End Computing,” Proc. 19th Int'l Parallel and Distributed Processing Symp. (IPDPS), 2005.
[16] G.A. Grell, J. Dudhia, and D.R. Stauffer, A Description of the Fifth-Generation Penn State/NCAR Mesoscale Model (MM5), NCAR Technical Note NCAR/TN-398 + STR, Nat'l Center for Atmospheric Research, June 1995.
[17] M. Hall and M. Martonosi, Adaptive Parallelism in Compiler-Parallelized Code. Wiley, Aug. 1997.
[18] E. Ipek, S. McKee, B. de Supinski, M. Schulz, and R. Caruana, “Efficiently Exploring Architectural Design Spaces via Predictive Modeling,” Proc. 12th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[19] C. Isci and M. Martonosi, “Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data,” Proc. 36th ACM/IEEE Ann. Int'l Symp. Microarchitecture (MICRO), 2003.
[20] H. Jin, M. Frumkin, and J. Yan, “The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance,” Technical Report NAS-99-011, NASA Ames Research Center, Oct. 1999.
[21] P. Joseph, K. Vaswani, and M. Thazhuthaveetil, “A Predictive Performance Model for Superscalar Processors,” Proc. 39th Int'l Symp. Microarchitecture (MICRO), 2006.
[22] C. Jung, D. Lim, J. Lee, and S. Han, “Adaptive Execution Techniques for SMT Multiprocessor Architectures,” Proc. 10th ACM Symp. Principles and Practice of Parallel Programming (PPOPP), 2005.
[23] R. Kalla, B. Sinharoy, and J. Tendler, “IBM POWER5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro, vol. 24, no. 2, pp. 40-47, Mar./Apr. 2004.
[24] M. Kandemir, W. Zhang, and M. Karakoy, “Runtime Code Parallelization on Chip Multiprocessors,” Proc. Design, Automation, and Test in Europe Conf. (DATE), 2003.
[25] N. Kappiah, V. Freeh, and D. Lowenthal, “Just in Time Dynamic Voltage Scaling: Exploiting Inter-Node Slack to Save Energy in MPI Programs,” Proc. IEEE/ACM Supercomputing: Int'l Conf. High-Performance Computing, Networking Storage, and Analysis (SC), 2005.
[26] T.S. Karkhanis and J.E. Smith, “A First-Order Superscalar Processor Model,” Proc. 31st Int'l Symp. Computer Architecture (ISCA), 2004.
[27] S. Kirkpatrick, C. Gelatt, and M. Vecchi, “Optimization by Simulated Annealing,” Science, vol. 220, no. 4598, pp. 671-680, 1983.
[28] P. Kongetira, K. Aingaran, and K. Olukotun, “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro, vol. 25, no. 2, pp. 21-29, Mar./Apr. 2005.
[29] B. Lee and D. Brooks, “Accurate and Efficient Regression Modelling for Microarchitectural Performance and Power Prediction,” Proc. 12th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[30] J. Li and J. Martínez, “Dynamic Power-Performance Adaptation of Parallel Computation on Chip Multiprocessors,” Proc. 12th Int'l Symp. High-Performance Computer Architecture (HPCA), 2006.
[31] J. Lu, H. Chen, P. Yew, and W. Hsu, “Design and Implementation of a Lightweight Dynamic Optimization System,” J. Instruction-Level Parallelism, vol. 6, pp. 1-24, 2004.
[32] J. Marathe and F. Mueller, “Hardware Profile-Guided Automatic Page Placement for ccNUMA Systems,” Proc. 11th ACM Symp. Principles and Practice of Parallel Programming (PPOPP '06), Mar. 2006.
[33] T. Moseley, J. Kim, D. Connors, and D. Grunwald, “Methods for Modeling Resource Contention on Simultaneous Multithreaded Processors,” Proc. 23rd Int'l Conf. Computer Design (ICCD), 2005.
[34] A. Settle, J. Kihm, A. Janiszewski, and D. Connors, “Architectural Support for Enhanced SMT Job Scheduling,” Proc. 13th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT), 2004.
[35] S. Sharma, C. Hsu, and W. Feng, “Making a Case for a Green500 List,” Proc. Second Workshop High-Performance Power-Aware Computing (HP-PAC), 2006.
[36] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically Characterizing Large Scale Program Behavior,” Proc. 10th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2002.
[37] R. Springer, D. Lowenthal, B. Rountree, and V. Freeh, “Minimizing Execution Time in MPI Programs on an Energy-Constrained, Power-Scalable Cluster,” Proc. 11th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPOPP), 2006.
[38] A. Tucker and A. Gupta, “Process Control and Scheduling Issues for Multiprogrammed Shared-Memory Multiprocessors,” Proc. 12th ACM Symp. Operating Systems Principles (SOSP), 1989.
[39] M. Voss and R. Eigenmann, “Reducing Parallel Overheads through Dynamic Serialization,” Proc. 13th Int'l Parallel Processing Symp. and 10th Symp. Parallel and Distributed Processing (IPPS/SPDP '99), pp. 88-92, Apr. 1999.
[40] L. Yang, X. Ma, and F. Mueller, “Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution,” Proc. IEEE/ACM Supercomputing: Int'l Conf. High-Performance Networking and Computing (SC), 2005.
[41] K. Yue and D. Lilja, “An Effective Processor Allocation Strategy for Multiprogrammed Shared-Memory Multiprocessors,” IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 12, pp. 1246-1258, Dec. 1997.
[42] Y. Zhang and M. Voss, “Runtime Empirical Selection of Loop Schedulers on Hyperthreaded SMPs,” Proc. 19th IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS), 2005.