Optimizing Array-Intensive Applications for On-Chip Multiprocessors
May 2005 (vol. 16 no. 5)
pp. 396-411

Abstract—With energy consumption becoming one of the first-class optimization parameters in computer system design, compilation techniques that consider performance and energy simultaneously are expected to play a central role. In particular, compiling a given application code under performance and energy constraints is becoming an important problem. In this paper, we focus on an on-chip multiprocessor architecture and present a set of code optimization strategies. We first evaluate an adaptive loop parallelization strategy (i.e., a strategy that allows each loop nest to execute using a different number of processors if doing so is beneficial) and measure the potential energy savings when unused processors during execution of a nested loop are shut down (i.e., placed into a power-down or sleep state). Our results show that shutting down unused processors can lead to as much as 67 percent energy savings at the expense of up to 17 percent performance loss in a set of array-intensive applications. To eliminate this performance penalty, we also discuss and evaluate a processor preactivation strategy based on compile-time analysis of nested loops. Based on our experiments, we conclude that an adaptive loop parallelization strategy combined with idle processor shut down and preactivation can be very effective in reducing energy consumption without increasing execution time. We then generalize our strategy and present an application parallelization strategy based on integer linear programming (ILP). Given an array-intensive application, our optimization strategy determines the number of processors to be used in executing each loop nest based on the objective function and additional compilation constraints provided by the user/programmer. Our initial experience with this constraint-based optimization strategy shows that it is very successful in optimizing array-intensive applications on on-chip multiprocessors under multiple energy and performance constraints.
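The selection problem described in the abstract (pick a processor count for each loop nest so that an objective is optimized under user-supplied constraints) can be illustrated with a minimal sketch. This is not the paper's ILP formulation: it enumerates every assignment by brute force instead of calling an ILP solver, it minimizes total energy under a single execution-time limit, and the function name `choose_processor_counts` and all time and energy figures are hypothetical values chosen for illustration.

```python
from itertools import product


def choose_processor_counts(time, energy, time_limit):
    """Pick one processor count per loop nest.

    time[n][k] and energy[n][k] are (hypothetical) estimates for running
    loop nest n on k + 1 processors, with the remaining processors shut down.
    Returns (counts, total_energy, total_time) for the lowest-energy
    assignment whose total time does not exceed time_limit, or None if no
    assignment meets the limit.
    """
    best = None
    options = [range(len(row)) for row in time]
    # Exhaustive search: equivalent to choosing exactly one 0-1 variable
    # per loop nest, which is what an ILP encoding would express.
    for choice in product(*options):
        total_time = sum(time[n][k] for n, k in enumerate(choice))
        if total_time > time_limit:
            continue  # violates the performance constraint
        total_energy = sum(energy[n][k] for n, k in enumerate(choice))
        if best is None or total_energy < best[1]:
            best = ([k + 1 for k in choice], total_energy, total_time)
    return best


# Illustrative estimates for two loop nests on up to four processors.
# Nest 0 scales well; nest 1 gains little from extra processors.
example_time = [[100, 55, 40, 35], [80, 60, 55, 54]]
example_energy = [[100, 110, 120, 140], [80, 120, 165, 216]]

if __name__ == "__main__":
    print(choose_processor_counts(example_time, example_energy, 120))
```

With these made-up estimates and a time limit of 120, the search runs the first nest on three processors and the second on one, which shows why a different processor count per loop nest can beat a single fixed count. Exhaustive enumeration grows exponentially with the number of loop nests, so it is only practical for small inputs; an ILP solver would be the natural choice for larger ones.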

Index Terms:
On-chip multiprocessor, constrained optimization, embedded systems, energy consumption, adaptive loop parallelization, integer linear programming.
Ismail Kadayif, Mahmut Kandemir, Guilin Chen, Ozcan Ozturk, Mustafa Karakoy, Ugur Sezer, "Optimizing Array-Intensive Applications for On-Chip Multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 5, pp. 396-411, May 2005, doi:10.1109/TPDS.2005.57