Issue No.09 - September (2008 vol.19)
pp: 1201-1214
ABSTRACT
One of the critical goals in code optimization for Multi-Processor-System-on-a-Chip (MPSoC) architectures is to minimize the number of off-chip memory accesses, because such accesses can be extremely costly in terms of both performance and power. While conventional data locality optimization techniques can improve the data access pattern of each processor independently, they usually do not consider locality for shared data. This paper proposes a strategy that reduces the number of off-chip references due to shared data. It achieves this goal by restructuring a parallelized application code so that a given data block is accessed by the parallel processors within the same time frame, which maximizes its reuse while it resides in the on-chip memory space. This tends to minimize the number of off-chip references, since the accesses to a given data block are clustered within a short period of time during execution. Our approach employs a polyhedral tool that helps us isolate the computations that manipulate a given data block. To test the effectiveness of our approach, we implemented it using a publicly available compiler infrastructure and conducted experiments with twelve data-intensive embedded applications. Our results show that optimizing data locality for shared data elements is very useful in practice.
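The clustering idea in the abstract can be illustrated with a toy model. The sketch below is not the paper's polyhedral-based algorithm; it only compares two hand-written access schedules against a shared on-chip memory modeled as a small LRU buffer of data blocks. The function name `count_off_chip_fetches`, the LRU replacement policy, the buffer capacity, and both schedules are assumptions introduced purely for illustration.

```python
from collections import OrderedDict


def count_off_chip_fetches(schedule, capacity):
    """Count off-chip fetches for a shared on-chip buffer (assumed LRU).

    schedule: list of time steps; each step lists the data block id
              touched by each processor during that step.
    capacity: number of data blocks the on-chip buffer can hold.
    """
    on_chip = OrderedDict()  # block id -> True, ordered from oldest to newest use
    fetches = 0
    for step in schedule:
        for block in step:
            if block in on_chip:
                on_chip.move_to_end(block)  # hit: refresh recency
            else:
                fetches += 1  # miss: block must come from off-chip memory
                on_chip[block] = True
                if len(on_chip) > capacity:
                    on_chip.popitem(last=False)  # evict least recently used
    return fetches


NUM_PROCS = 4   # hypothetical processor count
NUM_BLOCKS = 8  # hypothetical number of shared data blocks

# Unclustered schedule: each processor visits every block once, but the
# processors start at different offsets, so they rarely share a block
# within the same time step.
original = [
    [(t + 2 * p) % NUM_BLOCKS for p in range(NUM_PROCS)]
    for t in range(NUM_BLOCKS)
]

# Clustered schedule: the same work reordered so that all processors
# touch block t during time step t.
restructured = [[t] * NUM_PROCS for t in range(NUM_BLOCKS)]

print("original fetches:    ", count_off_chip_fetches(original, capacity=2))
print("restructured fetches:", count_off_chip_fetches(restructured, capacity=2))
```

Both schedules perform the same 32 block accesses; only their order differs. In this toy setting the clustered order needs one fetch per block, while the unclustered order evicts each block before another processor reaches it.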
INDEX TERMS
Memory management, Compilers, Optimization, Multiprocessor Systems
CITATION
Guilin Chen, Mahmut Kandemir, "Compiler-Directed Code Restructuring for Improving Performance of MPSoCs", IEEE Transactions on Parallel & Distributed Systems, vol.19, no. 9, pp. 1201-1214, September 2008, doi:10.1109/TPDS.2007.70760