An OpenMP Compiler for Efficient Use of Distributed Scratchpad Memory in MPSoCs
February 2012 (vol. 61 no. 2)
pp. 222-236
Andrea Marongiu, University of Bologna, Bologna
Luca Benini, University of Bologna, Bologna
Most state-of-the-art processors for mobile and embedded systems feature on-chip scratchpad memories. Efficiently exploiting these low-latency, high-bandwidth memory modules requires programming models and/or language features that expose such architectural details. At the same time, making effective use of the limited on-chip memory space requires the programmer to devise an efficient partitioning and distributed placement of shared data at the application level. In this paper, we propose a programming framework that combines the ease of use of OpenMP with simple yet powerful language extensions to trigger array data partitioning. Our compiler exploits profile information on array access counts to automatically generate data allocation schemes optimized for locality of reference.
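To make the idea of profile-driven placement concrete, the following is a minimal sketch and not the allocation algorithm of the paper. It assumes an array has already been split into equally sized tiles, that a profiling run has produced a per-core access count for each tile, and that each core owns one scratchpad of known capacity. The function name `place_tiles` and all of its parameters are hypothetical, and the greedy heuristic is chosen only for brevity.

```python
from typing import List


def place_tiles(access_counts: List[List[int]],
                tile_bytes: int,
                spm_bytes: List[int]) -> List[int]:
    """Greedy, profile-driven placement of array tiles (illustrative only).

    access_counts[t][c] is the profiled number of accesses core c makes
    to tile t. spm_bytes[c] is the capacity of core c's scratchpad.
    Returns, for each tile, the index of the scratchpad that holds it,
    or -1 if the tile stays in shared (off-chip) memory.
    """
    free = list(spm_bytes)                  # remaining bytes per scratchpad
    placement = [-1] * len(access_counts)   # default: shared memory

    # Visit the most frequently accessed tiles first.
    order = sorted(range(len(access_counts)),
                   key=lambda t: sum(access_counts[t]),
                   reverse=True)

    for t in order:
        # Prefer the scratchpad of the core that touches this tile most.
        cores = sorted(range(len(free)),
                       key=lambda c: access_counts[t][c],
                       reverse=True)
        for c in cores:
            if free[c] >= tile_bytes:
                free[c] -= tile_bytes
                placement[t] = c
                break
    return placement
```

For example, with two cores, 1024-byte tiles, 1024-byte scratchpads, and profiled counts `[[90, 10], [5, 80], [30, 20]]`, the sketch places tile 0 next to core 0, tile 1 next to core 1, and leaves the least-accessed tile 2 in shared memory. A greedy pass like this can be far from optimal when capacities are tight; it only illustrates how access counts can steer placement toward locality.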

Index Terms:
MPSoC, OpenMP, array partitioning, multiple scratchpads, NUMA.
Andrea Marongiu, Luca Benini, "An OpenMP Compiler for Efficient Use of Distributed Scratchpad Memory in MPSoCs," IEEE Transactions on Computers, vol. 61, no. 2, pp. 222-236, Feb. 2012, doi:10.1109/TC.2010.199