Compiler Techniques for the Distribution of Data and Computation
June 2003 (vol. 14 no. 6)
pp. 545-562

Abstract: This paper presents a new method that a parallelizing compiler can apply to find, without user intervention, the iteration and data decompositions that minimize communication and load imbalance overheads in parallel programs targeted at NUMA architectures. A key ingredient of our approach is the representation of locality as a Locality-Communication Graph (LCG) and the formulation of the compiler technique as a Mixed Integer Nonlinear Programming (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance, so the solution to the problem is a decomposition that minimizes the parallel execution overhead. This paper summarizes how the compiler extracts locality information from nonannotated code and focuses on how the compiler derives the optimization problem, solves it, and generates parallel code with the automatically selected iteration and data distributions. In addition, we discuss our model and the solutions (the decompositions) that it provides. The approach is evaluated using several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines.
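
The trade-off described in the abstract, communication cost against load imbalance over a graph of program phases, can be illustrated with a toy sketch. This does not reproduce the paper's LCG or its MINLP model. The node names, the two candidate distributions, and every cost value below are hypothetical, and exhaustive enumeration stands in for a MINLP solver.

```python
from itertools import product

# Hypothetical toy graph: each node is a loop nest that touches one shared array.
# IMBALANCE[node][dist] is an assumed load imbalance overhead for that choice.
IMBALANCE = {
    "L1": {"block": 0.0, "cyclic": 0.0},
    "L2": {"block": 6.0, "cyclic": 1.0},  # e.g. a triangular loop nest
    "L3": {"block": 0.0, "cyclic": 0.0},
}

# Edges join consecutive loop nests. The weight is the assumed cost of
# redistributing the array when the two endpoints pick different distributions.
EDGES = {("L1", "L2"): 4.0, ("L2", "L3"): 4.0}

CHOICES = ("block", "cyclic")


def overhead(assign):
    """Total overhead of an assignment: load imbalance plus communication."""
    load = sum(IMBALANCE[node][dist] for node, dist in assign.items())
    comm = sum(w for (a, b), w in EDGES.items() if assign[a] != assign[b])
    return load + comm


def best_decomposition():
    """Enumerate every assignment and return one with minimal overhead."""
    nodes = list(IMBALANCE)
    candidates = (
        dict(zip(nodes, combo)) for combo in product(CHOICES, repeat=len(nodes))
    )
    return min(candidates, key=overhead)


if __name__ == "__main__":
    best = best_decomposition()
    print(best, overhead(best))
```

With these assumed costs, the search keeps a single distribution across all three loop nests, because switching distributions around L2 would cost more in redistribution than it saves in load imbalance. Enumeration grows exponentially with the number of nodes, which is one reason a real compiler would hand a problem of this kind to an optimization solver instead.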

Index Terms:
Parallelizing compiler, locality analysis, load balancing, communication pattern, mixed integer nonlinear programming.
Angeles Navarro, Emilio Zapata, David Padua, "Compiler Techniques for the Distribution of Data and Computation," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 6, pp. 545-562, June 2003, doi:10.1109/TPDS.2003.1206503