Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors
September 1995 (vol. 6 no. 9)
pp. 943-962

Abstract—This paper presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. Our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors and use linear algebraic methods and lattice theory to compute precisely the size of data footprints. We have implemented this framework in a compiler for Alewife, a distributed shared-memory multiprocessor.
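The footprint idea in the abstract can be illustrated with a small brute-force sketch. This is not the paper's method, which computes footprint sizes in closed form using linear algebra and lattice theory; it simply enumerates the iterations of a rectangular tile and counts the distinct array elements they touch. The function name `footprint`, its parameters, and the example references `A[i, j]` and `A[i+1, j]` are assumptions made for illustration only.

```python
from itertools import product


def footprint(tile_sizes, G, offsets):
    """Return the set of distinct array elements touched by one rectangular tile.

    tile_sizes: iterations per loop dimension (the tile spans 0..size-1 in each).
    G: reference matrix with one row per loop index and one column per array
       dimension, so iteration vector i accesses element i*G + a.
    offsets: constant offset vectors a, one per reference sharing the same G
       (an assumed stand-in for a set of uniformly intersecting references).
    """
    touched = set()
    array_dims = len(G[0])
    for i in product(*(range(n) for n in tile_sizes)):
        # Affine index expression: multiply the iteration vector by G.
        base = tuple(
            sum(i[k] * G[k][d] for k in range(len(i))) for d in range(array_dims)
        )
        for a in offsets:
            touched.add(tuple(b + o for b, o in zip(base, a)))
    return touched


# Hypothetical loop body reading A[i, j] and A[i+1, j].
G = [[1, 0], [0, 1]]
offsets = [(0, 0), (1, 0)]

# Three tile shapes with 16 iterations each but different footprints.
for shape in [(4, 4), (16, 1), (1, 16)]:
    print(shape, len(footprint(shape, G, offsets)))
```

In this toy case all three tiles contain 16 iterations, yet they touch 20, 17, and 32 elements respectively, which is the kind of shape dependence that choosing tiling parameters is meant to exploit.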


Index Terms:
Automatic loop partitioning, shared-memory multiprocessors, compilers, tiling, minimizing communication.
Citation:
Anant Agarwal, David A. Kranz, Venkat Natarajan, "Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 6, no. 9, pp. 943-962, Sept. 1995, doi:10.1109/71.466632