A Compiler Optimization Algorithm for Shared-Memory Multiprocessors
August 1998 (vol. 9 no. 8)
pp. 769-787

Abstract: This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel Fortran programs operating over dense matrices. The programs were initially hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests that existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user-parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines.
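
As a rough illustration of what a "simple cache model" for loop nests can look like, the sketch below estimates how many cache lines each array reference touches when a given loop is placed innermost, and picks the loop with the lowest total. This is only an illustrative sketch under stated assumptions, not the algorithm from the paper: the function names, the three-case cost rule, the trip count, and the cache line size are all hypothetical choices made for the example. Arrays are assumed to be stored column-major, as in Fortran.

```python
from math import ceil

def lines_touched(ref, inner, trip, line_elems):
    """Estimate cache lines one array reference touches over `trip`
    iterations of the loop `inner` (hypothetical three-case model).

    `ref` is a tuple of subscript index names in Fortran (column-major)
    order, so ref[0] is the contiguous dimension.
    """
    if inner not in ref:
        return 1                        # loop-invariant: same element reused
    if ref[0] == inner and inner not in ref[1:]:
        return ceil(trip / line_elems)  # unit stride: neighbors share a line
    return trip                         # otherwise assume a new line each iteration

def best_inner_loop(refs, loops, trip, line_elems):
    """Return the loop with the lowest estimated cost when placed innermost,
    together with the per-loop cost table."""
    cost = {
        loop: sum(lines_touched(r, loop, trip, line_elems) for r in refs)
        for loop in loops
    }
    return min(cost, key=cost.get), cost

# References in the matrix-multiply statement C(i,j) = C(i,j) + A(i,k) * B(k,j)
refs = [("i", "j"), ("i", "k"), ("k", "j")]
best, cost = best_inner_loop(refs, ["i", "j", "k"], trip=100, line_elems=4)
print(best, cost)
```

Under these assumptions the model prefers the `i` loop innermost, since `C(i,j)` and `A(i,k)` are then accessed with unit stride and `B(k,j)` is invariant. A parallelizing compiler could then look for parallelism in the remaining outer loops, which also yields coarser-grained parallel work.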

Index Terms:
Program parallelization, parallelization techniques, program optimization, data locality, restructuring compilers, performance evaluation.
Citation:
Kathryn S. McKinley, "A Compiler Optimization Algorithm for Shared-Memory Multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 8, pp. 769-787, Aug. 1998, doi:10.1109/71.706049