Compile Time Barrier Synchronization Minimization
June 2002 (vol. 13 no. 6)
pp. 529-543

This paper presents a new compiler approach to minimizing the number of barriers executed in parallelized programs. A simple procedure is developed to reduce the complexity of barrier placement by eliminating certain data dependences, without affecting optimality. An algorithm is presented which provably places the minimal number of barriers in perfect loop nests and in certain imperfect loop nest structures. This scheme is generalized to handle entire, well-structured control-flow programs containing arbitrary nesting of IF constructs, loops, and subroutines. It has been implemented in a prototype parallelizing compiler and applied to several well-known benchmarks, where it has been shown to place significantly fewer synchronization points than existing techniques. Experiments indicate that, on average, the number of barriers executed is reduced by 70 percent and there is a threefold improvement in execution time when evaluated on a 32-processor SGI Origin 2000.
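To give a flavor of the underlying problem, the core of minimal barrier placement in straight-line (or perfectly nested) code can be cast as an interval covering task: each cross-processor data dependence from statement `src` to statement `sink` must be separated by at least one barrier, and the goal is to cover all such intervals with as few barriers as possible. The sketch below is not the paper's algorithm; it is a minimal illustrative greedy solution to this simplified formulation, with hypothetical statement indices standing in for a real intermediate representation.

```python
# Illustrative sketch only: barrier placement as interval point cover.
# A dependence (src, sink) requires a barrier at some slot p with
# src <= p < sink. Sorting dependences by sink and placing each new
# barrier as late as legally possible yields a minimum-size placement
# (the classic greedy argument for stabbing intervals with points).

def place_barriers(dependences):
    """dependences: list of (src, sink) statement indices, src < sink.
    Returns a minimal list of barrier slots covering every dependence."""
    barriers = []
    for src, sink in sorted(dependences, key=lambda d: d[1]):
        # The most recently placed barrier is the largest slot so far;
        # if it already separates this dependence, nothing to do.
        if barriers and src <= barriers[-1] < sink:
            continue
        # Otherwise place a barrier at the latest slot before the sink.
        barriers.append(sink - 1)
    return barriers
```

For example, dependences `[(0, 3), (1, 4), (5, 7)]` need only two barriers, since a single barrier between statements 2 and 3 separates both of the first two dependences. Eliminating redundant dependences up front, as the paper's preprocessing step does, shrinks the input to exactly this kind of covering problem.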


Index Terms:
Compiler optimization, synchronization reduction, efficient parallelization, barrier minimization, graph algorithms.
Michael O'Boyle, Elena Stöhr, "Compile Time Barrier Synchronization Minimization," IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 6, pp. 529-543, June 2002, doi:10.1109/TPDS.2002.1011394