On Effective Execution of Nonuniform DOACROSS Loops
May 1996 (vol. 7 no. 5)
pp. 463-476

Abstract: DOACROSS loops with nonuniform loop-carried dependences are extremely difficult to parallelize. In this paper, we present a static scheduling scheme with an accompanying synchronization strategy that executes such DOACROSS loops effectively and efficiently. Our approach builds on Dependence Uniformization, a parallelization technique that finds a small set of uniform dependence vectors covering all possible nonuniform dependences in a DOACROSS loop. It differs from previous schemes in that we demonstrate a better way to select the uniform dependence vectors. When used with the Static Strip Scheduling scheme, the proposed uniform dependence vector set allows us to enforce dependences with more locality, which considerably reduces the need for explicit synchronization while retaining most of the parallelism. This paper describes the uniform dependence vector selection strategy and the static strip scheduling scheme, and presents a performance analysis and examples.
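
To make the idea of covering nonuniform dependences concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a hypothetical 2-D loop nest that writes a[2*i + j] and reads a[i + 2*j], enumerates its dependence distance vectors by brute force, and checks that one simple candidate set {(0, 1), (1, -t)} covers them, meaning every distance is a nonnegative integer combination of the two vectors. The loop body, the function names, and the form of the candidate set are all illustrative assumptions; the selection strategy the paper proposes may choose a different set.

```python
from itertools import product

def distance_vectors(n, write_idx, read_idx):
    """Brute-force the dependence distance vectors of an n x n loop nest.

    Two iterations conflict when they touch the same array element and at
    least one of the two accesses is a write. Tuples compare
    lexicographically, which matches sequential execution order.
    """
    iters = list(product(range(n), repeat=2))
    dists = set()
    for src in iters:
        for dst in iters:
            if src >= dst:
                continue  # keep only pairs where src runs before dst
            w_src, r_src = write_idx(*src), read_idx(*src)
            w_dst, r_dst = write_idx(*dst), read_idx(*dst)
            if w_src in (w_dst, r_dst) or r_src == w_dst:
                dists.add((dst[0] - src[0], dst[1] - src[1]))
    return dists

def covered(d, t):
    """True if d = a*(0, 1) + b*(1, -t) for some integers a, b >= 0."""
    b = d[0]
    a = d[1] + b * t
    return a >= 0 and b >= 0

def smallest_t(dists):
    """Smallest t >= 0 such that {(0, 1), (1, -t)} covers every distance."""
    # -(d1 // d0) equals ceil(-d1 / d0) for a positive integer d0.
    return max([0] + [-(d[1] // d[0]) for d in dists if d[0] > 0])

# Hypothetical loop body (an assumption): a[2*i + j] = a[i + 2*j] + 1
dists = distance_vectors(4, lambda i, j: 2 * i + j, lambda i, j: i + 2 * j)
t = smallest_t(dists)

print("distinct distance vectors:", sorted(dists))
print("candidate uniform set:", [(0, 1), (1, -t)])
print("all covered:", all(covered(d, t) for d in dists))
```

Enforcing only the two uniform vectors preserves every original dependence, at the cost of some extra ordering constraints. This is the trade-off between synchronization cost and retained parallelism that the abstract refers to.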

Index Terms:
Compiler transformation, data dependence, loop parallelization, parallelism, scheduling, synchronization.
Ding-Kai Chen, Pen-Chung Yew, "On Effective Execution of Nonuniform DOACROSS Loops," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 5, pp. 463-476, May 1996, doi:10.1109/71.503771