Time Stamp Algorithms for Runtime Parallelization of DOACROSS Loops with Dynamic Dependences
May 2001 (vol. 12 no. 5)
pp. 433-450

Abstract—This paper presents a time stamp algorithm for runtime parallelization of general DOACROSS loops that have indirect access patterns. The algorithm follows the INSPECTOR/EXECUTOR scheme and exploits parallelism at a fine-grained memory reference level. It features a parallel inspector and improves upon previous algorithms of the same generality by exploiting parallelism among consecutive reads of the same memory element. Two variants of the algorithm are considered: One allows partially concurrent reads (PCR) and the other allows fully concurrent reads (FCR). Analyses of their time complexities derive a necessary condition with respect to the iteration workload for runtime parallelization. Experimental results for a Gaussian elimination loop, as well as an extensive set of synthetic loops on a 12-way SMP server, show that the time stamp algorithms outperform iteration-level parallelization techniques in most test cases and gain speedups over sequential execution for loops that have heavy iteration workloads. The PCR algorithm performs best because it makes a better trade-off between maximizing the parallelism and minimizing the analysis overhead. For loops with light or unknown iteration loads, an alternative speculative runtime parallelization technique is preferred.
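The inspector/executor idea described in the abstract can be illustrated with a small sketch. This is not the paper's PCR or FCR algorithm: the inspector here is sequential rather than parallel, the executor is simulated in rounds rather than run on threads, and the loop body `a[w[i]] = a[r[i]] + 1` and all function names are assumptions chosen for illustration. The sketch shows only the core idea that each memory reference receives a time stamp and that consecutive reads of the same element share a stamp, so they may proceed together.

```python
from collections import Counter

def inspect(accesses):
    """Assign a time stamp to each (element, kind) access in program order.

    kind is "R" or "W". Consecutive reads of one element share a stamp;
    every other access takes the next stamp for that element.
    """
    last_stamp, last_kind, stamps = {}, {}, []
    for elem, kind in accesses:
        if elem not in last_stamp:
            s = 0
        elif kind == "R" and last_kind[elem] == "R":
            s = last_stamp[elem]          # concurrent with the previous read
        else:
            s = last_stamp[elem] + 1      # must wait for the previous group
        last_stamp[elem], last_kind[elem] = s, kind
        stamps.append(s)
    return stamps

def run_sequential(a, r, w):
    """Reference result: the original loop executed in order."""
    a = list(a)
    for i in range(len(r)):
        a[w[i]] = a[r[i]] + 1
    return a

def run_time_stamped(a, r, w):
    """Simulated executor: an access runs once its element's clock reaches its stamp."""
    a, n = list(a), len(r)
    accesses = []
    for i in range(n):                    # iteration i reads r[i], then writes w[i]
        accesses += [(r[i], "R"), (w[i], "W")]
    stamps = inspect(accesses)
    remaining = Counter((e, s) for (e, _), s in zip(accesses, stamps))
    clock = Counter()                     # element -> stamp currently permitted
    state = [0] * n                       # 0: read pending, 1: write pending, 2: done
    tmp = [None] * n                      # value read by each iteration
    rounds = 0
    while any(s < 2 for s in state):
        # Every access in `ready` could run concurrently in this round.
        ready = [i for i in range(n) if state[i] < 2
                 and clock[accesses[2 * i + state[i]][0]] == stamps[2 * i + state[i]]]
        touched = set()
        for i in ready:
            k = 2 * i + state[i]
            elem = accesses[k][0]
            if state[i] == 0:
                tmp[i] = a[elem]
            else:
                a[elem] = tmp[i] + 1
            state[i] += 1
            remaining[(elem, stamps[k])] -= 1
            touched.add((elem, stamps[k]))
        for elem, s in touched:           # advance a clock when its group has finished
            if remaining[(elem, s)] == 0:
                clock[elem] = s + 1
        rounds += 1
    return a, rounds
```

For example, calling `run_time_stamped([10, 20, 30, 40, 50], [0, 1, 1, 2, 1, 3], [1, 2, 3, 1, 4, 0])` yields the same array as `run_sequential` on the same inputs, while the two reads of element 1 in iterations 1 and 2 carry the same stamp and are released together.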

Index Terms:
Compiler, parallelizing compiler, runtime support, inspector-executor, doacross loop, dynamic dependence.
Citation:
Cheng-Zhong Xu, Vipin Chaudhary, "Time Stamp Algorithms for Runtime Parallelization of DOACROSS Loops with Dynamic Dependences," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 5, pp. 433-450, May 2001, doi:10.1109/71.926166