Computing Programs Containing Band Linear Recurrences on Vector Supercomputers
August 1996 (vol. 7 no. 8)
pp. 769-782

Abstract—Many large-scale scientific and engineering computations, e.g., some of the Grand Challenge problems [1], spend a major portion of their execution time in core loops computing band linear recurrences (BLRs). Conventional compiler parallelization techniques [4] cannot generate scalable parallel code for this type of computation because they respect loop-carried dependences (LCDs) in programs, and a BLR contains only a limited amount of parallelism with respect to its LCDs. For many applications, replacing the core BLR with library routines requires separating the BLR from its dependent computation, which usually incurs significant overhead. In this paper, we present a new scalable algorithm, called the Regular Schedule, for the parallel evaluation of BLRs. We describe our implementation of the Regular Schedule and discuss how to obtain maximum memory throughput when implementing the schedule on vector supercomputers. We also illustrate our approach, based on the Regular Schedule, to parallelizing programs containing BLRs and other kinds of code. On the Convex C240, we demonstrate significant CPU performance improvements for a range of programs containing BLRs implemented in C using the Regular Schedule, compared with the same programs implemented using highly optimized assembly-coded BLAS routines [11]. Our approach can be used both at the user level, in parallel programming of code containing BLRs, and in compiler parallelization of such programs combined with recurrence recognition techniques for vector supercomputers.

[1] R.K. Agarwal, "Computational Fluid Dynamics on Parallel Processors," a tutorial at the 1992 Sixth Int'l Conf. Supercomputing, Washington, D.C., McDonnell Douglas Research Laboratories, July 1992.
[2] Z. Ammarguellat and W. Harrison III, "Automatic Recognition of Induction Variable and Recurrence Relations by Abstract Interpretation," Proc. ACM SIGPLAN 1990 Conf. Programming Language Design and Implementation, pp. 283-295, White Plains, New York, June 20-22, 1990.
[3] U. Banerjee, S.C. Chen, D. Kuck, and R. Towle, "Time and Parallel Processor Bounds for Fortran-Like Loops," IEEE Trans. Computers, vol. 28, no. 9, pp. 660-670, Sept. 1979.
[4] U. Banerjee, R. Eigenmann, A. Nicolau, and D.A. Padua, "Automatic Program Parallelization," Proc. IEEE, vol. 81, Feb. 1993.
[5] D. Callahan, "Recognizing and Parallelizing Bounded Recurrences," Lecture Notes in Computer Science—Languages and Compilers for Parallel Computing, pp. 169-185, Springer-Verlag, 1992.
[6] S.C. Chen, D. Kuck, and A.H. Sameh, "Practical Parallel Band Triangular System Solvers," ACM Trans. Mathematics Software, vol. 4, pp. 270-277, Sept. 1978.
[7] H. Conn and L. Podrazik, "Parallel Recurrence Solvers for Vector and SIMD Supercomputers," Proc. 1992 Int'l Conf. Parallel Processing, vol. 3, pp. 88-95, Aug. 17-21, 1992.
[8] Convex Computer Corp., Convex Architecture Reference, Richardson, Texas, 1991.
[9] Convex Computer Corp., Convex Theory of Operation (C200 Series), Document No. 081-005030-000, second edition, Richardson, Texas, Sept. 1990.
[10] Convex Computer Corp., Convex SCILIB User's Guide, Document No. 710-013630-001, first edition, Richardson, Texas, Aug. 1991.
[11] Convex Computer Corp., Convex VECLIB User's Guide, Document No. 710-011030-001, sixth edition, Richardson, Texas, Aug. 1991.
[12] J.J. Dongarra, C.B. Moler, J.R. Bunch, and G.W. Stewart, Linpack Users' Guide, Chapter 7, SIAM, Philadelphia, 1979.
[13] R. Eigenmann, J. Hoeflinger, G. Jaxon, Z. Li, and D. Padua, "Restructuring Fortran Programs for Cedar," Proc. ICPP, Aug. 1991.
[14] F.E. Fich, "New Bounds for Parallel Prefix Circuits," Proc. 15th ACM STOC, pp. 100-109, 1983.
[15] D. Gajski, "An Algorithm for Solving Linear Recurrence Systems on Parallel and Pipelined Machines," IEEE Trans. Computers, vol. 30, no. 3, Mar. 1981.
[16] K.A. Gallivan, R.J. Plemmons, and A.H. Sameh, "Parallel Algorithms for Dense Linear Algebra Computations," SIAM Rev., vol. 32, no. 1, pp. 54-135, Mar. 1990.
[17] L. Hyafil and H.T. Kung, "The Complexity of Parallel Evaluation of Linear Recurrence," J. ACM, vol. 24, no. 3, pp. 513-521, July 1977.
[18] R. Karp, R. Miller, and S. Winograd, "The Organization of Computations for Uniform Recurrence Equations," J. ACM, vol. 14, July 1967.
[19] P. Kogge and H. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations," IEEE Trans. Computer, vol. 22, no. 8, Aug. 1973.
[20] D. Kuck, The Structure of Computers and Computations, vol. 1. New York: John Wiley and Sons, 1978.
[21] R.E. Ladner and M.J. Fischer, "Parallel Prefix Computation," J. ACM, vol. 27, no. 4, pp. 831-838, Oct. 1980.
[22] F.T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. San Mateo, Calif.: Morgan Kaufmann, 1992.
[23] F.H. McMahon, "The Livermore Fortran Kernels: A Computer Test of the Numerical Performance Range," Lawrence Livermore National Laboratory, Livermore, Calif., UCRL-53745, Dec. 1986.
[24] A. Nicolau and H. Wang, "Optimal Schedules for Parallel Prefix Computation with Bounded Resources," SIGPLAN Notices and Proc. Third ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, Williamsburg, Va., Apr. 21-24, 1991.
[25] T. Peters, "Livermore Loops Coded in C," Kendall Square Research Corp., Latest File Modification: Oct. 22, 1992, available at
[26] V.P. Roychowdhury, "Derivation, Extensions, and Parallel Implementation of Regular Iterative Algorithms," PhD thesis, Dept. of Electrical Eng., Stanford Univ., Stanford, Calif., Dec. 1988.
[27] S. Pinter and R. Pinter, "Program Optimization and Parallelization Using Idioms," Conf. Record 18th ACM Symp. Principles of Programming Languages, Jan. 1991.
[28] A. Sameh and R. Brent, "Solving Triangular Systems on a Parallel Computer," SIAM J. Numerical Analysis, vol. 14, pp. 1,101-1,113, 1977.
[29] W. Shang and J.A.B. Fortes, "On Time Mapping of Uniform Dependence Algorithms into Lower Dimensional Processor Arrays," IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 5, pp. 350-363, May 1992.
[30] W. Shang and J.A.B. Fortes, "Independent Partitioning of Algorithms with Uniform Dependencies," IEEE Trans. Computers, vol. 41, no. 2, pp. 190-206, Feb. 1992.
[31] W. Shang and J.A.B. Fortes, "Time Optimal Linear Schedules for Algorithms with Uniform Dependencies," IEEE Trans. Computers, vol. 40, June 1991.
[32] M. Snir, "Depth-Size Trade-Offs for Parallel Prefix Computation," J. Algorithms, vol. 7, pp. 185-201, 1986.
[33] Y. Tanaka, "Compiling Techniques for First-Order Linear Recurrence," J. Supercomputing, vol. 4, no. 1, pp. 63-82, Mar. 1990.
[34] N.K. Tsao, "Solving Triangular System in Parallel is Accurate," Numerical Linear Algebra, Digital Signal Processing and Parallel Algorithms, pp. 633-638, G. Golub and P. Van Dooren, eds., NATO Series F: Computer and Systems Sciences, vol. 70, Springer-Verlag, 1991.
[35] H. Wang and A. Nicolau, "Speedup of Band Linear Recurrences in the Presence of Resource Constraints," Proc. Sixth Int'l Conf. Supercomputing, pp. 466-477, Washington, D.C., July 19-23, 1992.
[36] H. Wang and A. Nicolau, "Computing Programs Containing Band Linear Recurrences on Vector Supercomputers," TR 92-113, Dept. of Computer Science, Univ. of California at Irvine, Dec. 1992.

Index Terms:
Band linear recurrences (BLRs), parallel evaluation of BLRs with resource constraints, programs with BLRs, parallel programming, vector supercomputer.
Haigeng Wang, Alexandru Nicolau, Stephen Keung, Kai-Yeung (Sunny) Siu, "Computing Programs Containing Band Linear Recurrences on Vector Supercomputers," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 8, pp. 769-782, Aug. 1996, doi:10.1109/71.532109