Performance and Scalability of Preconditioned Conjugate Gradient Methods on Parallel Computers
May 1995 (vol. 6 no. 5)
pp. 455-469

Abstract—This paper analyzes the performance and scalability of an iteration of the preconditioned conjugate gradient (PCG) algorithm on parallel architectures with a variety of interconnection networks, such as the mesh, the hypercube, and that of the CM-5™ parallel computer. It is shown that for block-tridiagonal matrices resulting from two-dimensional finite difference grids, the communication overhead due to vector inner products dominates the communication overheads of the remainder of the computation on a large number of processors. However, with a suitable mapping, the parallel formulation of a PCG iteration is highly scalable for such matrices on a machine like the CM-5, whose fast control network practically eliminates the overheads due to inner product computation. The use of the truncated Incomplete Cholesky (IC) preconditioner can lead to a further improvement in scalability on the CM-5 by a constant factor. As a result, a parallel formulation of the PCG algorithm with the IC preconditioner may execute faster than one with a simple diagonal preconditioner, even if the latter runs faster in a serial implementation. For the matrices resulting from three-dimensional finite difference grids, the scalability is quite good on a hypercube or the CM-5, but not as good on a 2-D mesh architecture. In the case of unstructured sparse matrices with a constant number of nonzero elements in each row, the parallel formulation of the PCG iteration is unscalable on any message-passing parallel architecture unless some ordering is applied to the sparse matrix. The parallel system can be made scalable either if, after reordering, the nonzero elements of the $N \times N$ matrix can be confined to a band whose width is $O(N^y)$ for any $y < 1$, or if the number of nonzero elements per row increases as $N^x$ for any $x > 0$. Scalability increases as the number of nonzero elements per row is increased and/or the width of the band containing these elements is reduced.
For unstructured sparse matrices, the scalability is asymptotically the same for all architectures. Many of these analytical results are experimentally verified on the CM-5 parallel computer.
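For reference, the iteration the paper analyzes is the standard textbook PCG loop. The sketch below is a serial NumPy illustration with a diagonal (Jacobi) preconditioner, not the paper's parallel formulation; the function name and parameters are mine. The two inner products per iteration (for the step length and the conjugacy coefficient) are the terms whose communication cost the analysis shows can dominate on large processor counts.

```python
import numpy as np

def pcg(A, b, M_inv_diag, tol=1e-8, max_iter=200):
    """Preconditioned conjugate gradient for a symmetric positive definite A.

    M_inv_diag holds the inverse of a diagonal (Jacobi) preconditioner,
    so applying M^{-1} is an elementwise multiply.
    """
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                  # residual
    z = M_inv_diag * r             # preconditioned residual, z = M^{-1} r
    p = z.copy()                   # search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p                 # matrix-vector product (local communication)
        alpha = rz / (p @ Ap)      # step length: inner product needs a global reduction
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv_diag * r
        rz_new = r @ z             # second inner product per iteration
        beta = rz_new / rz         # conjugacy coefficient
        p = z + beta * p
        rz = rz_new
    return x
```

In a message-passing implementation, the matrix-vector product for a banded matrix needs only nearest-neighbor exchanges, while each inner product requires an all-to-one reduction followed by a broadcast; this is the overhead that a fast control network such as the CM-5's can practically eliminate.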

[1] E. Anderson, “Parallel implementation of preconditioned conjugate gradient methods for solving sparse systems of linear equations,” Cent. for Supercomput. Res. and Development, Univ. Illinois, Urbana, IL, Tech. Rep. 805, 1988.
[2] C. Aykanat, F. Ozguner, F. Ercal, and P. Sadayappan, “Iterative algorithms for solution of large sparse systems of linear equations on hypercubes,” IEEE Trans. Comput., vol. 37, pp. 1554–1567, Dec. 1988.
[3] D. L. Eager, J. Zahorjan, and E. D. Lazowska, “Speedup versus efficiency in parallel systems,” IEEE Trans. Comput., vol. 38, pp. 408–423, Mar. 1989.
[4] A. George and J. W.-H. Liu, Computer Solution of Large Sparse Positive Definite Systems. Englewood Cliffs, NJ: Prentice-Hall, 1981.
[5] N. E. Gibbs, W. G. Poole, and P. K. Stockmeyer, “A comparison of several bandwidth and profile reduction algorithms,” ACM Trans. Math. Software, vol. 2, pp. 322–330, 1976.
[6] G. H. Golub and C. Van Loan, Matrix Computations, 2nd ed. Baltimore, MD: The Johns Hopkins University Press, 1989.
[7] A. Grama, A. Gupta, and V. Kumar, “Isoefficiency: Measuring the scalability of parallel algorithms and architectures,” IEEE Parallel and Distrib. Technol., vol. 1, pp. 12–21, Aug. 1993. Also available as Dep. Comput. Sci., Univ. Minnesota, Minneapolis, MN, Tech. Rep. TR 93-24.
[8] A. Gupta and V. Kumar, “A scalable parallel algorithm for sparse matrix factorization,” Dep. Comput. Sci., Univ. Minnesota, Minneapolis, MN, Tech. Rep. 94-19, 1994. A short version appeared in Supercomputing '94.
[9] ——, “The scalability of FFT on parallel computers,” IEEE Trans. Parallel and Distrib. Syst., vol. 4, pp. 922–932, Aug. 1993. A detailed version is available as Dep. Comput. Sci., Univ. Minnesota, Minneapolis, MN, Tech. Rep. TR 90-53.
[10] J. L. Gustafson, “Reevaluating Amdahl's law,” Commun. ACM, vol. 31, no. 5, pp. 532–533, 1988.
[11] J. L. Gustafson, G. R. Montry, and R. E. Benner, “Development of parallel methods for a 1024-processor hypercube,” SIAM J. Scientif. and Statist. Comput., vol. 9, no. 4, pp. 609–638, 1988.
[12] S. W. Hammond and R. Schreiber, “Efficient ICCG on a shared-memory multiprocessor,” Int. J. High Speed Comput., vol. 4, no. 1, pp. 1–22, Mar. 1992.
[13] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, 1993.
[14] C. Kamath and A. H. Sameh, “The preconditioned conjugate gradient algorithm on a multiprocessor,” in Advances in Computer Methods for Partial Differential Equations, R. Vichnevetsky and R. S. Stepleman, Eds. New York: IMACS, 1984.
[15] A. H. Karp and H. P. Flatt, “Measuring parallel processor performance,” Commun. ACM, vol. 33, no. 5, pp. 539–543, 1990.
[16] S. K. Kim and A. T. Chronopoulos, “A class of Lanczos-like algorithms implemented on parallel computers,” Parallel Comput., vol. 17, pp. 763–777, 1991.
[17] K. Kimura and N. Ichiyoshi, “Probabilistic analysis of the efficiency of the dynamic load distribution,” in Proc. Sixth Distrib. Memory Comput. Conf., 1991.
[18] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, 1994.
[19] V. Kumar and A. Gupta, “Analyzing scalability of parallel algorithms and architectures,” Dep. Comput. Sci., Univ. Minnesota, Minneapolis, MN, Tech. Rep. TR 91-18, 1991; to appear in J. Parallel and Distrib. Comput., 1994. A shorter version appears in Proc. 1991 Int. Conf. Supercomput., 1991, pp. 396–405.
[20] C. E. Leiserson, “Fat-trees: Universal networks for hardware efficient supercomputing,” in Proc. 1985 Int. Conf. Parallel Processing, 1985, pp. 393–402.
[21] R. Melhem, “Toward efficient implementation of preconditioned conjugate gradient methods on vector supercomputers,” Int. J. Supercomput. Appl., vol. 1, no. 1, pp. 70–97, 1987.
[22] D. Nussbaum and A. Agarwal, “Scalability of parallel machines,” Commun. ACM, vol. 34, pp. 57–61, 1991.
[23] S. Ranka and S. Sahni, Hypercube Algorithms for Image Processing and Pattern Recognition. New York: Springer-Verlag, 1990.
[24] Y. Saad, “SPARSKIT: A basic tool kit for sparse matrix computations,” Res. Inst. Advanced Comput. Sci., NASA Ames Res. Cen., Moffett Field, CA, Tech. Rep. 90-20, 1990.
[25] Y. Saad and M. H. Schultz, “Parallel implementations of preconditioned conjugate gradient methods,” Dep. Comput. Sci., Yale Univ., New Haven, CT, Tech. Rep. YALEU/DCS/RR-425, 1985.
[26] V. Singh, V. Kumar, G. Agha, and C. Tomlinson, “Scalability of parallel sorting on mesh multicomputers,” Int. J. Parallel Programming, vol. 20, no. 2, 1991.
[27] Z. Tang and G.-J. Li, “Optimal granularity of grid iteration problems,” in Proc. 1990 Int. Conf. Parallel Processing, 1990, pp. I111–I118.
[28] F. A. Van-Catledge, “Toward a general model for evaluating the relative performance of computer systems,” Int. J. Supercomput. Appl., vol. 3, no. 2, pp. 100–108, 1989.
[29] H. A. van der Vorst, “A vectorizable variant of some ICCG methods,” SIAM J. Scientif. and Statist. Comput., vol. 3, no. 3, pp. 350–356, 1982.
[30] ——, “Large tridiagonal and block tridiagonal linear systems on vector and parallel computers,” Parallel Comput., vol. 5, pp. 45–54, 1987.
[31] J. Woo and S. Sahni, “Computing biconnected components on a hypercube,” J. Supercomput., June 1991. Also available as Dep. Comput. Sci., Univ. Minnesota, Minneapolis, MN, Tech. Rep. TR 89-7.
[32] P. H. Worley, “The effect of time constraints on scaled speedup,” SIAM J. Scientif. and Statist. Comput., vol. 11, no. 5, pp. 838–858, 1990.
[33] J. R. Zorbas, D. J. Reble, and R. E. VanKooten, “Measuring the scalability of parallel computer systems,” in Supercomput. '89 Proc., 1989, pp. 832–841.

Anshul Gupta, Vipin Kumar, Ahmed Sameh, "Performance and Scalability of Preconditioned Conjugate Gradient Methods on Parallel Computers," IEEE Transactions on Parallel and Distributed Systems, vol. 6, no. 5, pp. 455-469, May 1995, doi:10.1109/71.382315