Issue No.01 - Jan. (2014 vol.25)
pp: 116-125
Albert-Jan Nicholas Yzelman, Flanders ExaScience Lab. (Intel Labs. Eur.), Leuven, Belgium
Dirk Roose, Dept. of Comput. Sci., KU Leuven, Heverlee, Belgium
ABSTRACT
The sparse matrix-vector multiplication is an important computational kernel, but is hard to efficiently execute even in the sequential case. The problems--namely low arithmetic intensity, inefficient cache use, and limited memory bandwidth--are magnified as the core count on shared-memory parallel architectures increases. Existing techniques are discussed in detail, and categorized chiefly based on their distribution types. Based on this, new parallelization techniques are proposed. The theoretical scalability and memory usage of the various strategies are analyzed, and experiments on multiple NUMA architectures confirm the validity of the results. One of the newly proposed methods attains the best average result in experiments on a large set of matrices. In one of the experiments it obtains a parallel efficiency of 90 percent, while on average it performs close to 60 percent.
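For orientation, the kernel discussed in the abstract can be sketched as follows. This is a minimal sequential illustration that assumes the common CSR (compressed row storage) format; it is not one of the paper's parallel strategies, and the names `spmv_csr` and `parallel_efficiency` are hypothetical. The efficiency helper uses the usual definition (sequential time divided by core count times parallel time), which is assumed here rather than quoted from the paper.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A x for a sparse matrix A stored in CSR format (assumed format)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Each nonzero needs one multiply-add, but also loads a value,
        # a column index, and an indirectly addressed entry of x:
        # few flops per byte moved, i.e. low arithmetic intensity.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


def parallel_efficiency(t_seq, t_par, p):
    """Speedup divided by the number of cores p (standard definition, assumed)."""
    return t_seq / (p * t_par)


# Small 3x3 example: A = [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

Because the rows of `y` are computed independently, the outer loop is the natural place to split work across cores; how the matrix and vectors are distributed at that point is the kind of choice the abstract refers to as distribution types.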
INDEX TERMS
Sparse matrices, Vectors, Kernel, Bandwidth, Indexes, Computer architecture, Particle separators, NUMA architectures, Sparse matrix-vector multiplication, shared-memory parallelism, cache-oblivious, sparse matrix partitioning, matrix reordering, Hilbert space-filling curve, high-performance computing
CITATION
Albert-Jan Nicholas Yzelman, Dirk Roose, "High-Level Strategies for Parallel Shared-Memory Sparse Matrix-Vector Multiplication", IEEE Transactions on Parallel & Distributed Systems, vol.25, no. 1, pp. 116-125, Jan. 2014, doi:10.1109/TPDS.2013.31