2014 21st International Conference on High Performance Computing (HiPC) (2014)
Dec. 17, 2014 to Dec. 20, 2014
Humayun Kabir, Department of Computer Science & Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802
Joshua Dennis Booth, Department of Computer Science & Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802
Padma Raghavan, Department of Computer Science & Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802
We seek to improve the performance of sparse matrix computations on multicore processors with non-uniform memory access (NUMA). Typical implementations use a bandwidth-reducing ordering of the matrix to increase locality of accesses, together with a compressed storage format to store and operate only on the non-zero values. We propose a new multilevel storage format and a companion ordering scheme as an explicit adaptation to map to NUMA hierarchies. More specifically, we propose CSR-k, a multilevel form of the popular compressed sparse row (CSR) format for a multicore processor with k > 1 well-differentiated levels in the memory subsystem. Additionally, we develop Band-k, a modified form of a traditional bandwidth reduction scheme, to convert a matrix represented in CSR to our proposed CSR-k. We evaluate the performance of the widely used and important sparse matrix-vector multiplication (SpMV) kernel using CSR-2 on Intel Westmere processors for a test suite of 12 large sparse matrices with row densities in the range 3 to 45. On 32 cores, on average across all matrices in the test suite, the execution time for SpMV with CSR-2 is less than 42% of the time taken by the state-of-the-art automatically tuned SpMV, resulting in energy savings of approximately 56%. Additionally, on average, the parallel speed-up on 32 cores of the automatically tuned SpMV relative to its 1-core performance is 8.18, compared to a value of 19.71 for CSR-2. Our analysis indicates that the higher performance of SpMV with CSR-2 comes from achieving higher reuse of x in the shared L3 cache without incurring overheads from fill-in of original zeroes. Furthermore, the pre-processing costs of SpMV with CSR-2 can be amortized on average over 97 iterations of SpMV using CSR and are substantially lower than the 513 iterations required for the automatically tuned implementation.
Based on these results, CSR-k appears to be a promising multilevel formulation of CSR for adapting sparse computations to multicore processors with NUMA memory hierarchies.
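To make the idea of a multilevel CSR concrete, the following is a minimal sketch of SpMV (y = Ax) with plain CSR and with a two-level variant. The abstract does not specify the CSR-k layout, so the sketch assumes one plausible reading: an extra pointer array (here called `group_ptr`, a hypothetical name) that groups contiguous rows into "super-rows", each of which could be assigned to one core or cache domain. The function names and the example matrix are illustrative and are not taken from the paper.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Plain CSR SpMV: row i owns entries row_ptr[i] .. row_ptr[i+1]-1."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[p] * x[col_idx[p]]
        y[i] = acc
    return y


def spmv_csr2(group_ptr, row_ptr, col_idx, vals, x):
    """Two-level sketch (assumed layout, not the paper's definition).

    group_ptr[g] .. group_ptr[g+1]-1 are the rows of super-row g.
    The outer loop is the natural unit to hand to one thread, so rows
    that touch nearby parts of x are processed together.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for g in range(len(group_ptr) - 1):          # level 2: super-rows
        for i in range(group_ptr[g], group_ptr[g + 1]):  # level 1: rows
            acc = 0.0
            for p in range(row_ptr[i], row_ptr[i + 1]):
                acc += vals[p] * x[col_idx[p]]
            y[i] = acc
    return y


# Illustrative 4x4 tridiagonal matrix with 4 on the diagonal and 1 off it.
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
vals = [4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0]
x = [1.0, 2.0, 3.0, 4.0]
group_ptr = [0, 2, 4]  # two super-rows of two rows each (assumed grouping)

y_csr = spmv_csr(row_ptr, col_idx, vals, x)
y_csr2 = spmv_csr2(group_ptr, row_ptr, col_idx, vals, x)
```

Both functions compute the same product; the second only adds a grouping level on top of unchanged CSR arrays, which is consistent with the abstract's remark that no fill-in of original zeroes is incurred. How the groups are chosen (the role the abstract assigns to Band-k) is outside this sketch.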
Sparse matrices, Program processors, Multicore processing, Bandwidth, Kernel, Symmetric matrices, Data structures
H. Kabir, J. D. Booth and P. Raghavan, "A multilevel compressed sparse row format for efficient sparse computations on multicore processors," 2014 21st International Conference on High Performance Computing (HiPC), Goa, India, 2014, pp. 1-10.