Parallel and Distributed Processing Symposium, International (2012)
Shanghai, China
May 21-25, 2012
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/IPDPS.2012.61
Dense LU factorization is a prominent benchmark used to rank the performance of supercomputers. Many implementations use block-cyclic distributions of matrix blocks onto a two-dimensional process grid. The process grid dimensions drive a trade-off between communication and computation and are architecture- and implementation-sensitive. The critical panel factorization steps can be made less communication-bound by overlapping asynchronous collectives for pivoting with the computation of rank-k updates. By shifting the computation-communication trade-off, a modified block-cyclic distribution can beneficially exploit more available parallelism on the critical path, and reduce panel factorization's memory hierarchy contention on now-ubiquitous multicore architectures. During active panel factorization, rank-1 updates stream through memory with minimal reuse. In a column-major process grid, the performance of this access pattern degrades as too many streaming processors contend for access to memory. A block-cyclic mapping in row-major order does not encounter this problem, but consequently sacrifices node and network locality in the critical pivoting steps. We introduce 'striding' to vary between the two extremes of row- and column-major process grids. The maximum available parallelism in the critical-path work (active panel factorization, triangular solves, and subsequent broadcasts) is bounded by the length or width of the process grid. Increasing one dimension of the process grid decreases the number of distinct processes and nodes in the other dimension. To increase the harnessed parallelism in both dimensions, we start with a tall process grid. We then apply periodic 'rotation' to this grid to restore the parallelism exploited along rows to its previous level. As a test-bed for further mapping experiments, we describe a dense LU implementation that allows a block distribution to be defined as a general function from blocks to processors.
Other mappings can be tested with only small, local changes to the code.
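To make the mapping ideas concrete, here is a minimal, hypothetical sketch (not the paper's actual code) of a block-to-rank function in the spirit the abstract describes: a 2D block-cyclic distribution whose rank ordering is controlled by a stride parameter `s` that interpolates between a column-major grid (`s = 1`) and a row-major grid (`s = P`), plus a rotation offset `r` applied to the grid-row index. The exact definitions of striding and rotation are assumptions for illustration; the paper defines its own variants.

```c
#include <assert.h>

/* Hypothetical block-to-processor mapping, illustrating a distribution
 * defined as a general function from block coordinates to ranks.
 *
 * (i, j): block row/column in the matrix
 * P, Q:   process grid dimensions; s must divide P*Q
 * s:      stride through the column-major position ordering
 *         (s = 1 -> column-major ranks, s = P -> row-major ranks)
 * r:      rotation offset added to the grid-row index (the paper
 *         rotates the grid periodically; here only the offset is shown)
 */
int block_to_rank(int i, int j, int P, int Q, int s, int r)
{
    int p = (i + r) % P;          /* grid row, after rotation  */
    int q = j % Q;                /* grid column               */
    int t = q * P + p;            /* column-major position index */
    /* Re-enumerate positions with stride s: consecutive ranks occupy
     * positions s apart, so intermediate s values mix row and column
     * locality between the two extremes. */
    return (t % s) * (P * Q / s) + t / s;
}
```

Because the distribution is just this one function, trying a different mapping (pure row-major, blocked-within-node, etc.) only requires swapping the function body, mirroring the paper's claim that other mappings need only small, local code changes.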
Keywords: Program processors, Libraries, Equations, Supercomputers, Parallel processing, Benchmark testing, Computer architecture, ScaLAPACK, dense LU factorization, mapping, memory hierarchy contention, Cray XT, Blue Gene, multicore, Charm++, Intel Nehalem Xeon, AMD Istanbul Opteron, cache miss, parallelism, process grid, HPL, LINPACK
L. Kale, A. Arya, R. Venkataraman, P. Miller, J. Lifflander and T. Jones, "Mapping Dense LU Factorization on Multicore Supercomputer Nodes," International Parallel and Distributed Processing Symposium (IPDPS), Shanghai, China, 2012, pp. 596-606.