2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS) (2016)
Wuhan, Hubei, China
Dec. 13, 2016 to Dec. 16, 2016
This paper designs and implements a scalable QR factorization solver that can deliver the fastest performance for both tall-and-skinny matrices and square matrices on modern supercomputers. The new solver, named scalable universal communication-avoiding QR factorization (suCAQR), introduces a simplified, tuning-free way to realize the communication-avoiding QR factorization algorithm for matrices of any shape. The software design combines a mixed use of physical and logical data layouts, a simplified method of dynamic-root binary-tree reduction, and a dynamic dataflow implementation. Compared with existing communication-avoiding QR factorization implementations, suCAQR is simpler, more general, and more efficient. By balancing the degree of parallelism against the proportion of faster computational kernels, it achieves scalable performance on clusters of multicore nodes. The software essentially combines the strengths of the synchronization-reducing and communication-avoiding approaches to achieve high performance. In experiments using 1,024 CPU cores, suCAQR is up to 30% faster than DPLASMA and up to 30 times faster than ScaLAPACK.
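As a rough illustration of the binary-tree reduction idea behind communication-avoiding QR, the sketch below computes the R factor of a tall-and-skinny matrix by factoring row blocks independently and then merging pairs of R factors up a tree. It is a generic, serial, fixed-pairing sketch in pure Python, not suCAQR's dynamic-root scheme and not built on TBLAS. The function names `qr_r` and `tsqr_r` are hypothetical, and the code assumes every row block has at least as many rows as columns and full column rank.

```python
import math


def qr_r(a):
    """Return the R factor of a (rows >= cols, full column rank).

    Uses modified Gram-Schmidt, so R has a positive diagonal.
    """
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j]
        for k in range(j):
            q = cols[k]  # column k has already been normalized
            r[k][j] = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - r[k][j] * qi for qi, vi in zip(q, v)]
        r[j][j] = math.sqrt(sum(vi * vi for vi in v))
        cols[j] = [vi / r[j][j] for vi in v]
    return r


def tsqr_r(a, num_blocks):
    """Return the R factor of a via a binary-tree reduction over row blocks."""
    size = len(a) // num_blocks
    blocks = [a[i * size:(i + 1) * size] for i in range(num_blocks - 1)]
    blocks.append(a[(num_blocks - 1) * size:])  # last block takes leftover rows

    # Leaf step: each block is factored independently (parallel in practice).
    rs = [qr_r(block) for block in blocks]

    # Tree step: stack neighbouring R factors and re-factor until one remains.
    while len(rs) > 1:
        merged = [qr_r(rs[i] + rs[i + 1]) for i in range(0, len(rs) - 1, 2)]
        if len(rs) % 2 == 1:
            merged.append(rs[-1])  # an unpaired R factor moves up unchanged
        rs = merged
    return rs[0]
```

In a distributed setting, only the small n-by-n R factors would travel between processes at each tree level, which is where the communication savings come from. With a positive diagonal, the R factor of a full-rank matrix is unique, so `tsqr_r` should agree with `qr_r` applied to the whole matrix up to rounding error.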
Layout, Kernel, Algorithm design and analysis, Heuristic algorithms, Shape, Memory, Computer science
W. Zheng, F. Song, L. Lin and Z. Chen, "suCAQR: A Simplified Communication-Avoiding QR Factorization Solver Using the TBLAS Framework," 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS), Wuhan, Hubei, China, 2016, pp. 1092-1099.