Issue No.08 - Aug. (2013 vol.24)
pp: 1613-1621
J. Kurzak, Dept. of Electr. Eng. & Comput. Sci., Univ. of Tennessee, Knoxville, TN, USA
P. Luszczek, Dept. of Electr. Eng. & Comput. Sci., Univ. of Tennessee, Knoxville, TN, USA
M. Faverge, Dept. of Electr. Eng. & Comput. Sci., Univ. of Tennessee, Knoxville, TN, USA
J. Dongarra, Dept. of Electr. Eng. & Comput. Sci., Univ. of Tennessee, Knoxville, TN, USA
ABSTRACT
LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This paper presents an implementation of the algorithm for a hybrid shared-memory system with standard CPU cores and GPU accelerators. The difficulty of implementing the algorithm for such a system lies in the disproportion between the computational power of the CPUs and that of the GPUs, and in the meager bandwidth of the communication link between their memory systems. An additional challenge comes from the memory-bound and synchronization-rich nature of the panel factorization component of the block LU algorithm, which is imposed by the use of partial pivoting. These challenges are tackled with a data layout geared toward complex memory hierarchies, autotuning of GPU kernels, fine-grain parallelization of memory-bound CPU operations, and dynamic scheduling of tasks to different devices. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
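For orientation, the block LU algorithm named in the abstract can be sketched as a sequential, right-looking factorization with three phases per block column: panel factorization with row pivoting, a triangular solve for the row block to the right of the panel, and a matrix-multiply-like update of the trailing submatrix. The sketch below is a plain-Python illustration only, not the paper's implementation: it uses no tile data layout, GPU kernels, or task scheduling, and the names `lu_partial_pivot_blocked`, `multiply_lu`, and the block-size parameter `nb` are hypothetical.

```python
def lu_partial_pivot_blocked(a, nb=2):
    """Right-looking block LU with partial pivoting (illustrative sketch).

    `a` is a square matrix as a list of lists of floats. Returns (lu, perm):
    lu stores the unit-lower factor L below the diagonal and U on and above
    it; perm[i] is the index of the original row now stored in row i.
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy, leave the input untouched
    perm = list(range(n))
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # Phase 1: panel factorization of columns k0..k1-1, column by column.
        for j in range(k0, k1):
            # Partial pivoting: largest-magnitude entry in column j, rows j..n-1.
            p = max(range(j, n), key=lambda i: abs(lu[i][j]))
            if lu[p][j] == 0.0:
                raise ValueError("matrix is singular")
            if p != j:
                lu[j], lu[p] = lu[p], lu[j]  # swap entire rows
                perm[j], perm[p] = perm[p], perm[j]
            for i in range(j + 1, n):
                lu[i][j] /= lu[j][j]
                # Update only the remaining columns of the current panel.
                for c in range(j + 1, k1):
                    lu[i][c] -= lu[i][j] * lu[j][c]
        # Phase 2: triangular solve for the row block right of the panel.
        for j in range(k0, k1):
            for i in range(j + 1, k1):
                for c in range(k1, n):
                    lu[i][c] -= lu[i][j] * lu[j][c]
        # Phase 3: trailing submatrix update (the matrix-multiply-like part).
        for i in range(k1, n):
            for c in range(k1, n):
                s = 0.0
                for j in range(k0, k1):
                    s += lu[i][j] * lu[j][c]
                lu[i][c] -= s
    return lu, perm


def multiply_lu(lu):
    """Multiply the packed factors back together: returns L * U."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for c in range(n):
            for j in range(min(i, c) + 1):
                l_ij = 1.0 if j == i else lu[i][j]
                out[i][c] += l_ij * lu[j][c]
    return out
```

Calling `lu_partial_pivot_blocked(a, nb)` and comparing `multiply_lu(lu)` against the rows of `a` reordered by `perm` checks the factorization. In this sketch, phase 1 is the part that searches each column for a pivot and swaps rows one column at a time, while phase 3 is regular, compute-heavy arithmetic.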
INDEX TERMS
Graphics processing unit, Layout, Tiles, Kernel, Dynamic scheduling, Libraries, Plasmas, GPU, accelerator, Gaussian elimination, LU factorization, partial pivoting, multicore, manycore
CITATION
J. Kurzak, P. Luszczek, M. Faverge, J. Dongarra, "LU Factorization with Partial Pivoting for a Multicore System with Accelerators", IEEE Transactions on Parallel & Distributed Systems, vol.24, no. 8, pp. 1613-1621, Aug. 2013, doi:10.1109/TPDS.2012.242