Issue No.08 - Aug. (2013 vol.24)
pp: 1602-1612
Kai J. Kohlhoff, Stanford University, Stanford
Vijay S. Pande, Stanford University, Stanford
Russ B. Altman, Stanford University, Stanford
We present an implementation of parallel $K$-means clustering, called $K_{ps}$-means, that achieves high performance with near-full occupancy compute kernels without imposing limits on the number of dimensions and data points permitted as input, thus combining flexibility with high degrees of parallelism and efficiency. As a key element to performance improvement, we introduce parallel sorting as data preprocessing and updating steps. Our final implementation for Nvidia GPUs achieves speedups of up to 200-fold over CPU reference code and of up to three orders of magnitude when compared with popular numerical software packages.
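The abstract does not spell out the kernels, so the following is only a minimal sequential sketch of the general idea, assuming that the updating step can be expressed as "sort points by cluster label, then read each centroid off a prefix sum". The function names (`assign`, `update_sorted`, `kmeans`) and the handling of empty clusters are illustrative assumptions, not the authors' GPU implementation; on a GPU the sort and the prefix sums would each be parallel primitives.

```python
from itertools import accumulate


def assign(points, centroids):
    """Label each point with the index of its nearest centroid (squared Euclidean)."""
    def dist2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(range(len(centroids)), key=lambda k: dist2(p, centroids[k]))
            for p in points]


def update_sorted(points, labels, centroids):
    """Recompute centroids from points sorted by label, using prefix sums."""
    k, dims = len(centroids), len(points[0])

    # Sorting step: order point indices so each cluster occupies one contiguous segment.
    order = sorted(range(len(points)), key=lambda i: labels[i])

    # Prefix sum over cluster sizes gives the segment boundaries.
    counts = [0] * k
    for lab in labels:
        counts[lab] += 1
    starts = [0] + list(accumulate(counts))  # cluster c spans starts[c]:starts[c + 1]

    # One prefix sum per dimension over the sorted coordinates.
    prefix = [[0.0] + list(accumulate(points[i][d] for i in order))
              for d in range(dims)]

    new_centroids = []
    for c in range(k):
        lo, hi = starts[c], starts[c + 1]
        if lo == hi:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(list(centroids[c]))
        else:
            # Segment sum is a difference of two prefix-sum entries.
            new_centroids.append([(prefix[d][hi] - prefix[d][lo]) / (hi - lo)
                                  for d in range(dims)])
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of assign/update rounds and return centroids and labels."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        centroids = update_sorted(points, labels, centroids)
    return centroids, assign(points, centroids)
```

Because every cluster becomes a contiguous segment after sorting, each centroid coordinate needs only two reads from the prefix-sum array. That regular access pattern is what makes this formulation a plausible fit for data-parallel hardware.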
Kernel, Instruction sets, Graphics processing unit, Memory management, Sorting, Arrays, Vectors, parallel algorithms, biology and genetics, Clustering algorithms, graphics processors
Kai J. Kohlhoff, Vijay S. Pande, Russ B. Altman, "K-Means for Parallel Architectures Using All-Prefix-Sum Sorting and Updating Steps", IEEE Transactions on Parallel & Distributed Systems, vol.24, no. 8, pp. 1602-1612, Aug. 2013, doi:10.1109/TPDS.2012.234