Mining Projected Clusters in High-Dimensional Spaces
April 2009 (vol. 21 no. 4)
pp. 507-522
Mohamed Bouguessa, University of Sherbrooke, Sherbrooke
Shengrui Wang, University of Sherbrooke, Sherbrooke
Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. To address this problem, a number of projected clustering algorithms have been proposed. However, most of them encounter difficulties when clusters hide in subspaces with very low dimensionality. These challenges motivate our effort to propose a robust partitional distance-based projected clustering algorithm. The algorithm consists of three phases. The first phase performs attribute relevance analysis by detecting dense and sparse regions and their location in each attribute. Starting from the results of the first phase, the goal of the second phase is to eliminate outliers, while the third phase aims to discover clusters in different subspaces. The clustering process is based on the K-means algorithm, with the computation of distance restricted to subsets of attributes where object values are dense. Our algorithm is capable of detecting projected clusters of low dimensionality embedded in a high-dimensional space and avoids the computation of the distance in the full-dimensional space. The suitability of our proposal has been demonstrated through an empirical study using synthetic and real datasets.
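
The central idea of the clustering phase, a K-means-style assignment in which distance is computed only over each cluster's relevant attributes, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, the per-cluster attribute lists in `relevant_dims`, and the normalization by subspace size are all assumptions made for illustration. The attribute relevance analysis and outlier elimination phases are not reproduced here, so the relevant attributes are simply supplied as input.

```python
import math

def restricted_distance(point, centroid, dims):
    """Euclidean distance computed only over the attribute indices in dims."""
    if not dims:
        # A cluster with no relevant attributes cannot claim any point.
        return math.inf
    total = sum((point[d] - centroid[d]) ** 2 for d in dims)
    # Dividing by len(dims) is an assumption of this sketch, so that
    # clusters with subspaces of different sizes remain comparable.
    return math.sqrt(total / len(dims))

def assign(points, centroids, relevant_dims):
    """Assign each point to the centroid that is nearest in that centroid's own subspace."""
    labels = []
    for p in points:
        dists = [
            restricted_distance(p, c, dims)
            for c, dims in zip(centroids, relevant_dims)
        ]
        labels.append(dists.index(min(dists)))
    return labels

# Toy data (hypothetical): four points in a 4-dimensional space.
points = [
    [1.0, 1.1, 9.0, 0.2],
    [1.1, 0.9, 0.3, 7.5],
    [5.0, 8.0, 4.0, 4.1],
    [0.5, 3.0, 4.1, 3.9],
]
centroids = [
    [1.0, 1.0, 5.0, 5.0],
    [5.0, 5.0, 4.0, 4.0],
]
# Cluster 0 is assumed dense on attributes 0 and 1, cluster 1 on attributes 2 and 3.
relevant_dims = [[0, 1], [2, 3]]

labels = assign(points, centroids, relevant_dims)
print(labels)
```

In this toy example the first two points agree with the first centroid only on attributes 0 and 1 and differ widely elsewhere, so a full-dimensional distance would be dominated by the irrelevant attributes, whereas the restricted distance ignores them.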


Index Terms:
Data mining, clustering, mining methods and algorithms
Mohamed Bouguessa, Shengrui Wang, "Mining Projected Clusters in High-Dimensional Spaces," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 4, pp. 507-522, April 2009, doi:10.1109/TKDE.2008.162