Issue No.07 - July (2012 vol.24)

pp: 1291-1305

Lifei Chen, Fujian Normal University, Fuzhou

Shengrui Wang, University of Sherbrooke, Quebec

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.256

ABSTRACT

Clustering high-dimensional data is a major challenge due to the curse of dimensionality. To address this problem, projective clustering has been defined as an extension of traditional clustering that attempts to find projected clusters in subsets of the dimensions of a data space. In this paper, a probability model is first proposed to describe projected clusters in high-dimensional data space. Then, we present a model-based algorithm for fuzzy projective clustering that discovers clusters with overlapping boundaries in various projected subspaces. The suitability of the proposal is demonstrated in an empirical study conducted on synthetic data and several widely used real-world data sets.
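The notion of clusters that overlap while living in different projected subspaces can be illustrated with a minimal sketch. It is not the probability model or the algorithm of the paper, which the abstract does not specify; it only assumes that each cluster has a center and a per-dimension weight (near 0 for irrelevant dimensions), and that soft memberships are obtained by normalizing an exponential of the weighted squared distance. The names `projected_sq_distance`, `fuzzy_memberships`, and the `scale` parameter are hypothetical.

```python
import math


def projected_sq_distance(x, center, weights):
    """Squared distance from x to center, counting each dimension by its weight."""
    return sum(w * (xj - cj) ** 2 for xj, cj, w in zip(x, center, weights))


def fuzzy_memberships(x, clusters, scale=1.0):
    """Soft membership of x in each cluster; the values sum to 1.

    Each cluster is a (center, weights) pair. This weighting scheme is an
    illustrative assumption, not the model proposed in the paper.
    """
    logits = [-projected_sq_distance(x, c, w) / scale for c, w in clusters]
    top = max(logits)  # subtract the maximum to avoid underflow in exp
    scores = [math.exp(l - top) for l in logits]
    total = sum(scores)
    return [s / total for s in scores]


# Cluster A is defined on dimensions 0 and 1; cluster B only on dimension 2.
clusters = [
    ((0.0, 0.0, 0.0), (1.0, 1.0, 0.0)),
    ((5.0, 5.0, 5.0), (0.0, 0.0, 1.0)),
]

# Matches A in its subspace and B in its subspace: shared equally.
print(fuzzy_memberships((0.0, 0.0, 5.0), clusters))
# Close to A in dimensions 0 and 1, far from B in dimension 2: mostly A.
print(fuzzy_memberships((0.1, -0.2, 9.0), clusters))
```

A point that fits both clusters in their respective subspaces receives equal membership in each, which is the kind of overlapping boundary a hard assignment cannot express.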

INDEX TERMS

Clustering, high dimensions, projective clustering, probability model.

CITATION

Lifei Chen, Shengrui Wang, "Model-Based Method for Projective Clustering",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 24, no. 7, pp. 1291-1305, July 2012, doi:10.1109/TKDE.2010.256