
Issue No. 4 - April 2009, vol. 21

pp. 507-522

Mohamed Bouguessa , University of Sherbrooke, Sherbrooke

Shengrui Wang , University of Sherbrooke, Sherbrooke

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.162

ABSTRACT

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. To address this problem, a number of projected clustering algorithms have been proposed. However, most of them encounter difficulties when clusters hide in subspaces with very low dimensionality. These challenges motivate our effort to propose a robust partitional distance-based projected clustering algorithm. The algorithm consists of three phases. The first phase performs attribute relevance analysis by detecting dense and sparse regions and their location in each attribute. Starting from the results of the first phase, the goal of the second phase is to eliminate outliers, while the third phase aims to discover clusters in different subspaces. The clustering process is based on the K-means algorithm, with the computation of distance restricted to subsets of attributes where object values are dense. Our algorithm is capable of detecting projected clusters of low dimensionality embedded in a high-dimensional space and avoids the computation of the distance in the full-dimensional space. The suitability of our proposal has been demonstrated through an empirical study using synthetic and real datasets.
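The core idea of the third phase (K-means with distances restricted to each cluster's relevant attributes) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function name, the `relevant_dims` boolean masks (which in the paper would come from the attribute relevance analysis of the first phase), and the per-dimension normalization are all assumptions, and the outlier-elimination phase is omitted.

```python
import numpy as np

def projected_kmeans(X, relevant_dims, n_iter=20, init=None, seed=0):
    """Sketch of K-means in which the distance to each center is computed
    only over that cluster's subset of relevant (dense) attributes,
    avoiding any distance computation in the full-dimensional space.
    `relevant_dims` is a list of boolean masks, one per cluster; deriving
    these masks is the job of the paper's first phase, not shown here."""
    rng = np.random.default_rng(seed)
    k = len(relevant_dims)
    centers = (X[rng.choice(len(X), size=k, replace=False)].astype(float)
               if init is None else np.asarray(init, dtype=float).copy())
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.empty((len(X), k))
        for j, mask in enumerate(relevant_dims):
            # Restrict the squared distance to the relevant dimensions;
            # divide by the subspace size so clusters whose subspaces
            # have different dimensionality remain comparable
            # (an assumed normalization, for illustration only).
            diff = X[:, mask] - centers[j, mask]
            d[:, j] = (diff ** 2).sum(axis=1) / max(int(mask.sum()), 1)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers
```

On data where two clusters differ only along one attribute and the other attributes are noise, restricting the masks to that attribute lets the sketch recover the clusters even though full-dimensional distances would be dominated by noise.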

INDEX TERMS

Data mining, clustering, mining methods and algorithms

CITATION

Mohamed Bouguessa, Shengrui Wang, "Mining Projected Clusters in High-Dimensional Spaces",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 21, no. 4, pp. 507-522, April 2009, doi:10.1109/TKDE.2008.162