HARP: A Practical Projected Clustering Algorithm
November 2004 (vol. 16 no. 11)
pp. 1387-1397
Kevin Y. Yip
David W. Cheung, IEEE Computer Society
Michael K. Ng
In high-dimensional data, clusters can exist in subspaces that hide themselves from traditional clustering methods. A number of algorithms have been proposed to identify such projected clusters, but most of them rely on some user parameters to guide the clustering process. The clustering accuracy can be seriously degraded if incorrect values are used. Unfortunately, in real situations, it is rarely possible for users to supply the parameter values accurately, which causes practical difficulties in applying these algorithms to real data. In this paper, we analyze the major challenges of projected clustering and suggest why these algorithms need to depend heavily on user parameters. Based on the analysis, we propose a new algorithm that exploits the clustering status to adjust the internal thresholds dynamically without the assistance of user parameters. According to the results of extensive experiments on real and synthetic data, the new method has excellent accuracy and usability. It outperformed the other algorithms even when correct parameter values were artificially supplied to them. The encouraging results suggest that projected clustering can be a practical tool for various kinds of real applications.

Index Terms:
Data mining, mining methods and algorithms, clustering, bioinformatics.
Citation:
Kevin Y. Yip, David W. Cheung, Michael K. Ng, "HARP: A Practical Projected Clustering Algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1387-1397, Nov. 2004, doi:10.1109/TKDE.2004.74