Projective Clustering by Histograms
March 2005 (vol. 17 no. 3)
pp. 369-383
Recent research suggests that clustering high-dimensional data should involve searching for “hidden” subspaces of lower dimensionality, in which patterns can be observed when data objects are projected onto them. Discovering such interattribute correlations and the locations of the corresponding clusters is known as the projective clustering problem. In this paper, we propose an efficient projective clustering technique by histogram construction (EPCH). The histograms help to generate “signatures,” where a signature corresponds to some region in some subspace, and signatures with a large number of data objects are identified as the regions of subspace clusters. Hence, projected clusters and their corresponding subspaces can be uncovered. Compared to the best previous methods known to us, this approach is more flexible in that it requires less prior knowledge of the data set, and it is also much more efficient. Our experiments compare the behavior and performance of this approach and other projective clustering algorithms on data with different characteristics. The results show that our technique scales to very large databases and returns accurate clustering results.
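
The signature idea in the abstract can be illustrated with a minimal Python sketch. It is not the published EPCH procedure: the equi-width one-dimensional histograms, the density rule (a bin is dense when its count exceeds `factor` times the average bin count), and the names `dense_bins` and `signatures` are all assumptions made for illustration only.

```python
from collections import Counter


def dense_bins(values, n_bins, factor):
    """Build an equi-width histogram over one attribute (assumed scheme).

    Returns the lower bound, the bin width, and the set of bin indices
    whose count exceeds factor * (average count per bin).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant attribute
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    threshold = factor * len(values) / n_bins
    return lo, width, {b for b, c in counts.items() if c > threshold}


def signatures(points, n_bins=10, factor=2.0):
    """Count how many points share each signature.

    A signature here is the tuple of (dimension, bin) pairs for the dense
    bins a point falls into, so it names a region in some subspace.
    Points that hit no dense bin get no signature.
    """
    n_dims = len(points[0])
    per_dim = [
        dense_bins([p[d] for p in points], n_bins, factor)
        for d in range(n_dims)
    ]
    sig_counts = Counter()
    for p in points:
        sig = []
        for d, (lo, width, dense) in enumerate(per_dim):
            b = min(int((p[d] - lo) / width), n_bins - 1)
            if b in dense:
                sig.append((d, b))
        if sig:
            sig_counts[tuple(sig)] += 1
    return sig_counts
```

Calling `signatures(points).most_common(3)` on a list of equal-length numeric tuples lists the most populated signatures; under the assumptions above, a signature with a large count suggests a candidate cluster in the subspace spanned by the dimensions it names.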


Index Terms:
Projective clustering, histogram, subspace.
Citation:
Eric Ka Ka Ng, Ada Wai-chee Fu, Raymond Chi-Wing Wong, "Projective Clustering by Histograms," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 369-383, March 2005, doi:10.1109/TKDE.2005.47