Redefining Clustering for High-Dimensional Applications
March/April 2002 (vol. 14 no. 2)
pp. 210-225

Clustering problems are well known in the database literature for their use in numerous applications, such as customer segmentation, classification, and trend analysis. High-dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that, in high-dimensional data, even the concept of proximity or clustering may not be meaningful. We introduce a very general concept of projected clustering, which is able to construct clusters in arbitrarily aligned subspaces of lower dimensionality. The subspaces are specific to the clusters themselves. This definition is substantially more general and realistic than currently available techniques, which limit the method to projections from the original set of attributes. The generalized projected clustering technique may also be viewed as a way of redefining clustering for high-dimensional applications by searching for hidden subspaces containing clusters that arise from interattribute correlations. We provide a new concept of using extended cluster feature vectors in order to make the algorithm scalable for very large databases. The running time and space requirements of the algorithm are adjustable and can be traded off against accuracy.
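
The abstract does not spell out what an extended cluster feature vector stores. As a minimal sketch, assume it is an additive summary holding a point count, per-attribute sums, and sums of pairwise attribute products. Such a summary is enough to recover a cluster's covariance matrix, which captures the interattribute correlations, and to merge two clusters without rescanning their points. The class name `ClusterSummary` and its methods are hypothetical illustrations, not the paper's interface.

```python
from typing import List, Sequence


class ClusterSummary:
    """Hypothetical additive summary of one cluster (an assumption, not the paper's API)."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.count = 0
        # First-order sums: one entry per attribute.
        self.linear_sum: List[float] = [0.0] * dim
        # Second-order sums: one entry per pair of attributes.
        self.product_sum: List[List[float]] = [[0.0] * dim for _ in range(dim)]

    def add(self, point: Sequence[float]) -> None:
        """Fold a single point into the summary."""
        self.count += 1
        for i in range(self.dim):
            self.linear_sum[i] += point[i]
            for j in range(self.dim):
                self.product_sum[i][j] += point[i] * point[j]

    def merge(self, other: "ClusterSummary") -> "ClusterSummary":
        """Combine two summaries by adding their entries; no raw points are needed."""
        merged = ClusterSummary(self.dim)
        merged.count = self.count + other.count
        merged.linear_sum = [a + b for a, b in zip(self.linear_sum, other.linear_sum)]
        merged.product_sum = [
            [a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(self.product_sum, other.product_sum)
        ]
        return merged

    def covariance(self) -> List[List[float]]:
        """Population covariance: E[x_i * x_j] - E[x_i] * E[x_j]."""
        n = self.count
        mean = [s / n for s in self.linear_sum]
        return [
            [self.product_sum[i][j] / n - mean[i] * mean[j] for j in range(self.dim)]
            for i in range(self.dim)
        ]


# Four points on the line y = x, summarized in two halves and then merged.
left, right = ClusterSummary(2), ClusterSummary(2)
for p in [(0.0, 0.0), (1.0, 1.0)]:
    left.add(p)
for p in [(2.0, 2.0), (3.0, 3.0)]:
    right.add(p)

print(left.merge(right).covariance())  # [[1.25, 1.25], [1.25, 1.25]]
```

In this toy case the off-diagonal entries equal the diagonal ones, so the two attributes are perfectly correlated and the points spread along only one direction, which is not aligned with either original axis. An eigendecomposition of such a covariance matrix is one way the low-spread directions of a cluster could be identified.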

Index Terms:
data mining, clustering, high dimensions, dimensionality curse
Citation:
C.C. Aggarwal, P.S. Yu, "Redefining Clustering for High-Dimensional Applications," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 2, pp. 210-225, March-April 2002, doi:10.1109/69.991713