
C.C. Aggarwal and P.S. Yu, "Redefining Clustering for High-Dimensional Applications," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 2, pp. 210-225, March/April 2002.
DOI: http://doi.ieeecomputersociety.org/10.1109/69.991713
Publisher: IEEE Computer Society, Los Alamitos, CA, USA
Keywords: data mining, clustering, high dimensions, dimensionality curse

Clustering problems are well known in the database literature for their use in numerous applications, such as customer segmentation, classification, and trend analysis. High-dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that, in high-dimensional data, even the concept of proximity or clustering may not be meaningful. We introduce a very general concept of projected clustering which is able to construct clusters in arbitrarily aligned subspaces of lower dimensionality. The subspaces are specific to the clusters themselves. This definition is substantially more general and realistic than the currently available techniques, which limit the method to only projections from the original set of attributes. The generalized projected clustering technique may also be viewed as a way of trying to redefine clustering for high-dimensional applications by searching for hidden subspaces with clusters which are created by inter-attribute correlations. We provide a new concept of using extended cluster feature vectors in order to make the algorithm scalable for very large databases. The running time and space requirements of the algorithm are adjustable and are likely to trade off with better accuracy.
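
The abstract does not spell out what an extended cluster feature vector contains, so the following is only a minimal sketch under stated assumptions. It assumes the summary holds a point count, per-attribute sums, and pairwise attribute-product sums, which is enough to merge summaries additively and to recover a covariance matrix without revisiting the points. The names `ExtendedCF` and `variance_along` are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


class ExtendedCF:
    """Hypothetical additive cluster summary (assumed layout, not the paper's)."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.n = 0                                       # number of points
        self.linear = [0.0] * dim                        # sum of x_i
        self.cross = [[0.0] * dim for _ in range(dim)]   # sum of x_i * x_j

    def add(self, point: Sequence[float]) -> None:
        """Fold one point into the summary."""
        self.n += 1
        for i, xi in enumerate(point):
            self.linear[i] += xi
            for j, xj in enumerate(point):
                self.cross[i][j] += xi * xj

    def merge(self, other: "ExtendedCF") -> "ExtendedCF":
        """Summary of the union of two point sets: every field simply adds."""
        out = ExtendedCF(self.dim)
        out.n = self.n + other.n
        out.linear = [a + b for a, b in zip(self.linear, other.linear)]
        out.cross = [[a + b for a, b in zip(ra, rb)]
                     for ra, rb in zip(self.cross, other.cross)]
        return out

    def covariance(self) -> List[List[float]]:
        """Population covariance: E[x_i x_j] - E[x_i] E[x_j]."""
        mean = [s / self.n for s in self.linear]
        return [[self.cross[i][j] / self.n - mean[i] * mean[j]
                 for j in range(self.dim)]
                for i in range(self.dim)]


def variance_along(cov: List[List[float]], direction: Sequence[float]) -> float:
    """Variance of the summarized points along a unit vector v, i.e. v^T C v."""
    d = len(direction)
    return sum(direction[i] * cov[i][j] * direction[j]
               for i in range(d) for j in range(d))


# Four points on the line y = x, summarized in two halves and then merged.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
left, right = ExtendedCF(2), ExtendedCF(2)
for p in points[:2]:
    left.add(p)
for p in points[2:]:
    right.add(p)
summary = left.merge(right)
cov = summary.covariance()

s = 1.0 / math.sqrt(2.0)
along_axis = variance_along(cov, (1.0, 0.0))     # spread along an original attribute
across_diagonal = variance_along(cov, (s, -s))   # spread perpendicular to y = x
print(along_axis, across_diagonal)
```

In this toy data the points are spread out along both original attributes but have no spread in the direction perpendicular to the diagonal. That is the kind of arbitrarily aligned, correlation-induced subspace the abstract refers to, and it cannot be found by projecting onto a subset of the original attributes alone.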