
Dantong Yu, Aidong Zhang, "ClusterTree: Integration of Cluster Representation and Nearest-Neighbor Search for Large Data Sets with High Dimensions," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 5, pp. 1316-1337, September/October 2003.
DOI: 10.1109/TKDE.2003.1232281
Keywords: Indexing, cluster representation, nearest-neighbor search, high-dimensional data sets.
Abstract: In this paper, we introduce the ClusterTree, a new indexing approach for representing clusters generated by any existing clustering approach. A cluster is decomposed into several subclusters and represented as the union of those subclusters. The subclusters can be decomposed further, which isolates the most closely related groups within each cluster. A ClusterTree is a hierarchy of clusters and subclusters that incorporates the cluster representation into the index structure to achieve effective and efficient retrieval. Our cluster representation is highly adaptive to any kind of cluster. It is well accepted that the performance of most existing indexing techniques degrades rapidly as the number of dimensions increases. The ClusterTree provides a practical solution for indexing clustered data sets and supports effective retrieval of nearest neighbors without a linear scan of the high-dimensional data set. We also discuss an approach to dynamically reconstructing the ClusterTree when new data is added. We present a detailed analysis of this approach and justify it extensively with experiments.
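The idea of a hierarchy of clusters and subclusters that doubles as a nearest-neighbor index can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify how clusters are bounded, split, or searched, so the bounding spheres, the simple two-way splitter `split_in_two`, the `leaf_size` parameter, and the branch-and-bound routine `nearest` below are all assumptions chosen only to make the general idea concrete.

```python
import math
import random
from dataclasses import dataclass, field


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


@dataclass
class ClusterNode:
    center: tuple                                  # centroid of the cluster
    radius: float                                  # bounding-sphere radius (assumed representation)
    points: list = field(default_factory=list)     # data points, stored only at leaves
    children: list = field(default_factory=list)   # subclusters


def split_in_two(points, iterations=10):
    """Hypothetical splitter: a few rounds of 2-means seeded with a far-apart pair."""
    a = points[0]
    b = max(points, key=lambda p: dist(p, a))
    a = max(points, key=lambda p: dist(p, b))
    left, right = [], []
    for _ in range(iterations):
        left = [p for p in points if dist(p, a) <= dist(p, b)]
        right = [p for p in points if dist(p, a) > dist(p, b)]
        if not left or not right:
            break
        a, b = centroid(left), centroid(right)
    return left, right


def build(points, leaf_size=8):
    """Recursively decompose a cluster into subclusters until they are small."""
    c = centroid(points)
    r = max(dist(c, p) for p in points)
    if len(points) <= leaf_size:
        return ClusterNode(c, r, points=list(points))
    left, right = split_in_two(points)
    if not left or not right:  # cannot be split further (e.g. identical points)
        return ClusterNode(c, r, points=list(points))
    return ClusterNode(c, r, children=[build(left, leaf_size), build(right, leaf_size)])


def nearest(node, query, best=(math.inf, None)):
    """Branch-and-bound search; returns (distance, point)."""
    # No point inside the sphere can be closer than dist(query, center) - radius.
    if dist(query, node.center) - node.radius >= best[0]:
        return best
    for p in node.points:
        d = dist(query, p)
        if d < best[0]:
            best = (d, p)
    # Visit the closer subcluster first so the bound tightens early.
    for child in sorted(node.children, key=lambda ch: dist(query, ch.center)):
        best = nearest(child, query, best)
    return best


if __name__ == "__main__":
    random.seed(0)
    data = [tuple(random.random() for _ in range(8)) for _ in range(200)]
    tree = build(data)
    print(nearest(tree, tuple(0.5 for _ in range(8))))
```

The search skips any subcluster whose bounding sphere cannot contain a point closer than the best one found so far, which is how a hierarchy of this kind can avoid a linear scan when the data is well clustered.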