ClusterTree: Integration of Cluster Representation and Nearest-Neighbor Search for Large Data Sets with High Dimensions
September/October 2003 (vol. 15 no. 5)
pp. 1316-1337
Dantong Yu and Aidong Zhang

Abstract: In this paper, we introduce the ClusterTree, a new indexing approach for representing clusters generated by any existing clustering approach. A cluster is decomposed into several subclusters and represented as the union of those subclusters. The subclusters can be decomposed further, which isolates the most closely related groups within each cluster. A ClusterTree is a hierarchy of clusters and subclusters that incorporates the cluster representation into the index structure to achieve effective and efficient retrieval. Our cluster representation is highly adaptive to any kind of cluster. It is well accepted that the performance of most existing indexing techniques degrades rapidly as the number of dimensions increases. The ClusterTree provides a practical solution for indexing clustered data sets and supports effective retrieval of nearest neighbors without a linear scan of the high-dimensional data set. We also discuss an approach to dynamically reconstructing the ClusterTree when new data is added. We present a detailed analysis of this approach and justify it extensively with experiments.
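
To make the idea of a cluster hierarchy that doubles as a search index more concrete, the following is a minimal sketch. It is not the ClusterTree algorithm from the paper. It assumes a simplified structure in which every node is summarized by a bounding sphere (centroid plus radius), clusters are split in two with a farthest-point heuristic, and the nearest-neighbor search prunes any subtree whose sphere cannot contain a closer point. All names (`Node`, `build`, `nearest`, `leaf_size`) are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


def dist(a: Point, b: Point) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points: List[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


@dataclass
class Node:
    """A cluster summarized by a bounding sphere (an assumption of this sketch)."""
    center: Point
    radius: float
    points: List[Point] = field(default_factory=list)     # filled for leaves only
    children: List["Node"] = field(default_factory=list)  # subclusters


def split_in_two(points: List[Point]) -> Tuple[List[Point], List[Point]]:
    """Split a cluster around two far-apart seed points (simple heuristic)."""
    c = centroid(points)
    a = max(points, key=lambda p: dist(p, c))  # farthest from the centroid
    b = max(points, key=lambda p: dist(p, a))  # farthest from the first seed
    left = [p for p in points if dist(p, a) <= dist(p, b)]
    right = [p for p in points if dist(p, a) > dist(p, b)]
    return left, right


def build(points: List[Point], leaf_size: int = 4) -> Node:
    """Recursively decompose a cluster into subclusters."""
    c = centroid(points)
    r = max(dist(p, c) for p in points)
    if len(points) <= leaf_size:
        return Node(c, r, points=list(points))
    left, right = split_in_two(points)
    if not left or not right:  # all points coincide, so stop splitting
        return Node(c, r, points=list(points))
    return Node(c, r, children=[build(left, leaf_size), build(right, leaf_size)])


def nearest(node: Node, q: Point,
            best: Optional[Tuple[float, Optional[Point]]] = None
            ) -> Tuple[float, Optional[Point]]:
    """Branch-and-bound nearest-neighbor search; returns (distance, point)."""
    if best is None:
        best = (math.inf, None)
    # By the triangle inequality, no point in this sphere is closer than this bound.
    if dist(q, node.center) - node.radius >= best[0]:
        return best
    if not node.children:
        for p in node.points:
            d = dist(q, p)
            if d < best[0]:
                best = (d, p)
        return best
    # Visit the most promising subcluster first to tighten the bound early.
    for child in sorted(node.children,
                        key=lambda ch: dist(q, ch.center) - ch.radius):
        best = nearest(child, q, best)
    return best


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (9.0, 0.5), (8.8, 0.4)]
    tree = build(data, leaf_size=2)
    print(nearest(tree, (5.05, 5.0)))
```

The search returns the same answer as a linear scan but skips subclusters whose bounding sphere lies entirely outside the current best distance. In high dimensions the spheres of unclustered data overlap heavily and little is pruned, which is why an index of this kind is most useful when the data is genuinely clustered.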

Index Terms:
Indexing, cluster representation, nearest-neighbor search, high-dimensional data sets.
Citation:
Dantong Yu, Aidong Zhang, "ClusterTree: Integration of Cluster Representation and Nearest-Neighbor Search for Large Data Sets with High Dimensions," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 5, pp. 1316-1337, Sept.-Oct. 2003, doi:10.1109/TKDE.2003.1232281