
Chen Li, Edward Chang, Hector Garcia-Molina, Gio Wiederhold, "Clustering for Approximate Similarity Search in High-Dimensional Spaces," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 4, pp. 792-808, July/August 2002. doi: 10.1109/TKDE.2002.1019214
In this paper, we present a clustering and indexing paradigm (called Clindex) for high-dimensional search spaces. The scheme is designed for approximate similarity searches, where one would like to find many of the data points near a target point, but where one can tolerate missing a few near points. For such searches, our scheme can find near points with high recall in very few I/Os and perform significantly better than other approaches. Our scheme is based on finding clusters and then building a simple but efficient index for them. We analyze the trade-offs involved in clustering and building such an index structure, and present extensive experimental results.
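
The general idea of "cluster first, then index the clusters" can be illustrated with a minimal sketch. This is not the Clindex algorithm itself, whose details the abstract does not give. It assumes a fixed-size grid as a stand-in for clustering, an in-memory dictionary as the index, and centroid distance as the rule for choosing which clusters to read. All function names and parameters (`build_index`, `approx_knn`, `cell_size`, `max_clusters`) are hypothetical. Reading one cluster loosely models one I/O, and `recall` measures how many of the true near points the approximate search returned.

```python
import math
from collections import defaultdict


def build_index(points, cell_size):
    """Group points into grid cells (an assumed stand-in for clusters)."""
    index = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / cell_size) for x in p)
        index[key].append(p)
    return index


def centroid(members):
    """Mean point of a non-empty list of equal-length tuples."""
    dim = len(members[0])
    return tuple(sum(p[i] for p in members) / len(members) for i in range(dim))


def approx_knn(index, query, k, max_clusters):
    """Return up to k near points, reading at most max_clusters clusters."""
    # Rank clusters by the distance from their centroid to the query.
    ranked = sorted(index.values(), key=lambda m: math.dist(centroid(m), query))
    candidates = []
    for members in ranked[:max_clusters]:  # each cluster read models one I/O
        candidates.extend(members)
    candidates.sort(key=lambda p: math.dist(p, query))
    return candidates[:k]


def exact_knn(points, query, k):
    """Brute-force baseline: scan every point."""
    return sorted(points, key=lambda p: math.dist(p, query))[:k]


def recall(approx, exact):
    """Fraction of the exact near points that the approximate search found."""
    return len(set(approx) & set(exact)) / len(exact)


if __name__ == "__main__":
    pts = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9), (5.0, 5.0), (5.1, 5.2)]
    idx = build_index(pts, cell_size=1.0)
    q = (3.0, 3.0)
    found = approx_knn(idx, q, k=3, max_clusters=1)
    print(found, recall(found, exact_knn(pts, q, 3)))
```

Raising `max_clusters` trades more cluster reads for higher recall; when every cluster is read, the result matches the brute-force scan.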