Clustering for Approximate Similarity Search in High-Dimensional Spaces
July/August 2002 (vol. 14 no. 4)
pp. 792-808

In this paper, we present a clustering and indexing paradigm, called Clindex, for high-dimensional search spaces. The scheme is designed for approximate similarity searches, where one would like to find many of the data points near a target point but can tolerate missing a few of them. For such searches, our scheme finds near points with high recall in very few I/Os and performs significantly better than other approaches. The scheme first finds clusters and then builds a simple but efficient index for them. We analyze the trade-offs involved in clustering and in building such an index structure, and we present extensive experimental results.
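
The general idea of cluster-then-index approximate search can be illustrated with a minimal sketch. This is not the Clindex algorithm itself, whose details are not given in the abstract. As a stand-in for clustering, it assumes a fixed-width grid in which each cell acts as one cluster bucket, and it scans only the bucket containing the query. The function names (`cell_of`, `build_index`, `approx_neighbors`, `exact_neighbors`, `recall`) and the `width` parameter are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def cell_of(point, width):
    """Map a point to the grid cell that contains it (assumed stand-in for a cluster id)."""
    return tuple(math.floor(x / width) for x in point)


def build_index(points, width):
    """Group points by grid cell; each cell plays the role of one cluster bucket."""
    index = defaultdict(list)
    for p in points:
        index[cell_of(p, width)].append(p)
    return index


def approx_neighbors(index, query, width, k):
    """Return up to k near points, scanning only the bucket that holds the query."""
    candidates = index.get(cell_of(query, width), [])
    return sorted(candidates, key=lambda p: math.dist(p, query))[:k]


def exact_neighbors(points, query, k):
    """Brute-force k nearest neighbors, used here only as ground truth."""
    return sorted(points, key=lambda p: math.dist(p, query))[:k]


def recall(approx, exact):
    """Fraction of the true nearest neighbors that the approximate search returned."""
    if not exact:
        return 1.0
    return len(set(approx) & set(exact)) / len(exact)


points = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.9), (1.1, 0.1), (5.0, 5.0)]
index = build_index(points, width=1.0)
query = (0.95, 0.2)

approx = approx_neighbors(index, query, width=1.0, k=2)
exact = exact_neighbors(points, query, k=2)
print(approx, exact, recall(approx, exact))
```

In this toy data set the true nearest point, (1.1, 0.1), lies just across a cell boundary from the query, so the single-bucket scan misses it and recall is 0.5. This is the kind of miss an approximate search tolerates in exchange for reading only one bucket.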

Index Terms:
Approximate search, clustering, high-dimensional index, similarity search.
Citation:
Chen Li, Edward Chang, Hector Garcia-Molina, Gio Wiederhold, "Clustering for Approximate Similarity Search in High-Dimensional Spaces," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 4, pp. 792-808, July-Aug. 2002, doi:10.1109/TKDE.2002.1019214