Issue No.06 - June (2011 vol.23)
pp: 815-830
Sharadh Ramaswamy, Mayachitra Inc., Santa Barbara
Kenneth Rose, University of California, Santa Barbara
ABSTRACT
We consider approaches for similarity search in correlated, high-dimensional data sets, which are derived within a clustering framework. We note that indexing by “vector approximation” (VA-File), which was proposed as a technique to combat the “Curse of Dimensionality,” employs scalar quantization and hence necessarily ignores dependencies across dimensions, which is a source of suboptimality. Clustering, on the other hand, exploits interdimensional correlations and thus yields a more compact representation of the data set. However, existing methods to prune irrelevant clusters are based on bounding hyperspheres and/or bounding rectangles, whose lack of tightness compromises their efficiency in exact nearest-neighbor search. We propose a new cluster-adaptive distance bound, based on separating hyperplane boundaries of Voronoi clusters, to complement our cluster-based index. This bound enables efficient spatial filtering with a relatively small preprocessing storage overhead, and it is applicable to Euclidean and Mahalanobis similarity measures. Experiments in exact nearest-neighbor set retrieval, conducted on real data sets, show that our indexing method is scalable with data set size and data dimensionality and outperforms several recently proposed indexes. Relative to the VA-File, over a wide range of quantization resolutions, it is able to reduce random IO accesses, given (roughly) the same amount of sequential IO operations, by factors reaching 100X and more.
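The idea of bounding query-to-cluster distance with separating hyperplanes can be illustrated with a minimal sketch. It is not the authors' bound or index: it assumes Euclidean distance, distinct centroids, and points assigned to their nearest centroid, and it uses only the bisecting hyperplane between each pair of centroids (the abstract does not give the exact construction). The function names `assign_to_nearest`, `hyperplane_lower_bounds`, and `exact_nn` are hypothetical. For a query q on the far side of the bisector between centroids c_i and c_j, the distance from q to that bisector is a lower bound on the distance from q to every point in cluster i; the largest such bound over j is used to order and prune clusters.

```python
import numpy as np


def assign_to_nearest(data, centroids):
    """Voronoi assignment: label each point with the index of its nearest centroid."""
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def hyperplane_lower_bounds(query, centroids):
    """Lower-bound the Euclidean distance from `query` to each Voronoi cell.

    Entry i is the largest distance from the query to any bisecting
    hyperplane H_ij that separates the query from centroid i (0 if none does).
    Assumes all centroids are distinct.
    """
    # Squared distance from the query to each centroid, shape (K,)
    d2 = ((centroids - query) ** 2).sum(axis=1)
    # Pairwise centroid distances, shape (K, K); the diagonal is ignored
    gaps = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(gaps, np.inf)
    # Signed distance from the query to H_ij, positive on centroid j's side
    signed = (d2[:, None] - d2[None, :]) / (2.0 * gaps)
    return np.maximum(signed, 0.0).max(axis=1)


def exact_nn(query, data, centroids, labels):
    """Exact nearest neighbor: scan clusters in bound order, stop when pruned."""
    bounds = hyperplane_lower_bounds(query, centroids)
    best_dist, best_idx, visited = np.inf, -1, 0
    for c in np.argsort(bounds):
        if bounds[c] >= best_dist:
            break  # no remaining cluster can hold a closer point
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            continue
        visited += 1
        dists = np.linalg.norm(data[members] - query, axis=1)
        k = dists.argmin()
        if dists[k] < best_dist:
            best_dist, best_idx = dists[k], members[k]
    return best_idx, best_dist, visited
```

Because every pruned cluster has a lower bound no smaller than the best distance already found, the search returns the same neighbor as a full scan; how many clusters it skips depends on how tight the bound is for the data at hand.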
INDEX TERMS
Multimedia databases, indexing methods, similarity measures, clustering, image databases.
CITATION
Sharadh Ramaswamy, Kenneth Rose, "Adaptive Cluster Distance Bounding for High-Dimensional Indexing", IEEE Transactions on Knowledge & Data Engineering, vol.23, no. 6, pp. 815-830, June 2011, doi:10.1109/TKDE.2010.59