Issue No.06 - June (2008 vol.20)
pp: 768-783
ABSTRACT
We examine the problem of efficient distance-based similarity search over high-dimensional data. A promising approach to this problem is to reduce dimensions and allow fast approximation. Conventional reduction approaches, however, entail a significant shortcoming: the approximation volume extends across the dataspace, which causes over-estimation of retrieval sets and impairs performance. This paper focuses on a new criterion for dimensionality reduction methods: bounded approximation. We show that this requirement can be accomplished by a novel non-linear transformation scheme that extracts two important parameters from the data. We devise two approximation formulations, rectangular and spherical range search, each corresponding to a closed volume around the original search sphere. We discuss in detail how to derive tight bounds for the parameters, prove further results, and highlight insights into the problems and our proposed solutions. To demonstrate the benefits of the new criterion, we study the effects of (un)boundedness on approximation performance, including selectivity, error toleration, and efficiency. Extensive experiments confirm the superiority of this technique over recent state-of-the-art schemes.
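The shortcoming of conventional reduction named in the abstract can be pictured with a minimal sketch. This is not the paper's non-linear transformation; it assumes the simplest conventional reduction, keeping only the first k coordinates, and the names `reduce_dims` and `range_search` are hypothetical. Because the reduced distance never exceeds the true distance, the filter step loses no true answer, but its search region is unbounded along the dropped dimensions, so it can admit points that are arbitrarily far from the query.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reduce_dims(p, k):
    """Hypothetical conventional reduction: keep the first k coordinates."""
    return p[:k]


def range_search(data, q, r, k):
    """Filter-and-refine range search with radius r around query q.

    Returns (candidates, hits): candidates pass the reduced-space test,
    hits are the candidates that also pass the full-space test.
    """
    q_red = reduce_dims(q, k)
    # Filter: dropping coordinates can only shrink a Euclidean distance,
    # so every true hit survives this test.
    candidates = [p for p in data if dist(reduce_dims(p, k), q_red) <= r]
    # Refine: check the surviving candidates in the original space.
    hits = [p for p in candidates if dist(p, q) <= r]
    return candidates, hits


random.seed(0)
data = [tuple(random.random() for _ in range(16)) for _ in range(2000)]
q = data[0]
candidates, hits = range_search(data, q, 0.5, 4)
print(len(candidates), "candidates,", len(hits), "true hits")

# A point that matches q in the kept dimensions but lies far away in the
# dropped ones still passes the filter: the approximation volume is unbounded.
far = q[:4] + (10.0,) * 12
far_candidates, far_hits = range_search([far], q, 0.5, 4)
```

The gap between the candidate count and the hit count is the over-estimation of the retrieval set; a bounded approximation, as the abstract describes it, would confine the filter region to a closed volume around the original search sphere.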
INDEX TERMS
Information Storage and Retrieval, Information Search and Retrieval, Search process
CITATION
Khanh Vu, Kien A. Hua, Hao Cheng, Sheau-Dong Lang, "Bounded Approximation: A New Criterion for Dimensionality Reduction Approximation in Similarity Search", IEEE Transactions on Knowledge & Data Engineering, vol.20, no. 6, pp. 768-783, June 2008, doi:10.1109/TKDE.2008.30