Issue No.06 - June (2008 vol.20)

pp: 768-783

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.30

ABSTRACT

We examine the problem of efficient distance-based similarity search over high-dimensional data. A promising approach to this problem is to reduce dimensions and allow fast approximation. Conventional reduction approaches, however, entail a significant shortcoming: the approximation volume extends across the dataspace, which causes over-estimation of retrieval sets and impairs performance. This paper focuses on a new criterion for dimensionality reduction methods: bounded approximation. We show that this requirement can be accomplished by a novel non-linear transformation scheme that extracts two important parameters from the data. We devise two approximation formulations, rectangular and spherical range search, each corresponding to a closed volume around the original search sphere. We discuss in detail how to derive tight bounds for the parameters and prove further results, and we highlight insights into the problems and our proposed solutions. To demonstrate the benefits of the new criterion, we study the effects of (un)boundedness on approximation performance, including selectivity, error tolerance, and efficiency. Extensive experiments confirm the superiority of this technique over recent state-of-the-art schemes.
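
The over-estimation that the abstract attributes to conventional reduction can be illustrated with a minimal sketch. It is not the paper's transformation: it assumes a deliberately simple linear reduction (keeping the first `k` coordinates) and a generic filter-and-refine range search. The function names, the synthetic uniform data, and the parameter values are all hypothetical. Because the projection ignores the dropped coordinates, the filter region is unbounded along them, so many points far from the query still become candidates.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reduce_dims(v, k):
    """Hypothetical reduction: keep the first k coordinates (an orthogonal projection)."""
    return v[:k]


def range_search(data, query, radius, k):
    """Filter in the reduced space, then refine with full-dimensional distances."""
    q_red = reduce_dims(query, k)
    # Filter: the reduced distance never exceeds the true distance,
    # so no true answer is dropped, but false positives are admitted.
    candidates = [p for p in data if dist(reduce_dims(p, k), q_red) <= radius]
    # Refine: discard false positives using the original vectors.
    results = [p for p in candidates if dist(p, query) <= radius]
    return candidates, results


# Synthetic data (an assumption for illustration): uniform points in the unit cube.
random.seed(0)
d, k, radius = 16, 2, 0.5
data = [[random.random() for _ in range(d)] for _ in range(2000)]
query = [0.5] * d

candidates, results = range_search(data, query, radius, k)
exact = [p for p in data if dist(p, query) <= radius]

print("candidates after filter:", len(candidates))
print("true results after refine:", len(results))
```

The gap between the two counts is the cost of the unbounded filter region: every extra candidate must be fetched and checked against the original vectors. A bounded approximation, as proposed in the abstract, aims to keep the filter region a closed volume around the original search sphere.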

INDEX TERMS

Information Storage and Retrieval, Information Search and Retrieval, Search process

CITATION

Kien A. Hua, Hao Cheng, Khanh Vu, "Bounded Approximation: A New Criterion for Dimensionality Reduction Approximation in Similarity Search",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 20, no. 6, pp. 768-783, June 2008, doi:10.1109/TKDE.2008.30