Issue No.10 - October (2009 vol.21)

pp: 1447-1460

Xiang Lian, Hong Kong University of Science and Technology, Hong Kong

Lei Chen, Hong Kong University of Science and Technology, Hong Kong

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.170

ABSTRACT

Similarity search usually encounters a serious problem in the high-dimensional space, known as the “curse of dimensionality.” To improve retrieval efficiency, most previous approaches reduce the dimensionality of the entire data set to a fixed lower value before building indexes (referred to as global dimensionality reduction (GDR)). More recent works focus on locally reducing the dimensionality of data to different values (called local dimensionality reduction (LDR)). In addition, random projection has been proposed as an approximate dimensionality reduction (ADR) technique that answers approximate similarity searches instead of exact ones. However, so far little work has formally evaluated the effectiveness and efficiency of GDR, LDR, and ADR for the range query. Motivated by this, in this paper, we propose general cost models for evaluating the query performance over data sets reduced by GDR, LDR, and ADR, in light of which we introduce a novel (A)LDR method, Partitioning based on RANdomized Search (PRANS). It can achieve high retrieval efficiency with the guarantee of optimality given by the formal models. Finally, a ${\rm B}^{+}$-tree index is constructed over the reduced partitions for fast similarity search. Extensive experiments validate the correctness of our cost models on both real and synthetic data sets and demonstrate the efficiency and effectiveness of the proposed PRANS method.
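To make the ADR idea in the abstract concrete, the following is a minimal sketch of an approximate range query over randomly projected data. It is not the paper's PRANS method or its cost model. The Gaussian projection matrix, the `slack` parameter, the function names, and the synthetic uniform data are all assumptions chosen for illustration. Candidates are filtered in the reduced space and then refined with exact distances, so the answer contains no false positives but may miss some true results.

```python
import math
import random


def make_projection(d, k, rng):
    """Build a k x d Gaussian random projection matrix scaled by 1/sqrt(k)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Map a d-dimensional vector x to k dimensions."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def dist(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def approx_range_query(data, reduced, matrix, q, radius, slack=0.2):
    """Filter in the reduced space, then refine with exact distances.

    `slack` (a hypothetical tuning knob) widens the filter radius because
    random projection only approximately preserves distances.
    Returns the refined hit indices and the number of candidates examined.
    """
    q_red = project(matrix, q)
    limit = radius * (1.0 + slack)
    candidates = [i for i, p in enumerate(reduced) if dist(p, q_red) <= limit]
    hits = [i for i in candidates if dist(data[i], q) <= radius]
    return hits, len(candidates)


# Synthetic example: 500 uniform points in 64 dimensions, reduced to 16.
rng = random.Random(0)
d, k = 64, 16
data = [[rng.random() for _ in range(d)] for _ in range(500)]
matrix = make_projection(d, k, rng)
reduced = [project(matrix, x) for x in data]

query = data[0]
radius = 2.8
hits, n_candidates = approx_range_query(data, reduced, matrix, query, radius)

# Exact answer by linear scan, for comparison only.
exact = [i for i, x in enumerate(data) if dist(x, query) <= radius]
```

Comparing `len(hits)` against `len(exact)` gives the recall of the approximate answer, and `n_candidates` indicates how much refinement work the filter step left over; these are the kinds of effectiveness and efficiency quantities that a cost model would aim to predict.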

INDEX TERMS

High-dimensionality reduction, similarity search.

CITATION

Xiang Lian, Lei Chen, "General Cost Models for Evaluating Dimensionality Reduction in High-Dimensional Spaces",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 21, no. 10, pp. 1447-1460, October 2009, doi:10.1109/TKDE.2008.170