Issue No.02 - February (2012 vol.24)

pp: 365-382

Xiang Lian, University of Texas - Pan American, Edinburg

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.219

ABSTRACT

Similarity search has been widely used in many applications such as information retrieval, image data analysis, and time-series matching. Previous work on similarity search usually considers the search problem in the full space. In this paper, however, we tackle the problem of subspace similarity search, which finds all data objects that match a query object in a subspace rather than in the original full space. In particular, the query object can specify an arbitrary subspace with an arbitrary number of dimensions. Because the number of possible subspaces that users can specify is exponential, we introduce an efficient and effective pruning technique, which assigns scores to data objects with respect to pivots and prunes candidates via these scores. We propose an effective multipivot-based method to preprocess data objects by selecting appropriate pivots, where the entire procedure is guided by a formal cost model so that the pruning power is maximized. The scores of each data object are then organized in sorted lists to facilitate an efficient subspace similarity search. Furthermore, much real-world application data, such as image databases, time-series data, and sensory data, often contains noise and can therefore be modeled as uncertain objects. Compared with certain data, efficient query processing on uncertain data is more challenging because of the intensive computation of probability confidences. Thus, it is also crucial to answer subspace queries efficiently and effectively over uncertain objects. Specifically, we define a novel query, namely the probabilistic subspace range query (PSRQ) in the uncertain database, which finds objects that are within a given distance from a query object in any subspace with high probability. To address this query, we extend our proposed pruning techniques for precise data to answer PSRQ in arbitrary subspaces. Extensive experiments demonstrate the performance of our proposed approaches.
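
The pivot-based pruning idea summarized above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes a single pivot, per-dimension scores defined as the absolute offset from that pivot, and a linear scan in place of the sorted score lists. All function names and the sample data are hypothetical. The bound follows from the triangle inequality applied in each dimension, |o_i - q_i| >= ||o_i - v_i| - |q_i - v_i||, so the sum of the p-th powers over the chosen dimensions never exceeds the true subspace distance raised to the power p.

```python
from typing import List, Sequence

Point = Sequence[float]


def subspace_lp_pow(a: Point, b: Point, dims: Sequence[int], p: float) -> float:
    """p-th power of the L_p distance between a and b, restricted to dims."""
    return sum(abs(a[i] - b[i]) ** p for i in dims)


def pivot_scores(obj: Point, pivot: Point) -> List[float]:
    """Assumed score definition: per-dimension absolute offset from the pivot."""
    return [abs(x - v) for x, v in zip(obj, pivot)]


def lower_bound_pow(obj_scores: Sequence[float], query_scores: Sequence[float],
                    dims: Sequence[int], p: float) -> float:
    """Lower bound on the p-th power of the subspace distance, computed
    from scores only via the per-dimension triangle inequality."""
    return sum(abs(obj_scores[i] - query_scores[i]) ** p for i in dims)


def subspace_range_query(data: Sequence[Point], scores: Sequence[Sequence[float]],
                         pivot: Point, query: Point, dims: Sequence[int],
                         radius: float, p: float = 2.0) -> List[int]:
    """Return indices of objects within `radius` of `query` in subspace `dims`."""
    q_scores = pivot_scores(query, pivot)
    r_pow = radius ** p
    result = []
    for idx, obj in enumerate(data):
        # Prune: the lower bound already exceeds the search radius.
        if lower_bound_pow(scores[idx], q_scores, dims, p) > r_pow:
            continue
        # Refine: compute the exact subspace distance for surviving candidates.
        if subspace_lp_pow(obj, query, dims, p) <= r_pow:
            result.append(idx)
    return result


# Hypothetical example data: four 3-D objects, pivot at the origin.
data = [(0.0, 0.0, 0.0), (1.0, 1.0, 5.0), (4.0, 4.0, 0.0), (0.5, 9.0, 0.5)]
pivot = (0.0, 0.0, 0.0)
scores = [pivot_scores(o, pivot) for o in data]  # precomputed once, offline

query = (1.0, 0.0, 1.0)
dims = [0, 2]  # the user-specified subspace: dimensions 0 and 2 only
print(subspace_range_query(data, scores, pivot, query, dims, radius=1.5))
```

Because the scores are computed per dimension, the same precomputed values serve any subspace chosen at query time, which is what makes this kind of bound usable when the number of possible subspaces is exponential.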

INDEX TERMS

Subspace similarity search, ${\rm L}_p$-norm, triangle inequality.

CITATION

Xiang Lian, "Subspace Similarity Search under ${\rm L}_p$-Norm",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 24, no. 2, pp. 365-382, February 2012, doi:10.1109/TKDE.2010.219