Issue No.03 - March (2008 vol.20)
pp: 321-336
ABSTRACT
Similarity-based search is a key component of many applications, such as multimedia retrieval, data mining, and web search and retrieval. Two important issues in similarity search are the design of a distance function to measure similarity and the improvement of search efficiency. Many distance functions have been proposed that attempt to closely mimic human recognition. Unfortunately, some of these well-designed distance functions do not satisfy the triangle inequality and are therefore non-metric. As a consequence, efficient retrieval using these non-metric distance functions is more challenging, since most existing index structures assume that the indexed distance functions are metric. In this paper, we address this problem by proposing an efficient method, local constant embedding (LCE), which divides the data set into disjoint groups so that, after constant shifting, the triangle inequality holds within each group. Furthermore, we design a pivot selection approach for the converted metric distance and create an index structure to speed up retrieval. Extensive experiments show that our method works well on various non-metric distance functions and improves retrieval efficiency by an order of magnitude over linear scan and existing retrieval approaches, with no false dismissals.
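The constant-shifting idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's LCE algorithm: it assumes a single group given as a full symmetric distance matrix, and the function names (min_shift_constant, shifted, satisfies_triangle) and the toy data are hypothetical. It finds the smallest constant c such that adding c to every distance between distinct objects restores the triangle inequality.

```python
from itertools import permutations


def min_shift_constant(dist):
    """Smallest c >= 0 such that d'(x, y) = d(x, y) + c (for x != y)
    satisfies the triangle inequality on this group."""
    n = len(dist)
    c = 0.0
    for x, y, z in permutations(range(n), 3):
        # Requirement: d(x,z) + c <= (d(x,y) + c) + (d(y,z) + c)
        # which gives: c >= d(x,z) - d(x,y) - d(y,z)
        c = max(c, dist[x][z] - dist[x][y] - dist[y][z])
    return c


def shifted(dist, c):
    """Add c to every off-diagonal entry; self-distances stay 0."""
    n = len(dist)
    return [[0.0 if i == j else dist[i][j] + c for j in range(n)]
            for i in range(n)]


def satisfies_triangle(dist, eps=1e-12):
    """Check d(x,z) <= d(x,y) + d(y,z) for all distinct triples."""
    n = len(dist)
    return all(dist[x][z] <= dist[x][y] + dist[y][z] + eps
               for x, y, z in permutations(range(n), 3))


# Toy non-metric distance: squared Euclidean distance on three collinear points.
points = [0.0, 1.0, 2.0]
d = [[(a - b) ** 2 for b in points] for a in points]

c = min_shift_constant(d)
d_shift = shifted(d, c)
```

Adding the same constant to every distance between distinct objects leaves their relative order unchanged. A single constant computed over a whole data set can be large, which is consistent with the abstract's motivation for shifting within disjoint groups rather than globally; the grouping step itself is not sketched here.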
INDEX TERMS
Multimedia databases, Query processing
CITATION
Lei Chen, Xiang Lian, "Efficient Similarity Search in Nonmetric Spaces with Local Constant Embedding", IEEE Transactions on Knowledge & Data Engineering, vol.20, no. 3, pp. 321-336, March 2008, doi:10.1109/TKDE.2007.190700