2013 IEEE 29th International Conference on Data Engineering (ICDE) (2013)
Brisbane, Australia
Apr. 8, 2013 to Apr. 12, 2013
ISSN: 1063-6382
ISBN: 978-1-4673-4909-3
pp: 925-936
Dong Deng, Dept. of Comput. Sci., Tsinghua Univ., Beijing, China
Guoliang Li, Dept. of Comput. Sci., Tsinghua Univ., Beijing, China
Jianhua Feng, Dept. of Comput. Sci., Tsinghua Univ., Beijing, China
Wen-Syan Li, SAP Labs., Shanghai, China
ABSTRACT
String similarity search is a fundamental operation in many areas, such as data cleaning, information retrieval, and bioinformatics. In this paper we study the problem of top-k string similarity search with edit-distance constraints, which, given a collection of strings and a query string, returns the top-k strings with the smallest edit distances to the query string. Existing methods usually try different edit-distance thresholds and select an appropriate threshold to find the top-k answers. However, selecting an appropriate threshold is rather expensive. To address this problem, we propose a progressive framework that improves the traditional dynamic-programming algorithm for computing edit distance. We prune unnecessary entries in the dynamic-programming matrix and compute only the pivotal entries. We extend our techniques to support top-k similarity search. We develop a range-based method that groups the pivotal entries to avoid duplicated computations. Experimental results show that our method achieves high performance and significantly outperforms state-of-the-art approaches on real-world datasets.
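To make the problem statement concrete, the following is a minimal Python sketch of a naive baseline: it computes the full dynamic-programming edit distance between the query and every string, then keeps the k smallest. It is not the progressive or range-based method proposed in the paper, which prunes entries of the dynamic-programming matrix; the function names `edit_distance` and `top_k_similar` are hypothetical and chosen only for illustration.

```python
import heapq


def edit_distance(s, t):
    """Traditional dynamic-programming edit distance, computed row by row.

    Illustrative baseline only: every matrix entry is computed, with no
    pruning of the kind the paper describes.
    """
    prev = list(range(len(t) + 1))  # row 0: distance from "" to t[:j]
    for i, cs in enumerate(s, start=1):
        curr = [i]  # column 0: distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def top_k_similar(strings, query, k):
    """Return the k (distance, string) pairs closest to query (hypothetical helper)."""
    scored = ((edit_distance(s, query), s) for s in strings)
    return heapq.nsmallest(k, scored)
```

For example, `top_k_similar(["srajit", "seraji", "surajit"], "surajit", 2)` scans all three strings and returns the two with the smallest edit distance. The cost of this baseline grows with the collection size times the product of the string lengths, which is the kind of overhead that pruning the dynamic-programming matrix aims to avoid.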
INDEX TERMS
Search problems, Indexes, Time complexity, Cleaning, Bioinformatics
CITATION

Dong Deng, Guoliang Li, Jianhua Feng and Wen-Syan Li, "Top-k string similarity search with edit-distance constraints," 2013 IEEE 29th International Conference on Data Engineering (ICDE), Brisbane, QLD, 2013, pp. 925-936.
doi:10.1109/ICDE.2013.6544886