2015 IEEE 31st International Conference on Data Engineering (ICDE) (2015)
Seoul, South Korea
April 13, 2015 to April 17, 2015
ISBN: 978-1-4799-7964-6
pp: 519-530
Jin Wang, Department of Computer Science and Technology, TNList, Tsinghua University, Beijing 100084, China
Guoliang Li, Department of Computer Science and Technology, TNList, Tsinghua University, Beijing 100084, China
Dong Deng, Department of Computer Science and Technology, TNList, Tsinghua University, Beijing 100084, China
Yong Zhang, Department of Computer Science and Technology, TNList, Tsinghua University, Beijing 100084, China
Jianhua Feng, Department of Computer Science and Technology, TNList, Tsinghua University, Beijing 100084, China
ABSTRACT
String similarity search is a fundamental operation in data cleaning and integration. It has two variants: threshold-based string similarity search and top-k string similarity search. Existing algorithms are efficient for one variant or the other; most cannot support both. To address this limitation, we propose a unified framework. We first recursively partition strings into disjoint segments and build a hierarchical segment tree index (HS-Tree) on top of the segments. We then use the HS-Tree to support similarity search. For threshold-based search, we identify appropriate tree nodes based on the threshold to answer the query and devise an efficient algorithm (HS-Search). For top-k search, we identify promising strings that are likely to be similar to the query, use these strings to estimate an upper bound for pruning dissimilar strings, and propose an algorithm (HS-Topk). We also develop effective pruning techniques to further improve performance. Experimental results on real-world datasets show that our method achieves high performance on both problems and significantly outperforms state-of-the-art algorithms.
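The recursive partitioning and threshold-based filtering described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' HS-Tree or HS-Search: it assumes edit distance as the similarity function (the abstract does not name one), splits each string by recursive halving, scans the strings linearly instead of using an index, and all function names are hypothetical. The filter relies on the pigeonhole argument that tau edits can touch at most tau of a string's disjoint segments, so when a string is split into more than tau segments, at least one must appear unchanged as a substring of any query within distance tau.

```python
def edit_distance(a, b):
    """Classic dynamic-programming edit distance, keeping two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def hierarchical_segments(s, max_level):
    """Level i holds 2**i disjoint segments of s, built by recursive halving.

    Assumption: halving is one simple partitioning scheme chosen for
    illustration; the abstract does not specify the scheme.
    """
    levels = [[s]]
    for _ in range(max_level):
        nxt = []
        for seg in levels[-1]:
            mid = len(seg) // 2
            nxt.extend([seg[:mid], seg[mid:]])
        levels.append(nxt)
    return levels


def threshold_search(strings, query, tau):
    """Return the strings within edit distance tau of query (filter, then verify)."""
    # Pick the shallowest level whose segment count exceeds tau.
    level = 0
    while 2 ** level <= tau:
        level += 1
    results = []
    for s in strings:
        segments = hierarchical_segments(s, level)[level]
        # Pigeonhole filter: some segment must occur verbatim in the query.
        if any(seg in query for seg in segments):
            if edit_distance(s, query) <= tau:
                results.append(s)
    return results
```

For example, `threshold_search(["vldb", "icde", "sigmod", "pvldb"], "vldbj", 1)` keeps only the strings that share an intact level-1 segment with the query before verifying them with the full edit-distance computation. A real index would store the segments in a tree so that candidate strings are looked up rather than scanned.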
INDEX TERMS
Indexes, Heuristic algorithms, Search problems, Blogs, Partitioning algorithms, Upper bound, Silicon
CITATION

J. Wang, G. Li, D. Deng, Y. Zhang and J. Feng, "Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search," 2015 IEEE 31st International Conference on Data Engineering (ICDE), Seoul, South Korea, 2015, pp. 519-530.
doi:10.1109/ICDE.2015.7113311