Web-Age Information Management, International Conference on (2008)
July 20, 2008 to July 22, 2008
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/WAIM.2008.17
Data collections often contain inconsistencies that arise for a variety of reasons, and it is desirable to identify and resolve them efficiently. Similarity queries are commonly used in data cleaning to match similar data. In this work we concentrate on the following problem of approximate string matching based on edit distance: given a collection of strings, how can we find the strings similar to a given string, or the strings in another collection whose similarity exceeds some threshold? We propose an NFA-based (Nondeterministic Finite-state Automaton) method for effective approximate string search. We model the strings as a trie and construct an NFA on top of the trie. We identify similar strings by running the NFA, based on tree automata theory. Moreover, we propose a grouped trie that further improves the performance of similarity search by incorporating effective pruning techniques. We have implemented our method, and the experimental results show that our approach achieves high performance and outperforms existing state-of-the-art methods by orders of magnitude.
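As a rough illustration of the problem setting only, the sketch below performs edit-distance search over a plain trie by carrying one row of the Levenshtein dynamic-programming table down each branch and pruning a branch once every entry in its row exceeds the threshold. This is a well-known textbook technique and is not the NFA construction or the grouped trie proposed in the paper; the names `TrieNode`, `build_trie`, and `search` are hypothetical and chosen for this example.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One trie node; `word` is set when a stored string ends here."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None


def build_trie(words: List[str]) -> TrieNode:
    """Insert every string into a trie and return its root (hypothetical helper)."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def search(root: TrieNode, query: str, max_dist: int) -> List[Tuple[str, int]]:
    """Return (string, edit distance) for stored strings within max_dist of query."""
    results: List[Tuple[str, int]] = []
    # Row for the empty prefix: distance to query[:j] is j insertions.
    first_row = list(range(len(query) + 1))
    if root.word is not None and first_row[-1] <= max_dist:
        results.append((root.word, first_row[-1]))
    stack = [(child, ch, first_row) for ch, child in root.children.items()]
    while stack:
        node, ch, prev = stack.pop()
        # Extend the DP table by one row for the character on this trie edge.
        row = [prev[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,      # insertion
                           prev[j] + 1,         # deletion
                           prev[j - 1] + cost))  # match or substitution
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        # Prune: if every entry exceeds the threshold, no descendant can match.
        if min(row) <= max_dist:
            stack.extend((c, k, row) for k, c in node.children.items())
    return sorted(results)


if __name__ == "__main__":
    trie = build_trie(["data", "date", "gate", "database", "cat"])
    print(search(trie, "data", 1))
```

Because strings that share a prefix share the rows computed for that prefix, the trie avoids recomputing the edit distance from scratch for each stored string; the threshold check is what lets whole subtrees be skipped.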
similarity search, similarity join, indexing
X. Liu, J. Feng, L. Zhou and G. Li, "Effective Indices for Efficient Approximate String Search and Similarity Join," Web-Age Information Management, International Conference on (WAIM), pp. 127-134, 2008.