Issue No. 02 - Feb. (2013 vol. 25)
ISSN: 1041-4347
pp: 298-310
Xiaofang Zhou, The University of Queensland, Brisbane
Xiaoyong Du, Renmin University of China, Beijing
Liwei Wang, Wuhan University, Wuhan
Laurianne Sitbon, Queensland University of Technology, Brisbane
Zhixu Li, The University of Queensland, Brisbane
ABSTRACT
In this paper, we propose a new type of dictionary-based entity recognition problem, named Approximate Membership Localization (AML). The popular Approximate Membership Extraction (AME) provides full coverage of the true matched substrings in a given document, but it also produces many redundant matches, which lower the efficiency of the AME process and degrade the performance of real-world applications that use the extracted substrings. The AML problem aims to locate nonoverlapping substrings, which better approximate the true matched substrings without generating overlapping redundancies. To perform AML efficiently, we propose the optimized algorithm P-Prune, which prunes a large portion of the overlapping redundant matched substrings before they are generated. Our study on several real-world data sets demonstrates the efficiency of P-Prune over a baseline method. We also study AML as applied to a proposed web-based join framework, a search-based approach that joins two tables using dictionary-based entity recognition from web documents. The results not only show the advantage of AML over AME, but also demonstrate the effectiveness of our search-based approach.
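The difference between AME and AML can be illustrated with a minimal Python sketch. This is not the P-Prune algorithm from the paper; it is a naive baseline in the spirit of "generate every match, then drop the overlaps". The token-level Jaccard similarity, the greedy best-score-first selection, and the names `jaccard`, `ame`, and `aml` are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

# (start token index, end token index (exclusive), dictionary entry, similarity)
Match = Tuple[int, int, str, float]


def jaccard(a: List[str], b: List[str]) -> float:
    """Token-set Jaccard similarity (an assumed similarity function)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def ame(tokens: List[str], dictionary: List[str],
        threshold: float, max_len: int) -> List[Match]:
    """AME-style extraction: every window whose similarity to some
    dictionary entry reaches the threshold, overlaps included."""
    matches: List[Match] = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            window = tokens[start:end]
            for entry in dictionary:
                score = jaccard(window, entry.split())
                if score >= threshold:
                    matches.append((start, end, entry, score))
    return matches


def aml(tokens: List[str], dictionary: List[str],
        threshold: float, max_len: int) -> List[Match]:
    """AML-style localization (naive): keep the best-scoring matches
    that do not overlap any match already kept."""
    candidates = sorted(ame(tokens, dictionary, threshold, max_len),
                        key=lambda m: (-m[3], m[0], m[1]))
    chosen: List[Match] = []
    for m in candidates:
        # Two spans are disjoint if one ends before the other starts.
        if all(m[1] <= c[0] or m[0] >= c[1] for c in chosen):
            chosen.append(m)
    return sorted(chosen)


if __name__ == "__main__":
    doc = "the university of queensland in brisbane".split()
    entries = ["university of queensland"]
    print(ame(doc, entries, threshold=0.6, max_len=4))
    print(aml(doc, entries, threshold=0.6, max_len=4))
```

On this toy input, the AME-style pass returns five overlapping windows around the same mention, while the AML-style pass keeps only the exact span. The sketch still pays the cost of generating every candidate first, which is the overhead the abstract says P-Prune avoids by pruning before generation.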
INDEX TERMS
Dictionaries, Redundancy, Approximation methods, Approximation algorithms, Correlation, Web search, Pattern matching, AML, Web-based join, approximate membership location
CITATION
Xiaofang Zhou, Xiaoyong Du, Liwei Wang, Laurianne Sitbon, Zhixu Li, "AML: Efficient Approximate Membership Localization within a Web-Based Join Framework", IEEE Transactions on Knowledge & Data Engineering, vol. 25, no. 2, pp. 298-310, Feb. 2013, doi:10.1109/TKDE.2011.178