Issue No.02 - Feb. (2013 vol.25)
pp: 298-310
Zhixu Li , The University of Queensland, Brisbane
Laurianne Sitbon , Queensland University of Technology, Brisbane
Liwei Wang , Wuhan University, Wuhan
Xiaofang Zhou , The University of Queensland, Brisbane
Xiaoyong Du , Renmin University of China, Beijing
ABSTRACT
In this paper, we propose a new type of dictionary-based entity recognition problem, named Approximate Membership Localization (AML). The popular Approximate Membership Extraction (AME) provides full coverage of the true matched substrings in a given document, but it also produces many redundant matches, which lower the efficiency of the AME process and degrade the performance of real-world applications that use the extracted substrings. The AML problem aims to locate nonoverlapped substrings, which are a better approximation of the true matched substrings and avoid overlapped redundancies. To perform AML efficiently, we propose the optimized algorithm P-Prune, which prunes a large part of the overlapped redundant matched substrings before generating them. Our study using several real-world data sets demonstrates the efficiency of P-Prune over a baseline method. We also study AML as applied to a proposed web-based join framework, a search-based approach that joins two tables using dictionary-based entity recognition from web documents. The results not only show the advantage of AML over AME, but also demonstrate the effectiveness of our search-based approach.
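The difference between AME and AML described above can be illustrated with a small sketch. It is not the P-Prune algorithm: it generates every candidate first and filters afterwards, whereas P-Prune is said to prune before generation. The token-level Jaccard similarity, the threshold of 0.6, the window limit, and the greedy highest-score selection are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Tuple

# (start token, end token exclusive, dictionary entry, similarity score)
Match = Tuple[int, int, str, float]


def jaccard(a: List[str], b: List[str]) -> float:
    """Token-set Jaccard similarity (an assumed similarity function)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def ame(tokens: List[str], dictionary: List[str],
        threshold: float, max_len: int) -> List[Match]:
    """AME-style extraction: every window that approximately matches an entry."""
    matches: List[Match] = []
    for start in range(len(tokens)):
        last_end = min(len(tokens), start + max_len)
        for end in range(start + 1, last_end + 1):
            window = tokens[start:end]
            for entry in dictionary:
                score = jaccard(window, entry.lower().split())
                if score >= threshold:
                    matches.append((start, end, entry, score))
    return matches


def aml_greedy(matches: List[Match]) -> List[Match]:
    """AML-style localization, simplified: keep the best-scoring matches
    that do not overlap any match already kept (greedy, hypothetical)."""
    chosen: List[Match] = []
    for m in sorted(matches, key=lambda m: (-m[3], m[0], m[1])):
        if all(m[1] <= c[0] or m[0] >= c[1] for c in chosen):
            chosen.append(m)
    return sorted(chosen)


tokens = "the university of queensland in brisbane".split()
dictionary = ["University of Queensland", "Brisbane"]

all_matches = ame(tokens, dictionary, threshold=0.6, max_len=4)
localized = aml_greedy(all_matches)
print(len(all_matches), "AME matches;", len(localized), "AML matches")
for start, end, entry, score in localized:
    print(entry, "->", " ".join(tokens[start:end]), round(score, 2))
```

On this toy input the AME-style step returns several overlapping windows around the same mention, while the localization step keeps one nonoverlapped substring per mention, which is the redundancy the abstract refers to.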
INDEX TERMS
Dictionaries, Redundancy, Approximation methods, Approximation algorithms, Correlation, Web search, Pattern matching, AML, Web-based join, approximate membership location
CITATION
Zhixu Li, Laurianne Sitbon, Liwei Wang, Xiaofang Zhou, Xiaoyong Du, "AML: Efficient Approximate Membership Localization within a Web-Based Join Framework", IEEE Transactions on Knowledge & Data Engineering, vol.25, no. 2, pp. 298-310, Feb. 2013, doi:10.1109/TKDE.2011.178