Issue No.01 - January (2012 vol.24)
pp: 86-99
Luis Tari, Arizona State University, Tempe
Phan Huy Tu, Arizona State University, Tempe
Jörg Hakenberg, Arizona State University, Tempe
Yi Chen, Arizona State University, Tempe
Tran Cao Son, New Mexico State University, Las Cruces
Graciela Gonzalez, Arizona State University, Tempe
Chitta Baral, Arizona State University, Tempe
Information extraction systems are traditionally implemented as a pipeline of special-purpose processing modules targeting the extraction of a particular kind of information. A major drawback of such an approach is that whenever a new extraction goal emerges or a module is improved, extraction has to be reapplied from scratch to the entire text corpus, even though only a small part of the corpus might be affected. In this paper, we describe a novel approach for information extraction in which extraction needs are expressed in the form of database queries, which are evaluated and optimized by database systems. Using database queries for information extraction enables generic extraction and minimizes reprocessing of data by performing incremental extraction to identify which parts of the data are affected by a change in components or goals. Furthermore, our approach provides automated query generation components so that casual users do not have to learn the query language in order to perform extraction. To demonstrate the feasibility of our incremental extraction approach, we performed experiments to highlight two important aspects of an information extraction system: efficiency and quality of extraction results. Our experiments show that when a new module is deployed, our incremental extraction approach reduces the processing time by 89.64 percent compared with a traditional pipeline approach. Applied to a corpus of 17 million biomedical abstracts, our methods deliver query performance that is efficient enough for real-time applications. Our experiments also show that our approach achieves high-quality extraction results.
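The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' system, schema, or query language: it assumes SQLite, plain SQL, a toy dictionary tagger standing in for a processing module, and hypothetical table and function names (sentence, mention, tag, gene_pairs, drug_gene_pairs). Intermediate module output is stored once in the database, each extraction goal is a query over that stored output, and deploying a new or improved module touches only the sentences it affects.

```python
import sqlite3

# Toy corpus (made-up sentences, used only for illustration).
SENTENCES = [
    (1, "RAD51 interacts with BRCA2 in human cells."),
    (2, "Aspirin inhibits COX1 activity."),
    (3, "The weather was pleasant in Tempe."),
]


def build_db(sentences):
    """Store the corpus once; module output is kept in a separate table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sentence (sid INTEGER PRIMARY KEY, text TEXT NOT NULL);
        CREATE TABLE mention (sid INTEGER, kind TEXT, term TEXT,
                              UNIQUE (sid, kind, term));
        """
    )
    conn.executemany("INSERT INTO sentence VALUES (?, ?)", sentences)
    return conn


def tag(conn, kind, terms):
    """Hypothetical dictionary tagger. Returns the ids of sentences it touched."""
    touched = set()
    for term in terms:
        # A query selects only the sentences that contain the term.
        rows = conn.execute(
            "SELECT sid FROM sentence WHERE instr(text, ?) > 0", (term,)
        ).fetchall()
        for (sid,) in rows:
            conn.execute(
                "INSERT OR IGNORE INTO mention VALUES (?, ?, ?)", (sid, kind, term)
            )
            touched.add(sid)
    return touched


def gene_pairs(conn):
    """Extraction goal 1: pairs of genes mentioned in the same sentence."""
    return conn.execute(
        """
        SELECT a.sid, a.term, b.term
        FROM mention a JOIN mention b ON a.sid = b.sid
        WHERE a.kind = 'gene' AND b.kind = 'gene' AND a.term < b.term
        ORDER BY a.sid
        """
    ).fetchall()


def drug_gene_pairs(conn):
    """Extraction goal 2: a new goal is a new query, not a new pipeline run."""
    return conn.execute(
        """
        SELECT d.sid, d.term, g.term
        FROM mention d JOIN mention g ON d.sid = g.sid
        WHERE d.kind = 'drug' AND g.kind = 'gene'
        ORDER BY d.sid
        """
    ).fetchall()


conn = build_db(SENTENCES)
tag(conn, "gene", ["RAD51", "BRCA2"])       # initial module run
print(gene_pairs(conn))
print(tag(conn, "gene", ["COX1"]))          # improved lexicon: affected sentence ids
print(tag(conn, "drug", ["Aspirin"]))       # newly deployed module
print(drug_gene_pairs(conn))
```

In this sketch, extending the gene lexicon or adding the drug tagger updates rows only for the sentences that contain the new terms, and the earlier gene-pair results stay valid without being recomputed. How a real system identifies the affected portion of a corpus is more involved than a substring match.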
Text mining, query languages, information storage and retrieval.
Luis Tari, Phan Huy Tu, Jörg Hakenberg, Yi Chen, Tran Cao Son, Graciela Gonzalez, Chitta Baral, "Incremental Information Extraction Using Relational Databases", IEEE Transactions on Knowledge & Data Engineering, vol.24, no. 1, pp. 86-99, January 2012, doi:10.1109/TKDE.2010.214