Efficient Extraction of Protein-Protein Interactions from Full-Text Articles
July-September 2010 (vol. 7 no. 3)
pp. 481-494
Jörg Hakenberg, Arizona State University, Tempe, AZ
Robert Leaman, Arizona State University, Phoenix, AZ
Nguyen Ha Vo, Arizona State University, Tempe, AZ
Siddhartha Jonnalagadda, Arizona State University, Phoenix, AZ
Ryan Sullivan, Arizona State University, Phoenix, AZ
Christopher Miller, Arizona State University, Phoenix, AZ
Luis Tari, Hoffmann-La Roche Inc., Nutley, NJ
Chitta Baral, Arizona State University, Tempe, AZ
Graciela Gonzalez, Arizona State University, Phoenix, AZ
Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to ease the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to raise awareness of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions that handle a single full-text article in between 10 seconds and 2 minutes. We propose strategies to transfer document-level annotations to the sentence level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an f-score of 22 percent for finding protein interactions, and 43 percent for mapping proteins to UniProt IDs; disregarding species, f-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text.
All data and pattern sets as well as Java classes that extend third-party software are available as supplementary information (see Appendix).
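
As an illustration of the paraphrasing heuristics mentioned above, the following minimal Python sketch removes bracketed expressions from a sentence before pattern matching. It is not the authors' implementation (their supplementary code is in Java); the function name `strip_bracketed`, the regular expression, and the example sentence are assumptions made purely for illustration.

```python
import re

# Hypothetical helper, not taken from the paper: matches one innermost
# (...) or [...] group together with any whitespace directly before it.
_BRACKETED = re.compile(r"\s*(\([^()\[\]]*\)|\[[^()\[\]]*\])")


def strip_bracketed(sentence: str) -> str:
    """Remove parenthesized and square-bracketed spans, including nested ones."""
    previous = None
    # Repeat until nothing changes, so that nested brackets are peeled
    # off from the inside out.
    while previous != sentence:
        previous = sentence
        sentence = _BRACKETED.sub("", sentence)
    # Collapse any double spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", sentence).strip()


if __name__ == "__main__":
    # Invented example sentence, for illustration only.
    print(strip_bracketed("RAD51 (also known as RECA) binds BRCA2 [12] in vitro."))
```

A real system would likely need to guard against brackets that are part of a protein or chemical name, which this sketch removes indiscriminately.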


Index Terms:
Biology and genetics, text analysis, bioinformatics (genome or protein) databases.
Jörg Hakenberg, Robert Leaman, Nguyen Ha Vo, Siddhartha Jonnalagadda, Ryan Sullivan, Christopher Miller, Luis Tari, Chitta Baral, Graciela Gonzalez, "Efficient Extraction of Protein-Protein Interactions from Full-Text Articles," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 3, pp. 481-494, July-Sept. 2010, doi:10.1109/TCBB.2010.51