Extracting Protein Interactions from Text with the Unified AkaneRE Event Extraction System
July-September 2010 (vol. 7 no. 3)
pp. 442-453
Rune Sætre, University of Tokyo, Tokyo
Kazuhiro Yoshida, University of Tokyo, Tokyo
Makoto Miwa, University of Tokyo, Tokyo
Takuya Matsuzaki, University of Tokyo, Tokyo
Yoshinobu Kano, University of Tokyo, Tokyo
Jun'ichi Tsujii, University of Tokyo, Tokyo and University of Manchester, Manchester
Currently, relation extraction (RE) and event extraction (EE) are the two main streams of biological information extraction. In 2009, the majority of RE and EE research efforts centered on the BioCreative II.5 Protein-Protein Interaction (PPI) challenge and the "BioNLP event extraction shared task." Although these challenges took somewhat different approaches, they share the same ultimate goal of extracting bio-knowledge from the literature. This paper compares the two challenge task definitions and presents a unified system that was successfully applied in both of them, as well as in several other PPI extraction task settings. The AkaneRE system has three parts: a core engine for RE, a pool of modules for specific solutions, and a configuration language to adapt the system to different tasks. The core engine is based on machine learning, using either Support Vector Machines or statistical classifiers, with features extracted from the given training data. The specific modules solve tasks such as sentence boundary detection, tokenization, stemming, part-of-speech tagging, parsing, named entity recognition, generation of potential relations, generation of machine learning features for each relation, and finally, assignment of confidence scores and ranking of candidate relations. With these components, the AkaneRE system produces state-of-the-art results, and the system is freely available for academic purposes.
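
The last three pipeline stages named in the abstract (generation of potential relations, feature generation for each relation, and confidence scoring with ranking) can be illustrated with a minimal Python sketch. This is not AkaneRE's implementation: the function names, the single "words between the two mentions" feature, and the hand-set weights are all assumptions made for illustration, with the weights standing in for a classifier that would normally be trained on annotated data.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

def candidate_pairs(entities: List[str]) -> List[Pair]:
    """Generate every unordered pair of distinct protein mentions in a sentence."""
    return list(combinations(sorted(set(entities)), 2))

def features(tokens: List[str], a: str, b: str) -> Dict[str, float]:
    """Hypothetical feature set: one indicator per token between the two mentions."""
    i, j = sorted((tokens.index(a), tokens.index(b)))
    return {f"between={t.lower()}": 1.0 for t in tokens[i + 1:j]}

def score(feats: Dict[str, float], weights: Dict[str, float]) -> float:
    """Linear confidence score; unknown features contribute zero."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def rank_relations(tokens: List[str], entities: List[str],
                   weights: Dict[str, float]) -> List[Tuple[Pair, float]]:
    """Score every candidate pair and sort by descending confidence."""
    scored = [((a, b), score(features(tokens, a, b), weights))
              for a, b in candidate_pairs(entities)]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy input: tokenization and entity recognition are assumed to be done already.
tokens = "ProtA binds ProtB but not ProtC".split()
entities = ["ProtA", "ProtB", "ProtC"]
# Hand-set weights (an assumption); a real system would learn these.
weights = {"between=binds": 1.5, "between=not": -1.0}

ranking = rank_relations(tokens, entities, weights)
for pair, confidence in ranking:
    print(pair, confidence)
```

In this toy sentence the pair (ProtA, ProtB) ranks first because only the cue word "binds" lies between the two mentions. A full system would replace the single feature type with the parser-derived and lexical features the abstract alludes to.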

Index Terms:
Text mining, machine learning, language parsing and understanding, bioinformatics (genome or protein) databases.
Rune Sætre, Kazuhiro Yoshida, Makoto Miwa, Takuya Matsuzaki, Yoshinobu Kano, Jun'ichi Tsujii, "Extracting Protein Interactions from Text with the Unified AkaneRE Event Extraction System," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 3, pp. 442-453, July-Sept. 2010, doi:10.1109/TCBB.2010.46