An IR-Aided Machine Learning Framework for the BioCreative II.5 Challenge
July-September 2010 (vol. 7 no. 3)
pp. 454-461
Yonggang Cao, University of Wisconsin-Milwaukee, Milwaukee
Zuofeng Li, University of Wisconsin-Milwaukee, Milwaukee
Feifan Liu, University of Wisconsin-Milwaukee, Milwaukee
Shashank Agarwal, University of Wisconsin-Milwaukee, Milwaukee
Qing Zhang, University of Wisconsin-Milwaukee, Milwaukee
Hong Yu, University of Wisconsin-Milwaukee, Milwaukee
We, a team at the University of Wisconsin-Milwaukee, developed an information retrieval (IR) and machine learning framework. The framework requires only the standardized training data and depends on minimal external knowledge resources and minimal parsing. Within this framework, we built our text mining systems and participated, for the first time, in all three BioCreative II.5 Challenge tasks. Our systems ranked among the top five teams by raw F1 score in all three tasks and placed third by homonym ortholog F1 score in the INT task. These results demonstrate that our IR-based framework is efficient, robust, and potentially scalable.

Index Terms:
Bioinformatics (genome or protein) databases, information search and retrieval, systems and software, text mining.
Citation:
Yonggang Cao, Zuofeng Li, Feifan Liu, Shashank Agarwal, Qing Zhang, Hong Yu, "An IR-Aided Machine Learning Framework for the BioCreative II.5 Challenge," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 3, pp. 454-461, July-Sept. 2010, doi:10.1109/TCBB.2010.56