Issue No.03 - July-September (2010 vol.7)
pp: 428-441
Yifei Chen , Vrije Universiteit Brussel, Brussels
Feng Liu , Vrije Universiteit Brussel, Brussels
Bernard Manderick , Vrije Universiteit Brussel, Brussels
ABSTRACT
This paper describes the Biological Literature Miner (BioLMiner) system and its implementation. BioLMiner is a text mining system that extracts useful information from biological literature: gene and protein names, normalized gene and protein names, and protein-protein interaction pairs. It has three main subsystems arranged in a pipeline: a gene mention recognizer (GMRer), a gene normalizer (GNer), and a protein-protein interaction pair extractor (PPIEor). All three subsystems are built on machine learning techniques, including support vector machines (SVMs) and conditional random fields (CRFs), together with carefully designed informative features. BioLMiner also makes use of some biology-specific resources and existing natural language processing tools. To evaluate BioLMiner and compare it with other systems, we adapted it to participate in two tasks of the BioCreative II.5 challenge: the interaction normalization task (INT), using GNer, and the interaction pair task (IPT), using PPIEor. Our system is among the highest performing systems on both tasks. This result suggests that GMRer provides good support for the INT and IPT, although its performance is not evaluated directly, and that the methods developed in GNer and PPIEor extend well to the BioCreative II.5 tasks.
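The three-stage pipeline described in the abstract can be pictured with a minimal Python sketch. This is only an illustration of the data flow (mention recognition, then normalization, then pair extraction), not the authors' implementation: the SVM and CRF models are replaced by a toy dictionary lookup and naive co-occurrence pairing, and every name here (`TOY_LEXICON`, `recognize_mentions`, `normalize`, `extract_pairs`, `run_pipeline`, the `ID:000x` identifiers) is hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: surface form -> made-up identifier.
# A real system would use learned models and curated resources instead.
TOY_LEXICON: Dict[str, str] = {
    "genea": "ID:0001",
    "geneb": "ID:0002",
    "genec": "ID:0003",
}


@dataclass(frozen=True)
class Mention:
    text: str      # token as it appears in the sentence
    position: int  # token index within the sentence


def _key(token: str) -> str:
    """Lowercase a token and strip surrounding punctuation for lookup."""
    return token.strip(".,;()").lower()


def recognize_mentions(tokens: List[str]) -> List[Mention]:
    """Stage 1 stand-in (GMRer role): flag tokens found in the toy lexicon."""
    return [Mention(tok, i) for i, tok in enumerate(tokens)
            if _key(tok) in TOY_LEXICON]


def normalize(mentions: List[Mention]) -> List[str]:
    """Stage 2 stand-in (GNer role): map mentions to unique, sorted identifiers."""
    return sorted({TOY_LEXICON[_key(m.text)] for m in mentions})


def extract_pairs(ids: List[str]) -> List[Tuple[str, str]]:
    """Stage 3 stand-in (PPIEor role): pair every two co-occurring identifiers."""
    return list(combinations(ids, 2))


def run_pipeline(sentence: str) -> Dict[str, list]:
    """Run the three stages in order, each consuming the previous stage's output."""
    tokens = sentence.split()
    mentions = recognize_mentions(tokens)
    ids = normalize(mentions)
    pairs = extract_pairs(ids)
    return {"mentions": mentions, "ids": ids, "pairs": pairs}


if __name__ == "__main__":
    result = run_pipeline("GeneA binds GeneB in the presence of GeneC.")
    print(result["ids"])
    print(result["pairs"])
```

The point of the sketch is the dependency between stages: errors in mention recognition propagate to normalization and pair extraction, which is why the quality of the first stage matters for the INT and IPT results even when it is not scored on its own.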
INDEX TERMS
Text mining, machine learning, interaction protein normalization, interaction pair extraction, BioCreative II.5 challenge.
CITATION
Yifei Chen, Feng Liu, Bernard Manderick, "BioLMiner System: Interaction Normalization Task and Interaction Pair Task in the BioCreative II.5 Challenge", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.7, no. 3, pp. 428-441, July-September 2010, doi:10.1109/TCBB.2010.47