Issue No.03 - July-September (2010 vol.7)
pp: 421-427
Man Lan , East China Normal University, Shanghai
Jian Su , Institute for Infocomm Research, Singapore
The selection of protein interaction documents is an important application in biology research and has a direct impact on the quality of downstream BioNLP applications such as information extraction and retrieval, summarization, and question answering (QA). The BioCreative II.5 Challenge Article Categorization Task (ACT) is a binary text classification task: determine whether a given structured full-text article contains protein interaction information. This may be the first community-wide attempt at classifying full-text protein interaction documents. In this paper, we compare and evaluate the effectiveness of different section types in full-text articles for text classification. Moreover, in practice, the small number of true-positive samples leads to unstable performance and an unreliable classifier. Previous research on learning with skewed class distributions has altered the class distribution using oversampling and downsampling. We also investigate skewed protein interaction classification and analyze the effect of several related choices, including external sources, oversampling of the training set, and classifiers. We report on these factors to show that 1) a full-text biomedical article contains a wealth of scientific information important to users that may not be completely represented by abstracts and/or keywords, and that this information improves classification accuracy, and 2) reinforcing true-positive samples significantly increases the accuracy and stability of classification.
Protein interaction, text classification, full-text article, BioCreative.
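As an illustration of the oversampling idea mentioned in the abstract, the following is a minimal sketch of plain random oversampling: positive samples are duplicated at random until both classes have the same size. The abstract does not specify how the authors reinforced true-positive samples, so this is not their procedure; the function name `oversample_minority`, its parameters, and the toy data are hypothetical.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, positive_label=1, seed=0):
    """Randomly duplicate positive samples until the two classes are balanced.

    Hypothetical helper for illustration only; it is not the method
    used in the paper, whose details the abstract does not give.
    """
    rng = random.Random(seed)
    positives = [s for s, y in zip(samples, labels) if y == positive_label]
    n_negatives = len(labels) - len(positives)

    # Nothing to do if there are no positives or they are not the minority.
    if not positives or len(positives) >= n_negatives:
        return list(samples), list(labels)

    # Draw duplicates (with replacement) to close the gap between classes.
    extra = [rng.choice(positives) for _ in range(n_negatives - len(positives))]
    return list(samples) + extra, list(labels) + [positive_label] * len(extra)


# Toy data: one positive article among five (assumed, not from the paper).
docs = ["ppi article", "other 1", "other 2", "other 3", "other 4"]
labels = [1, 0, 0, 0, 0]
balanced_docs, balanced_labels = oversample_minority(docs, labels)
print(Counter(balanced_labels))
```

Duplicating samples only rebalances the training set; the test set should be left at its natural class distribution so that reported accuracy stays comparable.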
Man Lan, Jian Su, "Empirical Investigations into Full-Text Protein Interaction Article Categorization Task (ACT) in the BioCreative II.5 Challenge", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.7, no. 3, pp. 421-427, July-September 2010, doi:10.1109/TCBB.2010.49