Issue No. 03 - July-September (2010 vol. 7)
ISSN: 1545-5963
pp: 421-427
Man Lan, East China Normal University, Shanghai
Jian Su, Institute for Infocomm Research, Singapore
ABSTRACT
The selection of protein interaction documents is an important application in biology research and has a direct impact on the quality of downstream BioNLP applications, e.g., information extraction and retrieval, summarization, and question answering. The BioCreative II.5 Challenge Article Categorization Task (ACT) is a binary text classification task: determine whether a given structured full-text article contains protein interaction information. This may be the first attempt at classifying full-text protein interaction documents in the wider community. In this paper, we compare and evaluate the effectiveness of different section types in full-text articles for text classification. Moreover, in practice, the small number of true-positive samples leads to unstable performance and an unreliable classifier. Previous research on learning with skewed class distributions has altered the class distribution using oversampling and downsampling. We also investigate skewed protein interaction classification and analyze the effect of several related factors, including the choice of external sources, oversampling of training sets, and the choice of classifier. Our results show that 1) a full-text biomedical article contains a wealth of scientific information important to users that may not be completely represented by abstracts and/or keywords, and using it improves classification accuracy, and 2) reinforcing true-positive samples significantly increases the accuracy and stability of classification.
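The oversampling idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' actual procedure, so this is generic random oversampling of the positive class, not the method used in the paper; the function name `oversample_positives` and its `ratio` and `seed` parameters are hypothetical.

```python
import random


def oversample_positives(samples, labels, ratio=1.0, seed=0):
    """Randomly duplicate positive samples (label 1) until the number of
    positives is at least ratio * number of negatives (label 0).

    Generic random oversampling, assumed here for illustration only.
    """
    rng = random.Random(seed)
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    if not pos or not neg:
        # Nothing to rebalance if either class is empty.
        return list(samples), list(labels)
    target = int(ratio * len(neg))
    # Draw duplicates with replacement from the existing positives.
    extra = [rng.choice(pos) for _ in range(max(0, target - len(pos)))]
    new_samples = neg + pos + extra
    new_labels = [0] * len(neg) + [1] * (len(pos) + len(extra))
    return new_samples, new_labels


if __name__ == "__main__":
    docs = ["ppi_article", "other_1", "other_2", "other_3", "other_4"]
    tags = [1, 0, 0, 0, 0]
    balanced_docs, balanced_tags = oversample_positives(docs, tags)
    print(len(balanced_docs), sum(balanced_tags))
```

Oversampling should be applied to the training split only, so that duplicated positives never leak into the evaluation data.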
INDEX TERMS
Protein interaction, text classification, full-text article, BioCreative.
CITATION
Man Lan, Jian Su, "Empirical Investigations into Full-Text Protein Interaction Article Categorization Task (ACT) in the BioCreative II.5 Challenge", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 3, pp. 421-427, July-September 2010, doi:10.1109/TCBB.2010.49