Issue No.03 - July-September (2010 vol.7)
pp: 400-411
Alaa Abi-Haidar , Indiana University, Bloomington
Artemy Kolchinsky , Indiana University, Bloomington
Ahmed Abdeen Hamed , Indiana University, Bloomington
Luis M. Rocha , Indiana University, Bloomington
We participated (as Team 9) in the Article Classification Task of the BioCreative II.5 Challenge: binary classification of full-text documents relevant to protein-protein interaction. We used two distinct classifiers for the online and offline challenges: 1) the lightweight Variable Trigonometric Threshold (VTT) linear classifier that we successfully introduced in BioCreative 2 for binary classification of abstracts, and 2) a novel Naive Bayes classifier using features from the citation network of the relevant literature. We supplemented the supplied training data with full-text documents from the MIPS database. The lightweight VTT classifier was very competitive in this new full-text scenario: it was a top-performing submission in this task, as judged by the rank product of four performance measures: the Area Under the interpolated precision and recall Curve, Accuracy, Balanced F-Score, and Matthews Correlation Coefficient. The citation network classifier, novel in the biomedical text mining domain, was not a top-performing classifier in the challenge, but it performed above the central tendency of all submissions and therefore indicates a promising new avenue to investigate further in bibliome informatics.
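As a rough illustration of the evaluation described above, the following Python sketch computes Accuracy, Balanced F-Score, and Matthews Correlation Coefficient from confusion-matrix counts, and combines per-measure rankings into a rank product. The function names, the input layout, and the simple product-of-ranks formulation (with no tie handling) are assumptions made for illustration; they are not taken from the challenge's evaluation software, and the area under the precision-recall curve is omitted.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, balanced F-score (F1), and Matthews Correlation Coefficient
    from confusion-matrix counts. Undefined ratios are reported as 0.0."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


def rank_product(scores_by_measure):
    """Multiply each submission's rank across measures (rank 1 = best score).

    scores_by_measure maps measure name -> {submission name: score}, where a
    higher score is better. A lower product indicates a better overall
    submission. Ties are not treated specially in this sketch.
    """
    products = {}
    for scores in scores_by_measure.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, name in enumerate(ordered, start=1):
            products[name] = products.get(name, 1) * rank
    return products


if __name__ == "__main__":
    # Hypothetical counts and scores, for illustration only.
    print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
    print(rank_product({
        "accuracy": {"run_a": 0.85, "run_b": 0.80},
        "mcc": {"run_a": 0.70, "run_b": 0.72},
    }))
```

Ranking each measure separately before combining keeps measures with different scales (for example, MCC in [-1, 1] and Accuracy in [0, 1]) from dominating one another.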
Text mining, literature mining, binary classification, protein-protein interaction, citation network.
Alaa Abi-Haidar, Artemy Kolchinsky, Ahmed Abdeen Hamed, Luis M. Rocha, "Classification of Protein-Protein Interaction Full-Text Documents Using Text and Citation Network Features", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.7, no. 3, pp. 400-411, July-September 2010, doi:10.1109/TCBB.2010.55