Classification of Protein-Protein Interaction Full-Text Documents Using Text and Citation Network Features
July-September 2010 (vol. 7 no. 3)
pp. 400-411
Artemy Kolchinsky, Indiana University, Bloomington
Alaa Abi-Haidar, Indiana University, Bloomington
Jasleen Kaur, Indiana University, Bloomington
Ahmed Abdeen Hamed, Indiana University, Bloomington
Luis M. Rocha, Indiana University, Bloomington
We participated (as Team 9) in the Article Classification Task of the BioCreative II.5 Challenge: binary classification of full-text documents according to their relevance for protein-protein interaction. We used two distinct classifiers for the online and offline challenges: 1) the lightweight Variable Trigonometric Threshold (VTT) linear classifier, which we introduced in BioCreative II for binary classification of abstracts, and 2) a novel Naive Bayes classifier using features from the citation network of the relevant literature. We supplemented the supplied training data with full-text documents from the MIPS database. The lightweight VTT classifier was very competitive in this new full-text scenario: it was a top-performing submission in this task, as judged by the rank product of the Area Under the interpolated precision and recall Curve, Accuracy, Balanced F-Score, and Matthews Correlation Coefficient performance measures. The citation network classifier, which is novel in the biomedical text mining domain, was not a top-performing classifier in the challenge, but it performed above the central tendency of all submissions and therefore indicates a promising new avenue for further investigation in bibliome informatics.
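To make the evaluation criteria named in the abstract concrete, the following is a minimal Python sketch of three of the four measures (Accuracy, balanced F-Score, and Matthews Correlation Coefficient) computed from a confusion matrix, together with a rank product across systems. It is an illustration under stated assumptions, not the challenge's official scoring code: the function names, the example inputs, the tie handling, and the choice to report the rank product as a geometric mean of ranks are all assumptions made here. The area under the interpolated precision and recall curve is omitted because it needs ranked outputs rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, balanced F-score (F1) and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Balanced F-score: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # Matthews Correlation Coefficient; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


def rank_product(scores_by_system):
    """Geometric mean of each system's rank over all measures.

    scores_by_system maps a (hypothetical) system name to a dict of
    measure -> value, where higher values are better. Rank 1 is best,
    so a lower rank product means a better overall system.
    Assumption: ties are broken by input order, not averaged.
    """
    systems = list(scores_by_system)
    measures = list(next(iter(scores_by_system.values())))
    product = {name: 1.0 for name in systems}
    for measure in measures:
        ordered = sorted(
            systems, key=lambda name: scores_by_system[name][measure], reverse=True
        )
        for rank, name in enumerate(ordered, start=1):
            product[name] *= rank
    return {name: product[name] ** (1.0 / len(measures)) for name in systems}


if __name__ == "__main__":
    # Made-up counts and scores, for illustration only.
    print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
    print(rank_product({
        "A": {"accuracy": 0.9, "f1": 0.8},
        "B": {"accuracy": 0.8, "f1": 0.9},
        "C": {"accuracy": 0.7, "f1": 0.7},
    }))
```

Combining ranks rather than raw scores keeps measures with different ranges from dominating one another, which is presumably why a rank-based summary suits a comparison across several measures.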

Index Terms:
Text mining, literature mining, binary classification, protein-protein interaction, citation network.
Artemy Kolchinsky, Alaa Abi-Haidar, Jasleen Kaur, Ahmed Abdeen Hamed, Luis M. Rocha, "Classification of Protein-Protein Interaction Full-Text Documents Using Text and Citation Network Features," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 3, pp. 400-411, July-Sept. 2010, doi:10.1109/TCBB.2010.55