Issue No.01 - January/February (2012 vol.9)
pp: 305-310
K. H. Ambert , Dept. of Med. Inf. & Clinical Epidemiology, Oregon Health & Sci. Univ., Portland, OR, USA
A. M. Cohen , Dept. of Med. Inf. & Clinical Epidemiology, Oregon Health & Sci. Univ., Portland, OR, USA
Although publicly accessible databases containing protein-protein interaction (PPI)-related information are important resources for bench and in silico research scientists alike, the time and effort required to keep them up to date is often burdensome. Text-mining tools from the machine learning discipline can help identify relevant PPI publications. Here, we describe and evaluate two document classification algorithms that we submitted to the BioCreative II.5 PPI Classification Challenge Task. This task asked participants to design classifiers for identifying documents containing PPI-related information in the primary literature, and evaluated the submissions against one another. One of our systems was the overall best-performing system submitted to the challenge task. It uses a novel approach to k-nearest neighbor classification, which we describe here. We compare its performance to that of two support vector machine-based classification systems, one of which was also evaluated in the challenge task.
Proteins, Databases, Training, Bioinformatics, Support vector machines, Computational biology, Electronic mail, text classification, Protein-protein interaction, k-nearest neighbor, information gain, support vector machine
K. H. Ambert, A. M. Cohen, "k-Information Gain Scaled Nearest Neighbors: A Novel Approach to Classifying Protein-Protein Interaction-Related Documents", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.9, no. 1, pp. 305-310, January/February 2012, doi:10.1109/TCBB.2011.32