Issue No.03 - May/June (2011 vol.8)
pp: 851-857
Aditya Kumar Sehgal, Parity Computing, La Jolla
Sanmay Das, Rensselaer Polytechnic Institute, Troy
Keith Noto, University of California at San Diego, La Jolla
Milton H. Saier, Jr., University of California at San Diego, La Jolla
Charles Elkan, University of California at San Diego, La Jolla
ABSTRACT
With well over 1,000 specialized biological databases in use today, the task of automatically identifying novel, relevant data for such databases is increasingly important. In this paper, we describe practical machine learning approaches for identifying MEDLINE documents and Swiss-Prot/TrEMBL protein records for incorporation into TCDB, a specialized biological database of transport proteins. We show that both learning approaches outperform rules handcrafted by a human expert. Because this is one of the first case studies involving two different approaches to updating a deployed database, both the methods compared and the results should be of interest to curators of many specialized databases.
INDEX TERMS
Bioinformatics (genome or protein) databases, clustering, classification, association rules, text mining, biomedical text classification, data mining.
CITATION
Aditya Kumar Sehgal, Sanmay Das, Keith Noto, Milton H. Saier, Jr., Charles Elkan, "Identifying Relevant Data for a Biological Database: Handcrafted Rules versus Machine Learning", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.8, no. 3, pp. 851-857, May/June 2011, doi:10.1109/TCBB.2009.83