Issue No. 03 - May/June (2011 vol. 8)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TCBB.2009.83
Aditya Kumar Sehgal, Parity Computing, La Jolla
Sanmay Das, Rensselaer Polytechnic Institute, Troy
Keith Noto, University of California at San Diego, La Jolla
Milton H. Saier, Jr., University of California at San Diego, La Jolla
Charles Elkan, University of California at San Diego, La Jolla
With well over 1,000 specialized biological databases in use today, the task of automatically identifying novel, relevant data for such databases is increasingly important. In this paper, we describe practical machine learning approaches for identifying MEDLINE documents and Swiss-Prot/TrEMBL protein records, for incorporation into a specialized biological database of transport proteins named TCDB. We show that both learning approaches outperform rules created by hand by a human expert. As one of the first case studies involving two different approaches to updating a deployed database, both the methods compared and the results will be of interest to curators of many specialized databases.
Bioinformatics (genome or protein) databases, clustering, classification, association rules, text mining, biomedical text classification, data mining.
K. Noto, M. H. Saier, Jr., S. Das, A. K. Sehgal and C. Elkan, "Identifying Relevant Data for a Biological Database: Handcrafted Rules versus Machine Learning," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 3, pp. 851-857, 2011.