Issue No.02 - March/April (2011 vol.8)
pp: 308-315
Jong Cheol Jeong, University of Kansas, Lawrence
Xue-Wen Chen, University of Kansas, Lawrence
While genome sequencing projects have generated tremendous amounts of protein sequence data for a vast number of genomes, substantial portions of most genomes remain unannotated. Experimental methods for identifying protein function have been successful, but they are often labor-intensive and time-consuming. In silico methods are therefore the only practical option for genome-wide functional annotation. In this paper, we propose new features extracted from protein sequence alone, together with machine learning-based methods for computational function prediction. These features are derived from the position-specific scoring matrix, which has shown great potential in other bioinformatics problems. We evaluate the features using four different classifiers on yeast protein data. Our experimental results show that features derived from the position-specific scoring matrix are appropriate for automatic function annotation.
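The abstract does not say how the features are computed from the position-specific scoring matrix (PSSM). As an illustration only, the sketch below shows one common way to turn a variable-length PSSM into a fixed-length vector that a classifier can consume: average the PSSM rows grouped by the residue found at each position, giving 20 x 20 = 400 values. The function name `pssm_composition`, the input layout, and the handling of non-standard residues are assumptions made for this example, and the result should not be read as the authors' feature definition.

```python
# Hypothetical illustration; NOT the feature definition used in the paper.
# Column order of the 20 amino acids in PSI-BLAST ASCII PSSM output.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_composition(sequence, pssm):
    """Average PSSM rows grouped by the residue at each position.

    sequence: protein sequence string of length L
    pssm:     list of L rows, each holding 20 scores (one per column)
    Returns a flat list of 400 floats (20 residue groups x 20 columns).
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")

    sums = {aa: [0.0] * 20 for aa in AMINO_ACIDS}
    counts = {aa: 0 for aa in AMINO_ACIDS}

    for residue, row in zip(sequence, pssm):
        if residue not in sums:
            continue  # assumption: skip non-standard residues such as 'X'
        if len(row) != 20:
            raise ValueError("each PSSM row must contain 20 scores")
        counts[residue] += 1
        for j, score in enumerate(row):
            sums[residue][j] += score

    features = []
    for aa in AMINO_ACIDS:
        n = counts[aa]
        # Residue types absent from the sequence contribute zeros.
        features.extend((s / n) if n else 0.0 for s in sums[aa])
    return features
```

Because the output always has 400 entries regardless of sequence length, proteins of different lengths can be stacked into a single feature matrix and passed to any standard classifier.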
Clustering, classification, and association rules, data mining, feature extraction or construction, mining methods and algorithms.
Jong Cheol Jeong, Xue-Wen Chen, "On Position-Specific Scoring Matrix for Protein Function Prediction", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.8, no. 2, pp. 308-315, March/April 2011, doi:10.1109/TCBB.2010.93