Issue No.05 - Sept.-Oct. (2012 vol.9)
pp: 1432-1441
Bilge Karacali, Dept. of Electr. & Electron. Eng., Izmir Inst. of Technol., Urla Izmir, Turkey
ABSTRACT
We propose hierarchical motif vectors to represent local amino acid sequence configurations for predicting the functional attributes of amino acid sites on a global scale in a quasi-supervised learning framework. The motif vectors are constructed via wavelet decomposition of the variations of physico-chemical amino acid properties along the sequences. We then formulate a prediction scheme for the functional attributes of amino acid sites in terms of the respective motif vectors, using the quasi-supervised learning algorithm, which carries out predictions for all sites under consideration using only the experimentally verified sites. We have carried out a comparative performance evaluation of the proposed method on the prediction of N-glycosylation of 55,184 sites possessing the consensus N-glycosylation sequon, identified over 15,104 human proteins, out of which only 1,939 were experimentally verified N-glycosylation sites. In the experiments, the proposed method achieved better predictive performance than the alternative strategies from the literature. In addition, the predicted N-glycosylation sites showed good agreement with existing potential annotations, while the novel predictions belonged to proteins known to be modified by glycosylation.
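As a rough illustration of the two ingredients named in the abstract, the sketch below locates candidate sites carrying the canonical N-X-S/T sequon (X not proline) and builds a multi-scale vector around each site by wavelet-decomposing a property profile. It is not the paper's implementation: the abstract does not state which wavelet, which physico-chemical properties, or which window size were used, so the Haar wavelet, the Kyte-Doolittle hydropathy scale, the 16-residue window, and the function names `sequon_sites`, `haar_decompose`, and `motif_vector` are all assumptions made for this example.

```python
import re

# Kyte-Doolittle hydropathy index. ASSUMPTION: one illustrative property;
# the abstract does not say which physico-chemical properties were used.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Canonical sequon N-X-[S/T] with X != P. The lookahead keeps overlapping
# occurrences (e.g. "NNTS" contains two candidate asparagines).
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_sites(seq):
    """Return 0-based positions of Asn residues that start a sequon."""
    return [m.start() for m in SEQUON.finditer(seq)]


def haar_decompose(signal):
    """Full orthonormal Haar decomposition of a signal of length 2**k.

    Returns [approximation, coarsest detail, ..., finest details].
    ASSUMPTION: Haar is used only because it is the simplest wavelet.
    """
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        level = [(a - b) / 2 ** 0.5 for a, b in pairs]
        approx = [(a + b) / 2 ** 0.5 for a, b in pairs]
        details = level + details  # coarser levels end up in front
    return approx + details


def motif_vector(seq, site, half_width=8, pad=0.0):
    """Haar coefficients of the property profile in a window around `site`.

    The window covers positions site-half_width .. site+half_width-1, so
    half_width must be a power of two. Positions outside the sequence and
    unknown residue letters are filled with `pad` (hypothetical choice).
    """
    window = [
        KD.get(seq[i], pad) if 0 <= i < len(seq) else pad
        for i in range(site - half_width, site + half_width)
    ]
    return haar_decompose(window)


if __name__ == "__main__":
    protein = "MNGTNPSANNTS"  # made-up toy sequence, not real data
    for pos in sequon_sites(protein):
        print(pos, [round(c, 2) for c in motif_vector(protein, pos)])
```

In this sketch the first coefficient summarizes the whole window and later coefficients describe progressively finer local variation, which is one way to read "hierarchical". A classifier or the quasi-supervised learning step described in the abstract would then operate on these vectors; that step is not reproduced here.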
INDEX TERMS
proteins, biochemistry, biological techniques, molecular biophysics, human protein sequence analysis, hierarchical motif vectors, amino acid sequence configuration, functional site prediction, quasi-supervised learning framework, wavelet decomposition, physicochemical amino acid properties, consensus N-glycosylation sequon, Vectors, Amino acids, Proteins, Humans, Approximation methods, Prediction algorithms, Databases, quasi-supervised learning, Functional attribute prediction, protein sequence analysis
CITATION
Bilge Karacali, "Hierarchical Motif Vectors for Prediction of Functional Sites in Amino Acid Sequences Using Quasi-Supervised Learning", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 5, pp. 1432-1441, Sept.-Oct. 2012, doi:10.1109/TCBB.2012.68