Issue No.06 - November/December (2011 vol.8)
pp: 1721-1726
Fernanda Caldas Cardoso , Federal University of Minas Gerais, Belo Horizonte
Thiago de Souza Rodrigues , Federal Center of Technological Education of Minas Gerais, Belo Horizonte
Sérgio Costa Oliveira , Federal University of Minas Gerais, Belo Horizonte
Antônio Pádua Braga , Federal University of Minas Gerais, Belo Horizonte
A large number of unclassified sequences are still found in public databases, which suggests that new investigations in the area are still needed. In this contribution, we present a methodology based on Artificial Neural Networks for protein functional classification. A new protein coding scheme, called here Extended-Sequence Coding by Sliding Windows, is presented with the goal of overcoming some of the difficulties of the well-known method Sequence Coding by Sliding Windows. The new protein coding scheme uses more than one sliding window length, with a weight factor that is proportional to the window length, avoiding the ambiguity problem without ignoring the identity of small subsequences. Accuracy for Sequence Coding by Sliding Windows ranged from 60.1 to 77.7 percent for the first bacterium protein set and from 61.9 to 76.7 percent for the second one, whereas the accuracy for the proposed Extended-Sequence Coding by Sliding Windows scheme ranged from 70.7 to 97.1 percent for the first bacterium protein set and from 61.1 to 93.3 percent for the second one. Additionally, protein sequences classified inconsistently by the Artificial Neural Networks were analyzed with CD-Search, revealing some disagreement among public repositories. This draws attention to the relevant issue of error propagation in annotated databases due to incorrectly transferred annotations.
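The idea of combining several sliding window lengths, each weighted in proportion to its length, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sliding_window_counts` and `extended_coding`, the default window lengths, and the choice of using the window length itself as the weight are assumptions made for illustration, since the abstract does not give the exact weighting or normalization.

```python
from collections import Counter


def sliding_window_counts(sequence, window):
    """Count each subsequence of length `window`, sliding one residue at a time."""
    return Counter(
        sequence[i:i + window] for i in range(len(sequence) - window + 1)
    )


def extended_coding(sequence, windows=(1, 2, 3)):
    """Merge counts from several window lengths into one feature dictionary.

    Assumption: each count is multiplied by its window length, as one
    simple reading of "a weight factor proportional to the window length".
    """
    features = {}
    for w in windows:
        for subsequence, count in sliding_window_counts(sequence, w).items():
            features[subsequence] = w * count
    return features


# Hypothetical toy sequence, not taken from the paper's data sets.
example = extended_coding("MKVLMKV", windows=(1, 2))
```

In this sketch, single residues keep their raw counts while two-residue subsequences count double, so longer matches contribute more to the feature vector without discarding the short ones. A real pipeline would still need a fixed feature ordering and some normalization before feeding the vector to a neural network.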
Artificial neural network, protein functional classification, protein coding, protein functional classification error.
Fernanda Caldas Cardoso, Thiago de Souza Rodrigues, Sérgio Costa Oliveira, Antônio Pádua Braga, "Protein Classification with Extended-Sequence Coding by Sliding Window", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.8, no. 6, pp. 1721-1726, November/December 2011, doi:10.1109/TCBB.2011.78