Issue No.02 - March-April (2013 vol.10)
pp: 457-467
De-Shuang Huang , Sch. of Electron. & Inf. Eng., Tongji Univ., Shanghai, China
Hong-Jie Yu , Dept. of Math., Anhui Sci. & Technol. Univ., Fengyang, China
ABSTRACT
Based on all kinds of adjacent amino acids (AAA), we map each primary protein sequence into a 400 by (L-1) matrix M. We then derive a normalized 400-tuple mathematical descriptor D, which is extracted from the primary protein sequence via singular value decomposition (SVD) of the matrix. The resulting 400-D normalized feature vectors (NFVs) facilitate quantitative analysis of protein sequences. Using this normalized representation of the primary protein sequences, we analyze the similarity of different sequences on two data sets: 1) ND5 sequences from nine species and 2) transferrin sequences of 24 vertebrates. We also compare the results of this study with those of related works. These two experiments illustrate that the proposed NFV-AAA approach performs well in sequence similarity analysis.
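The pipeline described above (adjacent-pair matrix, SVD, normalized 400-D vector, pairwise comparison) can be illustrated with a minimal sketch. The abstract does not say how the entries of M are filled or how D is read off from the SVD, so two assumptions are made here and flagged in the code: each column of M is an indicator of one adjacent residue pair, and D is the zero-padded vector of singular values scaled to unit Euclidean norm. The function names, the residue ordering, and the use of Euclidean distance are likewise hypothetical and are not taken from the paper.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (ordering is arbitrary)
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def adjacency_matrix(seq):
    """Return a 400 x (L-1) matrix with one column per adjacent residue pair.

    ASSUMPTION: column j holds a single 1 in the row of the pair
    (seq[j], seq[j+1]); the abstract does not state how M is filled.
    """
    if len(seq) < 2:
        raise ValueError("need at least two residues")
    m = np.zeros((400, len(seq) - 1))
    for j in range(len(seq) - 1):
        a, b = seq[j], seq[j + 1]
        if a not in INDEX or b not in INDEX:
            raise ValueError(f"non-standard residue in pair {a}{b}")
        m[20 * INDEX[a] + INDEX[b], j] = 1.0
    return m


def normalized_feature_vector(seq):
    """Return a unit-norm 400-D vector derived from the SVD of M.

    ASSUMPTION: the descriptor is the vector of singular values of M,
    zero-padded to 400 entries; the paper's actual rule may differ.
    """
    s = np.linalg.svd(adjacency_matrix(seq), compute_uv=False)
    d = np.zeros(400)
    d[: s.size] = s  # s has min(400, L-1) entries, in descending order
    return d / np.linalg.norm(d)


def distance(seq_a, seq_b):
    """Euclidean distance between two feature vectors (smaller = more similar)."""
    diff = normalized_feature_vector(seq_a) - normalized_feature_vector(seq_b)
    return float(np.linalg.norm(diff))
```

A call such as `distance("MKVLAAGIV", "MKVLSAGIV")` then gives a single dissimilarity score for two sequences of possibly different lengths, with no alignment step. Under the indicator assumption the singular values are simply the square roots of the adjacent-pair counts, so this sketch shows the mechanics only and should not be read as a reproduction of the published descriptor.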
INDEX TERMS
Proteins, Amino acids, Vectors, Feature extraction, Bioinformatics, Educational institutions, alignment free, similarity analysis, adjacent amino acids, normalized feature vector, singular value decomposition (SVD)
CITATION
De-Shuang Huang, Hong-Jie Yu, "Normalized Feature Vectors: A Novel Alignment-Free Sequence Comparison Method Based on the Numbers of Adjacent Amino Acids", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.10, no. 2, pp. 457-467, March-April 2013, doi:10.1109/TCBB.2013.10