This Article 
 Bibliographic References 
 Add to: 
Composition Vector Method Based on Maximum Entropy Principle for Sequence Comparison
January/February 2012 (vol. 9 no. 1)
pp. 79-87
R. H. Chan, Dept. of Math., Chinese Univ. of Hong Kong, Hong Kong, China
T. H. Chan, Dept. of Math., Chinese Univ. of Hong Kong, Hong Kong, China
Hau Man Yeung, Dept. of Math., Chinese Univ. of Hong Kong, Hong Kong, China
R. W. Wang, CAS-MPG Partner Inst. for Comput. Biol., Shanghai, China
The composition vector (CV) method is an alignment-free method for sequence comparison. Because of its simplicity when compared with multiple sequence alignment methods, the method has been widely discussed lately; and some formulas based on probabilistic models, like Hao's and Yu's formulas, have been proposed. In this paper, we improve these formulas by using the entropy principle which can quantify the nonrandomness occurrence of patterns in the sequences. More precisely, existing formulas are used to generate a set of possible formulas from which we choose the one that maximizes the entropy. We give the closed-form solution to the resulting optimization problem. Hence, from any given CV formula, we can find the corresponding one that maximizes the entropy. In particular, we show that Hao's formula is itself maximizing the entropy and we derive a new entropy-maximizing formula from Yu's formula. We illustrate the accuracy of our new formula by using both simulated and experimental data sets. For the simulated data sets, our new formula gives the best consensus and significant values for three different kinds of evolution models. For the data set of tetrapod 18S rRNA sequences, our new formula groups the clades of bird and reptile together correctly, where Hao's and Yu's formulas failed. Using real data sets with different sizes, we show that our formula is more accurate than Hao's and Yu's formulas even for small data sets.

[1] D.R. Arahal, F.E. Dewhirst, B.J. Paster, B.E. Volcani, and A. Ventosa, “Phylogenetic Analyses of Some Extremely Halophilic Archaea Isolated from Dead Sea Water, Determined on the Basis of Their 16S rRNA Sequences,” Applied and Environmental Microbiology, vol. 62, pp. 3779-3786, 1996.
[2] M.W. Berry, Z. Drmac, and E.R. Jessup, “Matrices, Vector Spaces, and Information Retrieval,” SIAM Rev., vol. 41, pp. 335-362, 1999.
[3] D.P. Bertsekas, “Nonlinear Programming,” Athena Scientific, 1999.
[4] V. Brendel, J.S. Beckmann, and E.N. Trifonov, “Linguistics of Nucleotide Sequences: Morphology and Comparison of Vocabularies,” J. Biomolecular Structure and Dynamics, vol. 4, pp. 11-20, 1986.
[5] K.H. Chu, J. Qi, Z.G. Yu, and V. Anh, “Origin and Phylogeny of Chloroplasts: A Simple Correlation Analysis of Complete Genomes,” Molecular Biology and Evolution, vol. 21, pp. 200-206, 2004.
[6] L. Gao and J. Qi, “Whole Genome Molecular Phylogeny of Large dsDNA Viruses Using Composition Vector Method,” BMC Evolutionary Biology, vol 7, pp. 1-7, 2007.
[7] L. Gao, J. Qi, H. Wei, Y. Sun, and B.L. Hao, “Molecular Phylogeny of Coronaviruses Including Human SARS-CoV,” Chinese Science Bull., vol. 48, pp. 1170-1174, 2003.
[8] Z. Gu, X. Zhao, N. Li, and C. Wu, “Complete Sequence of the Yak (Bos Grunniens) Mitochondrial Genome and its Evolutionary Relationship with Other Ruminants,” Molecular Phylogenetics and Evolution, vol. 42, pp. 248-255, 2007.
[9] B.L. Hao, J. Qi, and B. Wang, “Prokaryotic Phylogeny Based on Complete Genomes without Sequence Alignment,” Modern Physics Letters B, vol. 2, pp. 1-4, 2003.
[10] S.B. Hedges, K.D. Moberg, and L.R. Maxson, “Tetrapod Phylogeny Inferred from 18S and 28S Ribosomal RNA Sequences and a Review of the Evidence for Amniote Relationships,” Molecular Biology and Evolution, vol. 7, pp. 607-633, 1990.
[11] R. Hu and B. Wang, “Statistically Significant Strings are Related to Regulatory Elements in the Promoter Regions of Saccharomyces Cerevisiae,” Physica A: Statistical Mechanics and Its Applications, vol. 290, pp. 464-474, 2001.
[12] G. Lu, S. Zhang, and X. Fang, “An Improved String Composition Method for Sequence Comparison,” BMC Bioinformatics, vol. 9, Suppl 6, p. S15, 2008.
[13] T. Margush and F.R. McMorris, “Consensus n-Trees,” Bull. of Math. Biology, vol. 43, pp. 239-244, 1981.
[14] S.B. Needleman and C.D. Wunsch, “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins,” J. Molecular Biology, vol. 48, pp. 443-453, 1970.
[15] P.A.S. Nuin, Z. Wang, and E.R.M. Tillier, “The Accuracy of Several Multiple Sequence Alignment Programs for Proteins,” BMC Bioinformatics, vol 7, pp. 1-18, 2006.
[16] G.J. Phillips, J. Arnold, and R. Ivarie, “Mono-Through Hexanucleotide Composition of the Escherichia Coli Genome: A Markov Chain Analysis,” Nucleic Acids Research, vol. 15, pp. 2611-2626, 1987.
[17] J. Qi, B. Wang, and B.L. Hao, “Whole Proteome Prokaryote Phylogeny without Sequence Alignment: A k-String Composition Approach,” J. Molecular Evolution, vol. 58, pp. 1-11, 2004.
[18] M.S. Rosenberg, “MySSP: Non-Stationary Evolutionary Sequence Simulation, Including Indels,” Evolutionary Bioinformatics Online, vol. 1, pp. 51-53, 2005.
[19] N. Saitou and M. Nei, “The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees,” Molecular Biology and Evolution, vol. 4, pp. 406-425, 1987.
[20] A.E. Shannon, “A Mathematical Theory of Communication,” Bell System Technical J., vol. 27, pp. 379-423, 1948.
[21] D. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, third ed. CRC Press, 2004.
[22] T.T. Smith and M.S. Waterman, “Identification of Common Molecular Subsequences,” J. Molecular Biology, vol. 147, pp. 195-197, 1981.
[23] S.J. Sogin, M.L. Sogin, and C.R. Woese, “Phylogenetic Measurement in Procaryotes by Primary Structural Characterization,” J. Molecular Evolution, vol. 1, pp. 173-184, 1972.
[24] G.W. Stuart and M.W. Berry, “An SVD-Based Comparison of Nine Whole Eukaryotic Genomes Supports a Coelomate Rather than Ecdysozoan Lineage,” BMC Bioinformatics, vol 5, p. 204, 2004.
[25] G.W. Stuart, K. Moffett, and J.J. Leader, “A Comprehensive Vertebrate Phylogeny Using Vector Representations of Protein Sequences from Whole Genomes,” Molecular Biology and Evolution, vol. 19, pp. 554-562, 2002.
[26] J.A. Studier and K.J. Keppler, “A Note of the Neighbor-Joining Algorithm of Saitou and Nei,” Molecular Biology and Evolution, vol. 5, pp. 729-731, 1988.
[27] K. Tamura, J. Dudley, M. Nei, and S. Kumar, “MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) Software Version 4.0,” Molecular Biology and Evolution, vol. 24, pp. 1596-1599, 2007.
[28] S. Vinga and J. Almeida, “Alignment Free Sequence Comparison-a Review,” Bioinformatics, vol. 19, pp. 513-523, 2003.
[29] C.R. Woese, “Bacterial Evolution,” Microbiology Rev., vol. 51, pp. 221-271, 1987.
[30] X. Wu, X.F. Wan, G. Wu, D. Xu, and G. Lin, “Phylogenetic Analysis Using Complete Signature Information of Whole Genomes and Clustered Neighbor-Joining Method,” Int'l J. Bioinformatics Research and Applications, vol. 2, pp. 219-248, 2006.
[31] X. Xia, Z. Xie, and K.M. Kjer, “18S Ribosomal RNA and Tetrapod Phylogeny,” Systematic Biology, vol. 52, pp. 283-295, 2003.
[32] H.M. Xie, Grammatical Complexity and One-Dimensional Dynamical Systems. World Scientific, 1996.
[33] Z.G. Yu, X.W. Zhan, G.S. Han, R.W. Wang, V. Anh, and K.H. Chu, “Proper Distance Metrics for Phylogenetic Analysis Using Complete Genomes without Sequence Alignment,” Int'l J. Molecular Sciences, vol. 11, pp. 1141-1154, 2010.
[34] Z.G. Yu, L.Q. Zhou, V. Anh, K.H. Chu, S.C. Long, and J.Q. Deng, “Phylogeny of Prokaryotes and Chloroplasts Revealed by a Simple Composition Approach on All Protein Sequences from Whole Genome without Sequence Alignment,” J. Molecular Evolution, vol. 60, pp. 538-545, 2005.

Index Terms:
probability,macromolecules,maximum entropy methods,molecular biophysics,organic compounds,physiological models,tetrapod 18S rRNA sequences,composition vector method,maximum entropy principle,sequence comparison,multiple sequence alignment methods,probabilistic models,Haos formula,Yus formula,closed-form solution,optimization problem,entropy-maximizing formula,Entropy,Estimation,Optimization,Computational modeling,Bioinformatics,Phylogeny,Strain,phylogenetics.,Composition vector method,maximum entropy principle,optimization model,alignment-free sequence comparison
R. H. Chan, T. H. Chan, Hau Man Yeung, R. W. Wang, "Composition Vector Method Based on Maximum Entropy Principle for Sequence Comparison," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 1, pp. 79-87, Jan.-Feb. 2012, doi:10.1109/TCBB.2011.45
Usage of this product signifies your acceptance of the Terms of Use.