Issue No.04 - July-Aug. (2013 vol.10)
pp: 884-896
Ken D. Nguyen, Dept. of Comput. Sci. & Inf. Technol., Clayton State Univ., Morrow, GA, USA
Yi Pan, Dept. of Comput. Sci., Georgia State Univ., Atlanta, GA, USA
A common and cost-effective way to identify the functions, structures, or relationships of species is multiple-sequence alignment, in which DNA, RNA, or protein sequences are arranged so that similar regions line up together. Correctly identifying and aligning these biological sequence similarities supports work ranging from unraveling the mystery of species evolution to drug design. We present a knowledge-based multiple-sequence alignment (KB-MSA) technique that uses existing knowledge databases such as SWISSPROT, GENBANK, or HOMSTRAD to provide more realistic and reliable sequence alignments. We also provide a modified version of this algorithm (CB-MSA) that uses sequence consistency information when sequence knowledge databases are not available. Our benchmark tests on the BAliBASE, PREFAB, HOMSTRAD, and SABMARK references show accuracy improvements of up to 10 percent on twilight data sets over many leading alignment tools, including ISPALIGN, PADT, CLUSTALW, MAFFT, PROBCONS, and T-COFFEE.
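To make the idea of "aligning" sequences concrete, the sketch below shows classic Needleman-Wunsch global alignment of two sequences, the pairwise building block that progressive multiple-sequence aligners typically rely on. It is not the KB-MSA or CB-MSA algorithm. The function name and the match, mismatch, and gap scores are illustrative assumptions; real tools use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment (illustrative scores, not from the paper).

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, x, y = needleman_wunsch("ACGT", "AGT")
    print(s)
    print(x)
    print(y)
```

With these assumed scores, aligning `ACGT` against `AGT` places a single gap opposite the `C`, so the three shared residues fall in the same columns.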
Databases, Bioinformatics, Computational biology, Knowledge based systems, Phylogeny, Amino acids, consistency MSA, multiple-sequence alignment, knowledge-based MSA, progressive MSA
Ken D. Nguyen, Yi Pan, "A Knowledge-Based Multiple-Sequence Alignment Algorithm", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.10, no. 4, pp. 884-896, July-Aug. 2013, doi:10.1109/TCBB.2013.102