Issue No.01 - January-February (2011 vol.8)
pp: 194-205
Ankit Agrawal, Iowa State University, Ames
Xiaoqiu Huang, Iowa State University, Ames
ABSTRACT
Pairwise sequence alignment is a central problem in bioinformatics and forms the basis of many other applications. Two related sequences are expected to have a high alignment score, but relatedness is usually judged by statistical significance rather than by the alignment score itself. Recently, pairwise statistical significance was shown to give promising results as an alternative to database statistical significance for obtaining individual significance estimates of pairwise alignment scores. The improvement was mainly attributed to making the statistical significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive estimates of pairwise statistical significance, an approach that is expected to bring more sequence-specific information into the estimation. Experiments were conducted on a benchmark database with sequence-specific substitution matrices at different levels of sequence-specific contribution. The results confirm that using sequence-specific substitution matrices for estimating pairwise statistical significance is significantly better than using a standard matrix like BLOSUM62, and significantly better than the database statistical significance estimates reported by popular database search programs like BLAST, PSI-BLAST (without pretrained PSSMs), and SSEARCH. With pretrained PSSMs, however, PSI-BLAST results are significantly better. Further, using position-specific substitution matrices for estimating pairwise statistical significance gives significantly better results than even PSI-BLAST using pretrained PSSMs.
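The notion of a pairwise, database-independent significance estimate can be illustrated with a small sketch. It is not the authors' implementation. It assumes one common way of obtaining such an estimate: score the pair, re-score the first sequence against shuffled copies of the second, fit an extreme value (Gumbel) distribution to the shuffled scores, and read off the tail probability of the observed score. The names `local_score` and `pairwise_p_value` are hypothetical, the match/mismatch scoring with a linear gap penalty stands in for a real substitution matrix such as BLOSUM62, and the method-of-moments fit is a simplification chosen for brevity.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    Assumption: a toy match/mismatch scheme replaces a real substitution matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def pairwise_p_value(seq1, seq2, n_shuffles=200, seed=0):
    """Estimate P(score >= observed) from shuffles of seq2 (illustrative only)."""
    rng = random.Random(seed)
    observed = local_score(seq1, seq2)

    # Null scores: seq1 against composition-preserving shuffles of seq2.
    letters = list(seq2)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        null_scores.append(local_score(seq1, "".join(letters)))

    mean = sum(null_scores) / n_shuffles
    var = sum((s - mean) ** 2 for s in null_scores) / (n_shuffles - 1)
    std = math.sqrt(var)
    if std == 0.0:
        # Degenerate null distribution: no Gumbel fit is possible.
        return observed, (1.0 if observed <= mean else 0.0)

    # Method-of-moments Gumbel fit (assumption: simpler than a likelihood fit).
    lam = math.pi / (std * math.sqrt(6.0))
    mu = mean - EULER_GAMMA / lam

    # P(S >= x) = 1 - exp(-exp(-lam * (x - mu))), computed stably.
    exponent = min(-lam * (observed - mu), 700.0)
    p_value = -math.expm1(-math.exp(exponent))
    return observed, p_value


if __name__ == "__main__":
    s1 = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"
    s2 = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"
    score, p = pairwise_p_value(s1, s2)
    print(f"score={score}, estimated pairwise P-value={p:.3g}")
```

In this sketch the null distribution depends only on the two sequences being compared and on the scoring scheme, which is the sense in which such an estimate is sequence-specific and database-independent. Swapping the toy scoring function for a sequence-specific or position-specific matrix would change the scores but not the overall procedure.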
INDEX TERMS
Database statistical significance, homologs, pairwise statistical significance, position-specific scoring matrices (PSSMs), sequence alignment, substitution matrices.
CITATION
Ankit Agrawal, Xiaoqiu Huang, "Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 1, pp. 194-205, January-February 2011, doi:10.1109/TCBB.2009.69