Improved Gapped Alignment in BLAST
July-September 2004 (vol. 1 no. 3)
pp. 116-129
Homology search is a key tool for understanding the role, structure, and biochemical function of genomic sequences. The most popular technique for rapid homology search is BLAST, which has been in widespread use within universities, research centers, and commercial enterprises since the early 1990s. In this paper, we propose a new step in the BLAST algorithm to reduce the computational cost of searching with negligible effect on accuracy. This new step, semigapped alignment, compromises between the efficiency of ungapped alignment and the accuracy of gapped alignment, allowing BLAST to accurately filter sequences with lower computational cost. In addition, we propose a heuristic, restricted insertion alignment, that avoids unlikely evolutionary paths with the aim of reducing gapped alignment cost with negligible effect on accuracy. Together, after including an optimization of the local alignment recursion, our two techniques more than double the speed of the gapped alignment stages in BLAST. We conclude that our techniques are an important improvement to the BLAST algorithm. Source code for the alignment algorithms is available for download at
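For readers unfamiliar with the local alignment recursion mentioned in the abstract, the following is a minimal sketch of the standard affine-gap local alignment score (the Smith-Waterman recursion with Gotoh-style gap handling). It is not the optimized recursion, semigapped alignment, or restricted insertion alignment proposed in the paper, whose details the abstract does not give. The function name, the simple match/mismatch scoring, and the gap penalties are illustrative assumptions; BLAST itself scores proteins with a substitution matrix.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap_first=-4, gap_next=-1):
    """Best local alignment score of a and b under affine gap costs.

    Assumed (hypothetical) scoring: gap_first is the total cost of the first
    character of a gap, gap_next the cost of each further character.
    """
    neg_inf = float("-inf")
    width = len(b) + 1
    prev_h = [0] * width        # best score ending at (i-1, j), any state
    prev_f = [neg_inf] * width  # best score ending at (i-1, j) in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        cur_h = [0] * width
        cur_f = [neg_inf] * width
        e = neg_inf             # best score ending at (i, j) in a horizontal gap
        for j in range(1, width):
            # Open a new gap from the best cell, or extend the existing gap.
            e = max(cur_h[j - 1] + gap_first, e + gap_next)
            cur_f[j] = max(prev_h[j] + gap_first, prev_f[j] + gap_next)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a score never drops below zero (restart).
            cur_h[j] = max(0, prev_h[j - 1] + s, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best
```

Each cell costs a constant number of comparisons, so the full table takes time proportional to the product of the two sequence lengths. That cost is the reason a search tool would filter candidates with cheaper alignment stages before running a full gapped alignment.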


Index Terms:
Sequence alignment, BLAST, dynamic programming, homology search.
Michael Cameron, Hugh E. Williams, Adam Cannane, "Improved Gapped Alignment in BLAST," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, no. 3, pp. 116-129, July-Sept. 2004, doi:10.1109/TCBB.2004.32