Comparing Compressed Sequences for Faster Nucleotide BLAST Searches
July-September 2007 (vol. 4 no. 3)
pp. 349-364
Molecular biologists, geneticists, and other life scientists use the BLAST homology search package as their first step for discovery of information about unknown or poorly annotated genomic sequences. There are two main variants of BLAST: BLASTP for searching protein collections and BLASTN for nucleotide collections. Surprisingly, BLASTN has had very little attention; for example, the algorithms it uses do not follow those described in the 1997 BLAST paper [1] and no exact description has been published. It is important that BLASTN is state-of-the-art: Nucleotide collections such as GenBank dwarf the protein collections in size, they double in size almost yearly, and they take many minutes to search on modern general purpose workstations. This paper proposes significant improvements to the BLASTN algorithms. Each of our schemes is based on compressed bytepacked formats that allow queries and collection sequences to be compared four bases at a time, permitting very fast query evaluation using lookup tables and numeric comparisons. Our most significant innovations are two new, fast gapped alignment schemes that allow accurate sequence alignment without decompression of the collection sequences. Overall, our innovations more than double the speed of BLASTN with no effect on accuracy and have been integrated into our new version of BLAST that is freely available for download from
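The idea of comparing bytepacked sequences four bases at a time can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-bit code in `CODE`, the helper names `pack` and `count_matches`, and the XOR-plus-table scheme are all assumptions chosen for illustration, and the sketch ignores ambiguity codes and sequences whose length is not a multiple of four.

```python
# Hypothetical 2-bit nucleotide code; the abstract does not specify the
# encoding actually used by the authors.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack a nucleotide string into bytes, four bases per byte.

    The first base of each group of four occupies the two highest bits.
    """
    if len(seq) % 4:
        raise ValueError("this sketch only handles lengths divisible by 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)


# MATCHES[x] is the number of identical bases when x is the XOR of two
# packed bytes: a 2-bit field of x is 00 exactly when the bases agree.
MATCHES = [
    sum(1 for shift in (0, 2, 4, 6) if (x >> shift) & 3 == 0)
    for x in range(256)
]


def count_matches(packed_a, packed_b):
    """Count identical positions, one table lookup per four bases."""
    return sum(MATCHES[a ^ b] for a, b in zip(packed_a, packed_b))


query = pack("ACGTACGT")
subject = pack("ACGAACGT")
print(count_matches(query, subject))
```

Each comparison step here touches one byte from each sequence and one table entry, so neither sequence has to be expanded back to one character per base.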

[1] S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.
[2] D. Benson, I. Karsch-Mizrachi, D. Lipman, J. Ostell, and D. Wheeler, “GenBank,” Nucleic Acids Research, vol. 32, pp. D23-D26, 2004.
[3] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, “Basic Local Alignment Search Tool,” J. Molecular Biology, vol. 215, no. 3, pp. 403-410, 1990.
[4] X.L. Chen, “A Framework for Comparing Homology Search Techniques,” master's thesis, School of Computer Science and Information Technology, RMIT Univ., 2004.
[5] S. McGinnis and T. Madden, “BLAST: At the Core a Powerful and Diverse Set of Sequence Analysis Tools,” Nucleic Acids Research, vol. 32, pp. W20-W25, 2004.
[6] H.E. Williams and J. Zobel, “Indexing and Retrieval for Genomic Databases,” IEEE Trans. Knowledge and Data Eng., vol. 14, no. 1, pp. 63-78, Jan./Feb. 2002.
[7] Z. Zhang, S. Schwartz, L. Wagner, and W. Miller, “A Greedy Algorithm for Aligning DNA Sequences,” J. Computational Biology, vol. 7, nos. 1-2, pp. 203-214, 2000.
[8] B. Ma, J. Tromp, and M. Li, “PatternHunter: Faster and More Sensitive Homology Search,” Bioinformatics, vol. 18, no. 3, pp. 440-445, 2002.
[9] M. Li, B. Ma, D. Kisman, and J. Tromp, “PatternHunter II: Highly Sensitive and Fast Homology Search,” J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 417-439, 2004.
[10] D. Brown, M. Li, and B. Ma, “A Tutorial of Recent Developments in the Seeding of Local Alignment,” J. Bioinformatics and Computational Biology, vol. 2, no. 4, pp. 819-842, 2004.
[11] H.E. Williams and J. Zobel, “Compression of Nucleotide Databases for Fast Searching,” Computer Applications in the Biosciences, vol. 13, no. 5, pp. 549-554, 1997.
[12] S. Wu, U. Manber, and E.W. Myers, “A Subquadratic Algorithm for Approximate Limited Expression Matching,” Algorithmica, vol. 15, no. 1, pp. 50-67, 1996.
[13] T. Smith and M. Waterman, “Identification of Common Molecular Subsequences,” J. Molecular Biology, vol. 147, no. 1, pp. 195-197, 1981.
[14] E. Myers and W. Miller, “Optimal Alignments in Linear Space,” Computer Applications in the Biosciences, vol. 4, no. 1, pp. 11-17, 1988.
[15] W. Pearson and D. Lipman, “Improved Tools for Biological Sequence Comparison,” Proc. Nat'l Academy of Sciences USA, vol. 85, no. 8, pp. 2444-2448, 1988.
[16] W. Pearson and D. Lipman, “Rapid and Sensitive Protein Similarity Searches,” Science, vol. 227, no. 4693, pp. 1435-1441, 1985.
[17] K. Chao, W. Pearson, and W. Miller, “Aligning Two Sequences within a Specified Diagonal Band,” Computer Applications in the Biosciences, vol. 8, no. 5, pp. 481-487, 1992.
[18] S. Brenner, C. Chothia, and T. Hubbard, “Assessing Sequence Comparison Methods with Reliable Structurally Identified Distant Evolutionary Relationships,” Proc. Nat'l Academy of Sciences USA, vol. 95, no. 11, pp. 6073-6078, 1998.
[19] H.E. Williams and J. Zobel, “Indexing Nucleotide Databases for Fast Query Evaluation,” Proc. Fifth Int'l Conf. Extending Database Technology (EDBT '96), pp. 275-288, 1996.
[20] I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999.
[21] A. Califano and I. Rigoutsos, “FLASH: A Fast Look-up Algorithm for String Homology,” Proc. Fourth Int'l Conf. Intelligent Systems for Molecular Biology, vol. 1, pp. 56-64, 1993.
[22] B. Orcutt and W. Barker, “Searching the Protein Database,” Bull. Math. Biology, vol. 46, pp. 545-552, 1984.
[23] C. Fondrat and P. Dessen, “A Rapid Access Motif Database (RAMdb) with a Search Algorithm for the Retrieval Patterns in Nucleic Acids or Protein Databanks,” Computer Applications in the Biosciences, vol. 11, no. 3, pp. 273-279, 1995.
[24] W. Kent, “BLAT—The BLAST-Like Alignment Tool,” Genome Research, vol. 12, no. 4, pp. 656-664, 2002.
[25] M. Cameron, H.E. Williams, and A. Cannane, “A Deterministic Finite Automaton for Faster Protein Hit Detection in BLAST,” J. Computational Biology, vol. 13, no. 4, pp. 965-978, 2006.
[26] W. Wilbur and D. Lipman, “Rapid Similarity Searches of Nucleic Acid and Protein Data Banks,” Proc. Nat'l Academy of Sciences USA, vol. 80, no. 3, pp. 726-730, 1983.
[27] M. Cameron, H.E. Williams, and A. Cannane, “Improved Gapped Alignment in BLAST,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 1, no. 3, pp. 116-129, 2004.
[28] O. Gotoh, “An Improved Algorithm for Matching Biological Sequences,” J. Molecular Biology, vol. 162, no. 3, pp. 705-708, 1982.
[29] Z. Zhang, P. Berman, and W. Miller, “Alignments without Low-Scoring Regions,” J. Computational Biology, vol. 5, no. 2, pp. 197-210, 1998.
[30] S. Karlin and S. Altschul, “Methods for Assessing the Statistical Significance of Molecular Sequence Features by Using General Scoring Schemes,” Proc. Nat'l Academy of Sciences USA, vol. 87, no. 6, pp. 2264-2268, 1990.
[31] S. Altschul and W. Gish, “Local Alignment Statistics,” Methods in Enzymology, vol. 266, pp. 460-480, 1996.
[32] D.J. States and P. Agarwal, “Compact Encoding Strategies for DNA Sequence Similarity Search,” Proc. Fourth Int'l Conf. Intelligent Systems for Molecular Biology, pp. 211-217, 1996.
[33] R. Durbin, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge Univ. Press, 1998.
[34] G. Myers, “A Four Russians Algorithm for Regular Expression Pattern Matching,” J. ACM, vol. 39, no. 2, pp. 432-448, 1992.
[35] G. Myers and R. Durbin, “A Table-Driven, Full-Sensitivity Similarity Search Algorithm,” J. Computational Biology, vol. 10, no. 2, pp. 103-117, 2003.
[36] M. Crochemore, G.M. Landau, and M. Ziv-Ukelson, “A Sub-Quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices,” Proc. 13th Ann. ACM-SIAM Symp. Discrete Algorithms (SODA '02), pp. 679-688, 2002.
[37] N.C. Jones and P.A. Pevzner, An Introduction to Bioinformatics Algorithms. MIT Press, 2004.
[38] A. Murzin, S. Brenner, T. Hubbard, and C. Chothia, “SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures,” J. Molecular Biology, vol. 247, no. 4, pp. 536-540, 1995.
[39] A. Andreeva, D. Howorth, S. Brenner, T. Hubbard, C. Chothia, and A. Murzin, “SCOP Database in 2004: Refinements Integrate Structure and Sequence Family Data,” Nucleic Acids Research, vol. 32, pp. D226-D229, 2004.
[40] W. Barker, J. Garavelli, Z. Hou, H. Huang, R. Ledley, P. McGarvey, H. Mewes, B. Orcutt, F. Pfeiffer, A. Tsugita, C. Vinayaka, C. Xiao, L. Yeh, and C. Wu, “Protein Information Resource: A Community Resource for Expert Annotation of Protein Data,” Nucleic Acids Research, vol. 29, no. 1, pp. 29-32, 2001.
[41] M. Gribskov and N. Robinson, “Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching,” Computers & Chemistry, vol. 20, pp. 25-33, 1996.
[42] Z. Ning, A.J. Cox, and J.C. Mullikin, “SSAHA: A Fast Search Method for Large DNA Databases,” Genome Research, vol. 11, no. 10, pp. 1725-1729, 2001.

Index Terms:
Homology search, BLAST, sequence alignment, compression, Four Russians algorithm
Michael Cameron, Hugh Williams, "Comparing Compressed Sequences for Faster Nucleotide BLAST Searches," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 349-364, July-Sept. 2007, doi:10.1109/TCBB.2007.1029