This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Superiority of Spaced Seeds for Homology Search
July-September 2007 (vol. 4 no. 3)
pp. 496-505
In homology search, good spaced seeds have higher sensitivity for the same cost (weight). However, elucidating the mechanism that confers power to spaced seeds and characterizing optimal spaced seeds still remain unsolved. This paper investigates these two important open questions by formally analyzing the average number of non-overlapping hits and the hit probability of a spaced seed in the Bernoulli sequence model. We prove that when the length of a non-uniformly spaced seed is bounded above by an exponential function of the seed weight, the seed outperforms strictly the traditional consecutive seed of the same weight in both (i) the average number of non-overlapping hits and (ii) the asymptotic hit probability. This clearly answers the first problem mentioned above in the Bernoulli sequence model. The theoretical study in this paper also gives a new solution to finding long optimal seeds.

[1] S.F. Altschul et al., “Basic Local Alignment Search Tool,” J.Molecular Biology, vol. 215, pp. 403-410, 1990.
[2] S.F. Altschul et al., “Gapped Blast and Psi-Blast: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, pp. 3389-3402, 1997.
[3] N. Balakrishnan and M.V. Koutras, Runs and Scans with Applications. John Wiley & Sons, 2002.
[4] S. Batzoglou, “The Many Faces of Sequence Alignment,” Briefings in Bioinformatics, vol. 6, pp. 6-22, 2005.
[5] B. Brejovà, D. Brown, and T. Vinař, “Optimal Spaced Seeds for Homologous Coding Regions,” J. Bioinformatics and Computational Biology, vol. 1, pp. 595-610, 2004.
[6] J. Buhler, “Efficient Large-Scale Sequence Comparison by Local-Sensitive Hashing,” Bioinformatics, vol. 17, pp. 419-428, 2001.
[7] J. Buhler, U. Keich, and Y. Sun, “Designing Seeds for Similarity Search in Genomic DNA,” Proc. Seventh Ann. Int'l Conf. Research in Computational Molecular Biology (RECOMB '03), pp. 67-75, 2003.
[8] S. Burkhardt and J. Karkkainen, “Better Filtering with Gapped q-Grams,” Fundamenta Informaticae XXIII, pp. 1001-1018, 2003.
[9] A. Califano and I. Rigoutsos, “FLASH: Fast Look-Up Algorithm for String Homology,” technical report, IBM T.J. Watson Research Center, 1995.
[10] K.P. Choi and L. Zhang, “Sensitivity Analysis and Efficient Method for Identifying Optimal Spaced Seeds,” J. Computer and System Sciences, vol. 68, pp. 22-40, 2004.
[11] K.P. Choi and L. Zhang, “Analysis of Spaced Seed Technique in Sequence Alignment,” Cosmos, vol. 1, pp. 57-73, 2005.
[12] K.P. Choi, F. Zeng, and L. Zhang, “Good Spaced Seeds for Homology Search,” Bioinformatics, vol. 20, pp. 1053-1059, 2004.
[13] W. Feller, An Introduction to Probability Theory and Its Applications, third ed. Wiley, 1968.
[14] V. Gotea, V. Veeramachaneni, and W. Makalowski, “Mastering Seeds for Genomic Size Nucleotide BLAST Searches,” Nucleic Acids Research, vol. 31, pp. 6935-6941, 2003.
[15] L.J. Guibas and A.M. Odlyzko, “String Overlaps, Pattern Matching, and Nontransitive Games,” J. Combinational Theory, vol. 30, pp. 183-208, 1981.
[16] G.H. Hardy, J.E. Littlewood, and G. Pólya, Inequalities, second ed. Cambridge Univ. Press, 1952.
[17] P. Jacquet and W. Szpankowski, “Analytic Approach to Pattern Matching,” Applied Combinatorics on Words, M. Lothaire, ed., Cambridge Press, 2005.
[18] U. Keich, M. Li, B. Ma, and J. Tromp, “On Spaced Seeds for Similarity Search,” Discrete Applied Math., vol. 3, pp. 253-263, 2004.
[19] W.J. Kent, “BLAT—The BLAST-Like Alignment Tool,” Genome Research, vol. 12, pp. 656-664, 2002.
[20] W.J. Kent and A.M. Zahler, “Conservation, Regulation, Synteny, and Introns in a Large-Scale C. briggsae-C. elegans Genomic Alignment,” Genome Research, vol. 10, pp. 1115-1125, 2000.
[21] Y. Kong, “Methods to Find Optimal Multiple Spaced Seeds for Homology Search Using Generalized Correlations,” manuscript, 2005.
[22] L. Noé and G. Kucherov, “YASS: Enhancing the Sensitivity of DNA Similarity Search,” Nucleic Acids Research, vol. 33, pp. W540-W543, 2005.
[23] G. Kucherov, L. Noé, and M. Roytberg, “A Unifying Framework for Seed Sensitivity and Its Application to Subset Seeds,” Proc. Fifth Workshop Algorithms in Bioinformatics (WABI '05), 2005.
[24] S.Y. Li, “A Martingale Approach to the Study of Occurrence of Sequence Patterns in Repeated Experiments,” Annals of Applied Probability, vol. 8, pp. 1171-1176, 1980.
[25] M. Li and B. Ma, “On the Complexity of Computing the Sensitivity of Spaced Seeds,” manuscript, 2005.
[26] M. Li, B. Ma, D. Kisman, and J. Tromp, “PatternHunterII: Highly Sensitive and Fast Homology Search,” J. Bioinformatics and Computational Biology, vol. 2, pp. 417-440, 2004.
[27] B. Ma, J. Tromp, and M. Li, “PatternHunter-Faster and More Sensitive Homology Search,” Bioinformatics, vol. 18, pp. 440-445, 2002.
[28] P. Nicodéme, B. Salvy, and P. Flajolet, “Motif Statistics,” Lecture Notes in Computer Sciences, vol. 1643, pp. 194-211, 1999.
[29] F.P. Preparata, L. Zhang, and K.P. Choi, “Quick, Practical Selection of Effective Seeds for Homology Search,” J. Computational Biology, vol. 12, pp. 1137-1152, 2005.
[30] G. Reinert, S. Schbath, and M. Waterman, “Probabilistic and Statistical Properties of Words: An Overview,” J. Computational Biology, vol. 7, pp. 1-46, 2000.
[31] S.J. Schwager, “Run Probabilities in Sequences of Markov-Dependent Trials,” J. Am. Statistics Assoc., vol. 78, pp. 168-175, 1983.
[32] S. Schwartz et al., “Human-Mouse Alignment with BLASTZ,” Genome Research, vol. 13, pp. 103-107, 2003.
[33] A.D. Solov'ev, “A Combinatorial Identity and Its Application to the Problem Concerning the First Occurences of a Rare Event,” Theory of Probability and Application, vol. 11, pp. 276-282, 1966.
[34] Y. Sun and J. Buhler, “Designing Multiple Simultaneous Seeds for DNA Similarity Search,” Proc. ACM-SIAM Int'l Conf. Research in Computational Molecular Biology (RECOMB '04), pp. 76-85, 2004.
[35] I.-H. Yang, S.-H. Wang, Y.-H. Chen, P.-H. Huang, L. Ye, X. Huang, and K.-M. Chao, “Efficient Methods for Generating Optimal Single and Multiple Spaced Seeds,” Proc. Fourth IEEE Symp. Bioinformatics and Bioeng., pp. 411-418, 2004.

Index Terms:
Homology search, pattern matching, sequence alignment, spaced seeds, renewal theory, run statistics
Citation:
Louxin Zhang, "Superiority of Spaced Seeds for Homology Search," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 496-505, July-Sept. 2007, doi:10.1109/tcbb.2007.1013
Usage of this product signifies your acceptance of the Terms of Use.