Multiseed Lossless Filtration
January-March 2005 (vol. 2 no. 1)
pp. 51-61
We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt and Kärkkäinen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database.
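
To make the notion of a lossless seed family concrete, here is a minimal brute-force sketch. It is not the algorithm from the paper, which (per the index terms) relies on dynamic programming. It assumes the usual spaced-seed notation, where '#' marks a position that must match and '-' is a don't-care position, and it assumes an (m, k) problem means detecting every length-m match that contains at most k mismatches. The function names seed_offsets and is_lossless are illustrative only.

```python
from itertools import combinations


def seed_offsets(seed):
    """Return the offsets of the '#' (must-match) positions of a spaced seed."""
    return [i for i, c in enumerate(seed) if c == "#"]


def is_lossless(seeds, m, k):
    """Check by exhaustive enumeration whether a seed family solves (m, k).

    The family is lossless if, for every placement of k mismatches among
    m positions, at least one seed can be placed so that all of its '#'
    positions fall on matches. Checking exactly k mismatches is enough
    (assuming k <= m): a seed that avoids a set of k mismatch positions
    also avoids any subset of it.
    The cost grows as C(m, k), so this is only usable for small m and k.
    """
    family = [(len(s), seed_offsets(s)) for s in seeds]
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        hit = any(
            all(start + off not in bad for off in offsets)
            for span, offsets in family
            for start in range(m - span + 1)
        )
        if not hit:
            return False  # this mismatch pattern is missed by every seed
    return True
```

On a toy instance with m = 3 and k = 1, neither is_lossless(["##"], 3, 1) nor is_lossless(["#-#"], 3, 1) holds, but is_lossless(["##", "#-#"], 3, 1) does: the two seeds of weight 2 cover each other's blind spots, which is the effect that using several seeds simultaneously is meant to exploit.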

[1] S. Burkhardt and J. Kärkkäinen, “Better Filtering with Gapped q-Grams,” Fundamenta Informaticae, vol. 56, nos. 1-2, pp. 51-70, 2003, preliminary version in Combinatorial Pattern Matching 2001.
[2] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences. Cambridge Univ. Press, 2002.
[3] S. Altschul, T. Madden, A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.
[4] B. Ma, J. Tromp, and M. Li, “PatternHunter: Faster and More Sensitive Homology Search,” Bioinformatics, vol. 18, no. 3, pp. 440-445, 2002.
[5] S. Schwartz, J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. Hardison, D. Haussler, and W. Miller, “Human-Mouse Alignments with BLASTZ,” Genome Research, vol. 13, pp. 103-107, 2003.
[6] L. Noé and G. Kucherov, “Improved Hit Criteria for DNA Local Alignment,” BMC Bioinformatics, vol. 5, no. 149, Oct. 2004.
[7] P. Pevzner and M. Waterman, “Multiple Filtration and Approximate Pattern Matching,” Algorithmica, vol. 13, pp. 135-154, 1995.
[8] A. Califano and I. Rigoutsos, “Flash: A Fast Look-Up Algorithm for String Homology,” Proc. First Int'l Conf. Intelligent Systems for Molecular Biology, pp. 56-64, July 1993.
[9] J. Buhler, “Provably Sensitive Indexing Strategies for Biosequence Similarity Search,” Proc. Sixth Ann. Int'l Conf. Computational Molecular Biology (RECOMB '02), pp. 90-99, Apr. 2002.
[10] U. Keich, M. Li, B. Ma, and J. Tromp, “On Spaced Seeds for Similarity Search,” Discrete Applied Math., vol. 138, no. 3, pp. 253-263, 2004.
[11] J. Buhler, U. Keich, and Y. Sun, “Designing Seeds for Similarity Search in Genomic DNA,” Proc. Seventh Ann. Int'l Conf. Computational Molecular Biology (RECOMB '03), pp. 67-75, Apr. 2003.
[12] B. Brejova, D. Brown, and T. Vinar, “Vector Seeds: An Extension to Spaced Seeds Allows Substantial Improvements in Sensitivity and Specificity,” Proc. Third Int'l Workshop Algorithms in Bioinformatics (WABI), pp. 39-54, Sept. 2003.
[13] G. Kucherov, L. Noé, and Y. Ponty, “Estimating Seed Sensitivity on Homogeneous Alignments,” Proc. IEEE Fourth Symp. Bioinformatics and Bioeng. (BIBE 2004), May 2004.
[14] K. Choi and L. Zhang, “Sensitivity Analysis and Efficient Method for Identifying Optimal Spaced Seeds,” J. Computer and System Sciences, vol. 68, pp. 22-40, 2004.
[15] M. Csürös, “Performing Local Similarity Searches with Variable Length Seeds,” Proc. 15th Ann. Combinatorial Pattern Matching Symp. (CPM), pp. 373-387, 2004.
[16] M. Li, B. Ma, D. Kisman, and J. Tromp, “PatternHunter II: Highly Sensitive and Fast Homology Search,” J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 417-440, Sept. 2004.
[17] Y. Sun and J. Buhler, “Designing Multiple Simultaneous Seeds for DNA Similarity Search,” Proc. Eighth Ann. Int'l Conf. Research in Computational Molecular Biology (RECOMB 2004), pp. 76-84, Mar. 2004.
[18] D.G. Brown, “Multiple Vector Seeds for Protein Alignment,” Proc. Fourth Int'l Workshop Algorithms in Bioinformatics (WABI), pp. 170-181, Sept. 2004.
[19] J. Xu, D. Brown, M. Li, and B. Ma, “Optimizing Multiple Spaced Seeds for Homology Search,” Proc. 15th Symp. Combinatorial Pattern Matching, pp. 47-58, 2004.
[20] J. Oommen and J. Dong, “Generalized Swap-with-Parent Schemes for Self-Organizing Sequential Linear Lists,” Proc. 1997 Int'l Symp. Algorithms and Computation (ISAAC '97), pp. 414-423, Dec. 1997.
[21] F. Li and G. Stormo, “Selection of Optimal DNA Oligos for Gene Expression Arrays,” Bioinformatics, vol. 17, pp. 1067-1076, 2001.
[22] L. Kaderali and A. Schliep, “Selecting Signature Oligonucleotides to Identify Organisms Using DNA Arrays,” Bioinformatics, vol. 18, no. 10, pp. 1340-1349, 2002.
[23] S. Rahmann, “Fast Large Scale Oligonucleotide Selection Using the Longest Common Factor Approach,” J. Bioinformatics and Computational Biology, vol. 1, no. 2, pp. 343-361, 2003.
[24] J. Zheng, T. Close, T. Jiang, and S. Lonardi, “Efficient Selection of Unique and Popular Oligos for Large EST Databases,” Proc. 14th Ann. Combinatorial Pattern Matching Symp. (CPM), pp. 273-283, 2003.
[25] S. Burkhardt and J. Kärkkäinen, “One-Gapped q-Gram Filters for Levenshtein Distance,” Proc. 13th Symp. Combinatorial Pattern Matching (CPM '02), vol. 2373, pp. 225-234, 2002.

Index Terms:
Filtration, string matching, gapped seed, gapped q-gram, local alignment, sequence similarity, seed family, multiple spaced seeds, dynamic programming, EST, oligonucleotide selection.
Gregory Kucherov, Laurent Noé, Mikhail Roytberg, "Multiseed Lossless Filtration," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 1, pp. 51-61, Jan.-March 2005, doi:10.1109/TCBB.2005.12