Optimizing Multiple Seeds for Protein Homology Search
January-March 2005 (vol. 2 no. 1)
pp. 29-38
We present a framework for improving local protein alignment algorithms. Specifically, we discuss how to extend local protein aligners to use a collection of vector seeds or ungapped alignment seeds to reduce noise hits. We model the choice of a set of seed models as an integer programming problem and give algorithms for choosing such a set. Although the problem is NP-hard, and quasi-NP-hard to approximate to within a logarithmic factor, it can be solved easily in practice. A good set of seeds that we have chosen yields four to five times fewer false positive hits than BLASTP while preserving essentially the same sensitivity.
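The selection problem described above has the shape of a weighted set cover: each candidate seed detects some set of true alignments and carries a false positive cost, and the goal is to detect the alignments as cheaply as possible. The sketch below is a standard greedy set-cover heuristic, not the integer programming algorithm of the paper. The function name `greedy_seed_cover`, the `hits` and `cost` inputs, and the `min_fraction` parameter are all hypothetical and chosen for illustration only.

```python
import math


def greedy_seed_cover(hits, cost, min_fraction=1.0):
    """Greedy weighted set-cover heuristic for choosing seeds (illustrative only).

    hits: dict mapping a seed name to the set of alignment ids it detects.
    cost: dict mapping a seed name to its (positive) false positive cost.
    min_fraction: fraction of the detectable alignments that must be covered.
    Returns (chosen seeds in selection order, set of covered alignment ids).
    """
    # The universe is every alignment that at least one candidate seed detects.
    universe = set().union(*hits.values()) if hits else set()
    target = math.ceil(min_fraction * len(universe))

    covered = set()
    chosen = []
    while len(covered) < target:
        # Candidates are unchosen seeds that still detect something new.
        candidates = [s for s in hits if s not in chosen and hits[s] - covered]
        if not candidates:
            break
        # Pick the seed with the most newly covered alignments per unit cost.
        best = max(candidates, key=lambda s: len(hits[s] - covered) / cost[s])
        chosen.append(best)
        covered |= hits[best]
    return chosen, covered


# Toy data (assumed, not from the paper): four seeds, four alignments.
hits = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4}, "D": {1, 2, 3, 4}}
cost = {"A": 1.0, "B": 1.0, "C": 0.5, "D": 10.0}
chosen, covered = greedy_seed_cover(hits, cost)
total_cost = sum(cost[s] for s in chosen)
```

On this toy input the heuristic avoids the single expensive seed that detects everything and instead combines two cheap seeds. The greedy rule gives a logarithmic-factor approximation guarantee for set cover in general, which matches the hardness threshold mentioned in the abstract.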

Index Terms:
Bioinformatics database applications, similarity measures, biology and genetics.
Citation:
Daniel G. Brown, "Optimizing Multiple Seeds for Protein Homology Search," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 1, pp. 29-38, Jan.-March 2005, doi:10.1109/TCBB.2005.13