Issue No.01 - January-February (2011 vol.8)
pp: 69-79
Cinzia Pizzi, University of Padova, Padova
Pasi Rastas, University of Helsinki, Helsinki
Esko Ukkonen, University of Helsinki, Helsinki
ABSTRACT
Position weight matrices are an important method for modeling signals or motifs in biological sequences, both in DNA and protein contexts. In this paper, we present fast algorithms for the problem of finding significant matches of such matrices. Our algorithms are of the online type, and they generalize classical multipattern matching, filtering, and superalphabet techniques of combinatorial string matching to the problem of weight matrix matching. Several variants of the algorithms are developed, including multiple-matrix extensions that perform the search for several matrices in one scan through the sequence database. Experimental performance evaluation is provided to compare the new techniques against each other as well as against some other online and index-based algorithms proposed in the literature. Compared to the brute-force O(mn) approach, our solutions can be faster by a factor that is proportional to the matrix length m. Our multiple-matrix filtration algorithm had the best performance in the experiments. On a current PC, this algorithm finds significant matches (p = 0.0001) of the 123 JASPAR matrices in the human genome in about 18 minutes.
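To make the baseline concrete, the brute-force O(mn) approach mentioned in the abstract can be sketched as follows. This is only an illustrative sketch of the naive scan, not the authors' filtration or superalphabet algorithms. The function name `scan_pwm`, the dictionary-per-column matrix layout, the toy scores, and the fixed score threshold are all assumptions made for this example; in practice the threshold would be derived from the chosen p-value.

```python
from typing import Dict, List, Tuple


def scan_pwm(
    matrix: List[Dict[str, float]],
    seq: str,
    threshold: float,
) -> List[Tuple[int, float]]:
    """Naive O(mn) scan: score every length-m window of seq.

    matrix[j][c] is the score of symbol c at matrix column j
    (hypothetical layout chosen for this sketch). Assumes every
    symbol of seq has an entry in each column.
    """
    m = len(matrix)
    hits = []
    for i in range(len(seq) - m + 1):      # about n window positions
        score = 0.0
        for j in range(m):                 # m matrix columns per window
            score += matrix[j][seq[i + j]]
        if score >= threshold:             # report windows reaching the threshold
            hits.append((i, score))
    return hits


# Toy matrix of length m = 3 that favors the word "ACG" (made-up scores).
toy_matrix = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]
hits = scan_pwm(toy_matrix, "TTACGACG", threshold=6)
print(hits)
```

Every window costs m additions regardless of how poorly it scores, which is the work the faster algorithms described in the abstract aim to avoid.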
INDEX TERMS
Position weight matrices, position-specific scoring matrices, profiles, pattern search, string matching.
CITATION
Cinzia Pizzi, Pasi Rastas, Esko Ukkonen, "Finding Significant Matches of Position Weight Matrices in Linear Time", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 1, pp. 69-79, January-February 2011, doi:10.1109/TCBB.2009.35