Bases of Motifs for Generating Repeated Patterns with Wild Cards
January-March 2005 (vol. 2 no. 1)
pp. 40-50
Motif inference represents one of the most important areas of research in computational biology, and one of its oldest. Despite this, the problem remains very much open in the sense that no existing definition is fully satisfying, either in formal terms or in relation to the biological questions that involve finding such motifs. Two main types of motifs have been considered in the literature: matrices (of letter frequency per position in the motif) and patterns. There is no conclusive evidence in favor of either, and recent work has attempted to integrate the two types into a single model. In this paper, we address the formal issue in relation to motifs as patterns. This is essential to gain a better understanding of motifs in general. In particular, we consider a promising idea that was recently proposed, which attempted to avoid the combinatorial explosion in the number of motifs by means of a generator set for the motifs. Instead of exhibiting a complete list of motifs satisfying some input constraints, what is produced is a basis of such motifs from which all the other ones can be generated. We study the computational cost of determining such a basis of repeated motifs with wild cards in a sequence. We give new upper and lower bounds on such a cost, introducing a notion of basis that is provably contained in (and, thus, smaller than) previously defined ones. Our basis can be computed in less time and space, and is still able to generate the same set of motifs. We also prove that the number of motifs in all bases defined so far grows exponentially with the quorum, that is, with the minimal number of times a motif must appear in a sequence, something unnoticed in previous work. We show that there is no hope of efficiently computing such bases unless the quorum is fixed.
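
To make the notions of a motif with wild cards and of the quorum concrete, here is a minimal sketch in Python. It is not the basis construction studied in the paper: it only checks, by naive scanning, where a given pattern occurs in a sequence and whether it occurs at least quorum times. The names `WILDCARD`, `occurrences`, and `satisfies_quorum`, and the choice of `.` as the wild-card symbol, are assumptions made for illustration.

```python
# Illustrative sketch only: naive occurrence counting for a pattern with
# wild cards. The symbol "." and all function names are assumptions,
# not notation taken from the paper.
WILDCARD = "."


def occurrences(motif, text):
    """Return the start positions in text where motif matches.

    A WILDCARD position in the motif matches any single letter;
    every other position must match the text exactly.
    """
    m = len(motif)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(p == WILDCARD or p == text[i + j] for j, p in enumerate(motif))
    ]


def satisfies_quorum(motif, text, quorum):
    """True if motif appears at least `quorum` times in text."""
    return len(occurrences(motif, text)) >= quorum
```

For instance, under these assumptions the pattern `A.C` occurs in `ABCAXCAC` at positions 0 and 3, so it meets a quorum of 2 but not a quorum of 3. Listing every pattern that meets the quorum in this way is exactly the enumeration whose size a basis is meant to avoid.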

Index Terms:
Motifs basis, repeated motifs.
Nadia Pisanti, Maxime Crochemore, Roberto Grossi, Marie-France Sagot, "Bases of Motifs for Generating Repeated Patterns with Wild Cards," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 1, pp. 40-50, Jan.-March 2005, doi:10.1109/TCBB.2005.5