An Efficient Algorithm for the Identification of Structured Motifs in DNA Promoter Sequences
April-June 2006 (vol. 3 no. 2)
pp. 126-140
We propose a new algorithm for identifying cis-regulatory modules in genomic sequences. The proposed algorithm, named RISO, uses a new data structure, called box-link, to store information about conserved regions that occur in a well-ordered and regularly spaced manner in the data set sequences. Conserved regions of this type, called structured motifs, are extremely relevant to research on gene regulatory mechanisms because they can effectively represent promoter models. The complexity analysis shows a time and space gain over the best known exact algorithms that is exponential in the spacings between binding sites. A full implementation of the algorithm was developed and made available online. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than four orders of magnitude. Applying the method to biological data sets shows its ability to extract relevant consensus sequences.
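To make the notion of a structured motif concrete, the sketch below checks by brute force whether an ordered list of boxes, separated by spacings within given ranges, occurs in a sequence. It is not RISO and does not use the box-link structure; it only illustrates the kind of pattern being searched for. The function names, the per-box mismatch bound, and the quorum parameter are assumptions made for this illustration and are not taken from the abstract.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs(seq: str, boxes: List[str], gaps: List[Tuple[int, int]],
           max_mismatch: int) -> bool:
    """Return True if the boxes occur in order in seq.

    gaps[i] = (dmin, dmax) is the allowed spacing between box i and box i+1.
    Each box may differ from the sequence in at most max_mismatch positions
    (an assumed error model for this sketch).
    """
    def match_from(pos: int, i: int) -> bool:
        box = boxes[i]
        end = pos + len(box)
        if end > len(seq) or hamming(seq[pos:end], box) > max_mismatch:
            return False
        if i == len(boxes) - 1:
            return True
        dmin, dmax = gaps[i]
        # Try every allowed spacing before the next box.
        return any(match_from(end + d, i + 1) for d in range(dmin, dmax + 1))

    return any(match_from(p, 0) for p in range(len(seq)))


def satisfies_quorum(seqs: List[str], boxes: List[str],
                     gaps: List[Tuple[int, int]], max_mismatch: int,
                     quorum: int) -> bool:
    """True if the structured motif occurs in at least `quorum` sequences."""
    return sum(occurs(s, boxes, gaps, max_mismatch) for s in seqs) >= quorum
```

For instance, calling occurs with boxes ["TTGACA", "TATAAT"] and gaps [(3, 5)] asks whether the two boxes appear in that order with 3 to 5 intervening bases. This naive check tries every start position and every spacing, which is the cost that grows with the spacing range and that exact algorithms for this problem aim to reduce.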

Index Terms:
Box-link, factor tree, structured motif, promoter, binding site consensus.
Alexandra M. Carvalho, Ana T. Freitas, Arlindo L. Oliveira, Marie-France Sagot, "An Efficient Algorithm for the Identification of Structured Motifs in DNA Promoter Sequences," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 3, no. 2, pp. 126-140, April-June 2006, doi:10.1109/TCBB.2006.16