DNA Motif Representation with Nucleotide Dependency
January-March 2008 (vol. 5 no. 1)
pp. 110-119
Discovering novel binding-site motifs is important to the understanding of gene regulatory networks. Motifs are generally represented by matrices (PWM or PSSM) or by strings. However, these representations cannot model biological binding sites well because they fail to capture nucleotide interdependence. Many researchers have pointed out that the nucleotides of a DNA binding site cannot be treated independently, for example in the binding sites of zinc finger proteins. This paper introduces a new representation called the Scored Position Specific Pattern (SPSP), a generalization of the matrix and string representations that takes into account the dependent occurrences of neighboring nucleotides. Although the problem of discovering the optimal motif in the SPSP representation is proved to be NP-hard, we introduce a heuristic algorithm called SPSP-Finder, which can effectively find optimal motifs in most simulated cases and in some real cases for which existing popular motif-finding software, such as Weeder, MEME and AlignACE, fails.
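The limitation of matrix representations mentioned in the abstract can be illustrated with a small sketch. It is not the SPSP representation or the SPSP-Finder algorithm, whose definitions are not given here; the site list and the helper names `build_pwm`, `pwm_prob`, and `pair_prob` are hypothetical and chosen only for illustration. The sketch assumes four aligned sites in which the two middle positions always co-vary (AA or TT), and compares a column-independent matrix score with a joint frequency over the dependent pair.

```python
from collections import Counter

# Hypothetical aligned binding sites: positions 1 and 2 co-vary (AA or TT only).
sites = ["CAAG", "CTTG", "CAAG", "CTTG"]


def build_pwm(seqs):
    """Column-wise nucleotide frequencies (no pseudocounts, for clarity)."""
    pwm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        pwm.append({base: counts[base] / len(seqs) for base in "ACGT"})
    return pwm


def pwm_prob(pwm, seq):
    """Probability of seq when every position is treated independently."""
    p = 1.0
    for column, base in zip(pwm, seq):
        p *= column[base]
    return p


def pair_prob(seqs, seq, i, j):
    """Joint frequency of the nucleotides at positions i and j in the sites."""
    pairs = Counter((s[i], s[j]) for s in seqs)
    return pairs[(seq[i], seq[j])] / len(seqs)


pwm = build_pwm(sites)
for candidate in ("CAAG", "CTTG", "CATG", "CTAG"):
    print(candidate, pwm_prob(pwm, candidate), pair_prob(sites, candidate, 1, 2))
```

Under these assumptions the matrix assigns the same probability to the observed sites (CAAG, CTTG) and to the never-observed mixtures (CATG, CTAG), whereas the joint frequency over the two middle positions separates them. This is the kind of neighboring-nucleotide dependence that a column-independent matrix cannot express.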

Index Terms:
Computing Methodologies, Pattern Recognition, Design Methodology, Pattern analysis
Francis Chin, Henry C.M. Leung, "DNA Motif Representation with Nucleotide Dependency," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 5, no. 1, pp. 110-119, Jan.-March 2008, doi:10.1109/TCBB.2007.70220