Issue No.04 - October-December (2010 vol.7)
pp: 654-668
Gang Li, Chinese University of Hong Kong, Hong Kong
Tak-Ming Chan, Chinese University of Hong Kong, Hong Kong
Kwong-Sak Leung, Chinese University of Hong Kong, Hong Kong
Kin-Hong Lee, Chinese University of Hong Kong, Hong Kong
ABSTRACT
Finding transcription factor binding sites, i.e., motif discovery, is crucial for understanding gene regulatory relationships. Motifs are weakly conserved, and motif discovery is an NP-hard problem. We propose a new approach called Cluster Refinement Algorithm for Motif Discovery (CRMD). CRMD employs a flexible statistical motif model that allows a variable number of motifs and motif instances. CRMD first uses a novel entropy-based clustering to find complete and good starting candidate motifs from the DNA sequences. It then employs an effective greedy refinement to search for optimal motifs among the candidate motifs. The refinement is fast, and it changes the number of motif instances based on adaptive thresholds. The performance of CRMD is further enhanced if the problem has one occurrence of a motif instance per sequence. Using an appropriate similarity test of motifs, CRMD is also able to find multiple motifs. CRMD has been tested extensively on synthetic and real data sets. The experimental results verify that CRMD usually outperforms four other state-of-the-art algorithms in terms of solution quality, with competitive computing time. It strikes a good balance between finding true motif instances and screening out false ones, and is robust on problems of various levels of difficulty.
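The abstract describes the method only at a high level, so the following is a minimal illustrative sketch of the general idea of entropy-scored greedy refinement, not CRMD itself. It assumes a fixed motif width, a uniform background, exactly one motif instance per sequence, and seed instances supplied by hand in place of the entropy-based clustering step; it omits the adaptive thresholds and the multiple-motif similarity test. All function names and parameters are hypothetical.

```python
import math

ALPHABET = "ACGT"

def position_frequencies(instances, pseudocount=1.0):
    """Column-wise base frequencies of equal-length motif instances."""
    width = len(instances[0])
    freqs = []
    for col in range(width):
        counts = {base: pseudocount for base in ALPHABET}
        for inst in instances:
            counts[inst[col]] += 1
        total = sum(counts.values())
        freqs.append({base: counts[base] / total for base in ALPHABET})
    return freqs

def information_content(freqs, background=0.25):
    """Relative entropy in bits versus a uniform background, summed over columns."""
    return sum(p * math.log2(p / background)
               for col in freqs for p in col.values())

def log_odds(window, freqs, background=0.25):
    """Log-odds score of one candidate window under the current motif model."""
    return sum(math.log2(freqs[i][base] / background)
               for i, base in enumerate(window))

def best_window(seq, freqs):
    """Highest-scoring window of the motif width in one sequence."""
    width = len(freqs)
    windows = (seq[i:i + width] for i in range(len(seq) - width + 1))
    return max(windows, key=lambda w: log_odds(w, freqs))

def greedy_refine(sequences, seed_instances, max_iters=20):
    """Assumed refinement loop: rebuild the model, re-pick the best window
    in every sequence, and stop once the chosen instances no longer change."""
    instances = list(seed_instances)
    for _ in range(max_iters):
        freqs = position_frequencies(instances)
        updated = [best_window(seq, freqs) for seq in sequences]
        if updated == instances:
            break
        instances = updated
    return instances, information_content(position_frequencies(instances))

if __name__ == "__main__":
    # Toy sequences, each with one planted copy of TGACTCA.
    seqs = ["ACGTTGACTCAGG", "TTGACTCACCGTA", "GGCATGACTCATT", "CCTGACTCAAGGC"]
    motif_instances, score = greedy_refine(seqs, ["TGACTCA", "TGACTCA"])
    print(motif_instances, round(score, 3))
```

The information content here serves only as a summary score for the final alignment. A fuller treatment would also let the number of instances per sequence vary, as the abstract says CRMD does.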
INDEX TERMS
Transcription factor binding site, motif discovery.
CITATION
Gang Li, Tak-Ming Chan, Kwong-Sak Leung, Kin-Hong Lee, "A Cluster Refinement Algorithm for Motif Discovery", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.7, no. 4, pp. 654-668, October-December 2010, doi:10.1109/TCBB.2009.25