Issue No. 04 - October-December (2010 vol. 7)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TCBB.2009.25
Gang Li, Chinese University of Hong Kong, Hong Kong
Tak-Ming Chan, Chinese University of Hong Kong, Hong Kong
Kwong-Sak Leung, Chinese University of Hong Kong, Hong Kong
Kin-Hong Lee, Chinese University of Hong Kong, Hong Kong
Finding Transcription Factor Binding Sites, i.e., motif discovery, is crucial for understanding gene regulatory relationships. Motifs are weakly conserved, and motif discovery is an NP-hard problem. We propose a new approach called Cluster Refinement Algorithm for Motif Discovery (CRMD). CRMD employs a flexible statistical motif model that allows a variable number of motifs and motif instances. CRMD first uses a novel entropy-based clustering to find a complete set of good starting candidate motifs from the DNA sequences. It then employs an effective greedy refinement to search for optimal motifs among the candidates. The refinement is fast, and it changes the number of motif instances based on adaptive thresholds. The performance of CRMD is further enhanced when the problem has exactly one motif instance per sequence. Using an appropriate motif similarity test, CRMD is also able to find multiple motifs. CRMD has been tested extensively on synthetic and real data sets. The experimental results verify that CRMD usually outperforms four other state-of-the-art algorithms in solution quality, with competitive computing time. It strikes a good balance between finding true motif instances and screening out false ones, and is robust on problems of various levels of difficulty.
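The abstract does not define the entropy measure used by CRMD's clustering step. As a rough illustration only, the sketch below computes the standard per-column Shannon entropy of a set of aligned, equal-length motif instances, a common way to quantify how well conserved a candidate motif is. The function names `column_entropy` and `motif_entropy` are hypothetical and are not taken from the paper; this is not the CRMD algorithm itself.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def motif_entropy(instances):
    """Sum of per-column entropies over aligned, equal-length instances.

    Assumption: instances are already aligned without gaps. Lower values
    mean the instances agree more closely, i.e. a more conserved motif.
    """
    if not instances:
        raise ValueError("at least one instance is required")
    width = len(instances[0])
    if any(len(s) != width for s in instances):
        raise ValueError("all instances must have the same length")
    return sum(
        column_entropy([s[i] for s in instances]) for i in range(width)
    )
```

Under this measure, a set of identical instances scores 0 bits, and each column split evenly between two nucleotides contributes 1 bit. A clustering or refinement procedure could, in principle, prefer candidate sets with lower total entropy, though how CRMD actually scores and thresholds candidates is described only in the full paper.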
Clustering algorithms, Sequences, Testing, DNA, Databases, NP-hard problem, Robustness, Evolution (biology), Organisms, Biology computing
G. Li, T. Chan, K. Leung and K. Lee, "A Cluster Refinement Algorithm for Motif Discovery," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 4, pp. 654-668, 2010.