2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (2014)
Belfast, United Kingdom
Nov. 2, 2014 to Nov. 5, 2014
ISBN: 978-1-4799-5669-2
pp: 397-402
Qiang Yu, School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
Hongwei Huo, School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
Xiaoyang Chen, School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
Haitao Guo, School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
Jeffrey Scott Vitter, Information and Telecommunication Technology Center, The University of Kansas, Lawrence, 66047, USA
Jun Huan, Information and Telecommunication Technology Center, The University of Kansas, Lawrence, 66047, USA
ABSTRACT
Planted (l, d) motif discovery has been used successfully over the past decade to locate transcription factor binding sites in sets of dozens of promoter sequences. However, little work has addressed identifying (l, d) motifs in next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and therefore make accurate identification in reasonable time a new challenge. To meet this need, we propose a new planted (l, d) motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. Specifically, to handle larger data sets, we design a MapReduce-based strategy that mines emerging substrings in a distributed manner. Experimental results on simulated data show that i) MCES identifies (l, d) motifs efficiently and effectively in thousands to millions of input sequences and runs faster than state-of-the-art (l, d) motif discovery algorithms such as F-motif and TraverStringsR; and ii) MCES can identify motifs whose lengths are not known in advance and achieves better identification accuracy than the competing algorithm CisFinder. The validity of MCES is also tested on real data sets.
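For readers unfamiliar with the problem statement, the following is a minimal Python sketch of what it means for a candidate string to be an (l, d) motif instance in a set of sequences: some window of length l lies within Hamming distance d of the candidate. This illustrates only the standard problem definition and is not the MCES algorithm. The function names (`hamming`, `has_instance`, `support`) and the toy sequences are hypothetical and chosen purely for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def has_instance(seq: str, motif: str, d: int) -> bool:
    """True if some length-l window of seq is within Hamming distance d of motif."""
    l = len(motif)
    return any(hamming(seq[i:i + l], motif) <= d
               for i in range(len(seq) - l + 1))


def support(seqs: list[str], motif: str, d: int) -> float:
    """Fraction of sequences that contain at least one instance of motif."""
    return sum(has_instance(s, motif, d) for s in seqs) / len(seqs)


# Toy input, made up for illustration: the first two sequences contain
# "ACGTAC" with at most one mismatch; the third does not.
seqs = ["TTACGTACGG", "GGACGAACTT", "CCCCCCCCCC"]
print(support(seqs, "ACGTAC", 1))
```

This brute-force check scans every window of every sequence, which is the kind of cost that becomes impractical on the thousands to millions of sequences the abstract refers to.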
INDEX TERMS
Pulse width modulation, Algorithm design and analysis, Data mining, DNA, Clustering algorithms, Dispersion, Accuracy
CITATION

Q. Yu, H. Huo, X. Chen, H. Guo, J. S. Vitter and J. Huan, "An efficient motif finding algorithm for large DNA data sets," 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Belfast, United Kingdom, 2014, pp. 397-402.
doi:10.1109/BIBM.2014.6999191