Issue No. 08 - Aug. (2012 vol. 24)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.100
Andrew K.C. Wong, University of Waterloo, Waterloo
Dennis Zhuang, University of Waterloo, Waterloo
Gary C.L. Li, University of Waterloo, Waterloo
En-Shiun Annie Lee, University of Waterloo, Waterloo
Discovering patterns in sequence data has a significant impact on many areas of science and society, especially genomics and proteomics. Here we consider multiple strings as the input sequence data and substrings as patterns. In practice, a large set of patterns can usually be discovered, yet many of them are redundant, which degrades the output quality. This paper improves the output quality by removing two types of redundant patterns. First, the notion of the delta tolerance closed itemset is employed to remove redundant patterns that are not delta closed. Second, the concept of statistically induced patterns is proposed to capture redundant patterns that appear statistically significant only because their significance is induced by strongly significant subpatterns. Mining these nonredundant patterns (delta closed patterns and noninduced patterns) is computationally intensive. To discover them efficiently in very large sequence data, two algorithms have been developed through an innovative use of the suffix tree. Three sets of experiments were conducted to evaluate their performance, and the algorithms render excellent results when applied to genomic data. The experiments confirm that the proposed algorithms are efficient and that they produce a relatively small set of patterns that reveal interesting information in the sequences.
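As a rough illustration of the first redundancy filter, the sketch below counts substring occurrences by brute force and keeps only the delta closed ones. It is not the paper's suffix-tree algorithm, and it rests on an assumed definition borrowed from delta tolerance closed itemsets: a pattern is treated as delta closed when no superpattern has support of at least (1 - delta) times the pattern's own support. The function names, the occurrence-count notion of support, and the `min_support` parameter are hypothetical choices made for this example.

```python
from collections import defaultdict


def substring_support(sequences):
    """Count every occurrence of every substring across all sequences.

    Brute force, quadratic in sequence length; a suffix tree would be
    used for large inputs. Support here means occurrence count (assumption).
    """
    counts = defaultdict(int)
    for seq in sequences:
        for i in range(len(seq)):
            for j in range(i + 1, len(seq) + 1):
                counts[seq[i:j]] += 1
    return dict(counts)


def delta_closed(counts, delta, min_support=2):
    """Keep patterns that are delta closed under the assumed definition.

    Only one-character extensions need checking: any longer superpattern
    contains such an extension, and support never grows with length.
    """
    # Highest support among the one-character extensions of each pattern.
    best_ext = defaultdict(int)
    for pat, c in counts.items():
        if len(pat) > 1:
            for sub in (pat[:-1], pat[1:]):
                best_ext[sub] = max(best_ext[sub], c)
    return {
        pat: c
        for pat, c in counts.items()
        if c >= min_support and best_ext[pat] < (1 - delta) * c
    }


if __name__ == "__main__":
    support = substring_support(["abcabc", "abcab"])
    print(delta_closed(support, delta=0.0))
    print(delta_closed(support, delta=0.3))
```

With `delta=0.0` the filter reduces to ordinary closed patterns; raising `delta` also discards patterns whose support is only slightly higher than that of one of their extensions.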
Sequence pattern discovery, delta closed patterns, statistically induced patterns, suffix tree.
E. A. Lee, G. C. Li, D. Zhuang and A. K. Wong, "Discovery of Delta Closed Patterns and Noninduced Patterns from Sequences," in IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 8, pp. 1408-1421, 2012.