Issue No.08 - Aug. (2012 vol.24)

pp: 1408-1421

Dennis Zhuang, University of Waterloo, Waterloo

Gary C.L. Li, University of Waterloo, Waterloo

Andrew K.C. Wong, University of Waterloo, Waterloo

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.100

ABSTRACT

Discovering patterns in sequence data has a significant impact on many areas of science and society, especially genomics and proteomics. Here we take multiple strings as the input sequence data and substrings as the patterns. In practice, a large set of patterns can usually be discovered, yet many of them are redundant, which degrades the output quality. This paper improves the output quality by removing two types of redundant patterns. First, the notion of delta-tolerance closed itemsets is employed to remove redundant patterns that are not delta closed. Second, the concept of statistically induced patterns is proposed to capture redundant patterns that appear statistically significant only because their significance is induced by strongly significant subpatterns. Mining these nonredundant patterns (delta closed patterns and noninduced patterns) is computationally intensive. To discover them efficiently in very large sequence data, two algorithms have been developed through an innovative use of the suffix tree. Three sets of experiments were conducted to evaluate their performance, and the algorithms render excellent results when applied to genomic data. The experiments confirm that the proposed algorithms are efficient and that they produce a relatively small set of patterns which reveal interesting information in the sequences.
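To make the first redundancy filter concrete, here is a minimal Python sketch of delta closed substring patterns. It is an illustration under stated assumptions, not the paper's algorithm. It assumes that a substring is delta closed when no proper super-pattern keeps at least a (1 - delta) fraction of its occurrence count, a definition adapted from the delta-tolerance closed itemset idea and possibly different in detail from the one used in the paper. It also counts substrings by brute force rather than with a suffix tree, so it is only suitable for toy inputs. The function names `count_substrings` and `delta_closed_patterns` are hypothetical.

```python
from collections import Counter


def count_substrings(sequences):
    """Count occurrences (by start position) of every substring in all sequences.

    Brute force, quadratic in sequence length; a stand-in for the
    suffix-tree counting mentioned in the abstract.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq)):
            for j in range(i + 1, len(seq) + 1):
                counts[seq[i:j]] += 1
    return counts


def delta_closed_patterns(sequences, min_count, delta):
    """Return substrings that are frequent and delta closed (assumed definition).

    A pattern P is kept when count(P) >= min_count and every proper
    super-pattern Q satisfies count(Q) < (1 - delta) * count(P).
    Occurrence counts never increase when a pattern is extended, so it
    is enough to check extensions by one symbol on the left or right.
    """
    counts = count_substrings(sequences)

    # For each pattern, record the largest count among its one-symbol extensions.
    best_super = Counter()
    for pattern, c in counts.items():
        if len(pattern) > 1:
            for shorter in (pattern[:-1], pattern[1:]):
                best_super[shorter] = max(best_super[shorter], c)

    return sorted(
        p for p, c in counts.items()
        if c >= min_count and best_super[p] < (1 - delta) * c
    )


if __name__ == "__main__":
    data = ["ABCABC", "ABD"]
    print(delta_closed_patterns(data, min_count=2, delta=0.0))
    print(delta_closed_patterns(data, min_count=2, delta=0.4))
```

With `delta=0.0` the filter reduces to ordinary closed patterns. A larger `delta` also discards patterns whose count is only slightly higher than that of a longer pattern containing them, which shrinks the output further.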

INDEX TERMS

Sequence pattern discovery, delta closed patterns, statistically induced patterns, suffix tree.

CITATION

Dennis Zhuang, Gary C.L. Li, Andrew K.C. Wong, "Discovery of Delta Closed Patterns and Noninduced Patterns from Sequences",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 24, no. 8, pp. 1408-1421, Aug. 2012, doi:10.1109/TKDE.2011.100