Issue No.08 - August (2011 vol.23)
pp: 1154-1168
ABSTRACT
Existing sequence mining algorithms mostly focus on mining for subsequences. However, a large class of applications, such as biological DNA and protein motif mining, requires efficient mining of “approximate” patterns that are contiguous. The few existing algorithms that can be applied to such contiguous approximate pattern mining have drawbacks such as poor scalability, no guarantee of finding the pattern, and difficulty in adapting to other applications. In this paper, we present a new algorithm called FLexible and Accurate Motif DEtector (FLAME). FLAME is a flexible suffix-tree-based algorithm that can be used to find frequent patterns under a variety of definitions of motif (pattern) models. It is also accurate, as it always finds the pattern if it exists. Using both real and synthetic data sets, we demonstrate that FLAME is fast and scalable, and that it outperforms existing algorithms on a variety of performance metrics. In addition, based on FLAME, we also address a more general problem, named extended structured motif extraction, which allows mining frequent combinations of motifs under relaxed constraints.
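To make the notion of a contiguous "approximate" pattern concrete, the following is a minimal sketch under one common motif model, assumed here for illustration only: a pattern of fixed length occurs at a position if the window starting there differs from it in at most `d` characters (Hamming distance). This is not the FLAME algorithm; it is a brute-force enumeration with no suffix tree and no pruning, and the function names `hamming`, `approximate_support`, and `frequent_motifs` are hypothetical.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def approximate_support(pattern: str, sequence: str, d: int) -> int:
    """Count contiguous windows of `sequence` within Hamming distance d of `pattern`."""
    length = len(pattern)
    return sum(
        1
        for i in range(len(sequence) - length + 1)
        if hamming(pattern, sequence[i:i + length]) <= d
    )


def frequent_motifs(sequence: str, length: int, d: int,
                    min_support: int, alphabet: str = "ACGT") -> dict:
    """Exhaustively test every candidate pattern of the given length.

    Assumption: brute force over alphabet**length candidates, so it is
    only practical for short patterns; it illustrates the problem
    statement, not an efficient solution.
    """
    result = {}
    for candidate in product(alphabet, repeat=length):
        pattern = "".join(candidate)
        support = approximate_support(pattern, sequence, d)
        if support >= min_support:
            result[pattern] = support
    return result


if __name__ == "__main__":
    seq = "ACGTACGAACGT"
    # Allowing one mismatch also counts the window "ACGA" as an occurrence of "ACGT".
    print(approximate_support("ACGT", seq, d=0))
    print(approximate_support("ACGT", seq, d=1))
    print(frequent_motifs(seq, length=4, d=1, min_support=3))
```

Because every candidate is checked, the sketch never misses a pattern that meets the support threshold, but its cost grows exponentially with pattern length, which is the kind of scalability problem the abstract attributes to existing approaches.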
INDEX TERMS
trees (mathematics), data mining, suffix-tree-based algorithm, sequence data sets, sequence mining algorithms, flexible and accurate motif detector, FLAME, Computational modeling, Fires, Data mining, Biological system modeling, Approximation algorithms, DNA, Proteins, suffix tree, Motif, sequence mining
CITATION
"Efficient and Accurate Discovery of Patterns in Sequence Data Sets", IEEE Transactions on Knowledge & Data Engineering, vol.23, no. 8, pp. 1154-1168, August 2011, doi:10.1109/TKDE.2011.69