Mining Loosely Structured Motifs from Biological Data
November 2008 (vol. 20 no. 11)
pp. 1472-1489
Fabio Fassetti, University of Calabria, Rende
Gianluigi Greco, University of Calabria, Rende
Giorgio Terracina, University of Calabria, Rende
The discovery of information encoded in biological sequences is assuming a prominent role in identifying genetic diseases and in deciphering biological mechanisms. This information is usually encoded in patterns that occur frequently in the sequences, called motifs. Motif discovery has received much attention in the literature, and several algorithms have been proposed that are specifically tailored to motifs exhibiting some kind of "regular structure". Motivated by biological observations, this paper focuses on mining loosely structured motifs, i.e., more general kinds of motifs in which several "exceptions" may be tolerated in pattern repetitions. To this end, an algorithm exploiting data structures designed to handle pattern variability efficiently is presented and analyzed. Furthermore, a randomized variant with linear time and space complexity is introduced, and a theoretical guarantee on its performance is proven. Both algorithms have been implemented and tested on real data sets. Although they can mine very complex kinds of patterns, the performance results indicate that the proposed techniques are applicable genome-wide.

Index Terms:
Data mining, Bioinformatics (genome or protein) databases, Mining methods and algorithms
Citation:
Fabio Fassetti, Gianluigi Greco, Giorgio Terracina, "Mining Loosely Structured Motifs from Biological Data," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 11, pp. 1472-1489, Nov. 2008, doi:10.1109/TKDE.2008.65