Issue No.04 - October-December (2010 vol.7)
pp: 752-762
Alberto Apostolico, Georgia Institute of Technology, Atlanta and University of Padova, Padova
Matteo Comin, University of Padova, Padova
Laxmi Parida, IBM T.J. Watson Research Center, Yorktown Heights
ABSTRACT
The discovery of motifs in biosequences is frequently torn between the rigidity of the model on the one hand and the abundance of candidates on the other. In particular, the number of motifs that include wild cards, or “don't cares,” grows exponentially with the number of wild cards, and the growth only worsens if a don't care is allowed to stretch up to some prescribed maximum length. In this paper, a notion of extensible motif in a sequence is introduced and studied, which tightly combines the structure of the motif pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. It is shown that a combination of appropriate saturation conditions and the monotonicity of probabilistic scores over regions of constant frequency affords significant parsimony in the generation and testing of candidate overrepresented motifs. A suite of software programs called Varun is described, implementing the discovery of extensible motifs of the type considered. The merits of the method are then documented by results obtained in a variety of experiments primarily targeting protein sequence families. Equally important, the sets of all surprising motifs returned in each experiment are extracted faster and come in much more manageable sizes than would be obtained in the absence of saturation constraints.
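To make the idea of a don't care that stretches up to a prescribed maximum length concrete, the following is a minimal sketch, not the Varun implementation. It assumes a hypothetical motif encoding in which a motif is a list of solid characters and `(min, max)` gap tuples, and it assumes that an occurrence is counted once per starting position; neither convention is taken from the paper.

```python
import re

def motif_to_regex(motif):
    """Translate a motif into a regular expression string.

    Assumed (hypothetical) encoding: each token is either a solid
    string, matched literally, or a (lo, hi) tuple standing for a
    run of between lo and hi don't-care positions.
    """
    parts = []
    for token in motif:
        if isinstance(token, str):
            parts.append(re.escape(token))
        else:
            lo, hi = token
            parts.append(".{%d,%d}" % (lo, hi))
    return "".join(parts)

def occurrence_starts(motif, sequence):
    """Return every start position at which the motif matches.

    A zero-width lookahead is used so that overlapping occurrences
    are found; each start position is reported once, even if several
    gap lengths would match from it (an assumption of this sketch).
    """
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence)]

# 'A', then one to three don't cares, then 'C'.
motif = ["A", (1, 3), "C"]
sequence = "AGCATTCAAAAC"
starts = occurrence_starts(motif, sequence)
print(motif_to_regex(motif), starts, len(starts))
```

Widening the gap bounds, or adding further gaps, enlarges the set of candidate motifs that fit the same sequence, which is the growth in candidates that the saturation conditions are meant to curb.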
INDEX TERMS
Computational genomics, pattern discovery, data mining, motif, protein sequence, protein family.
CITATION
Alberto Apostolico, Matteo Comin, Laxmi Parida, "VARUN: Discovering Extensible Motifs under Saturation Constraints", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.7, no. 4, pp. 752-762, October-December 2010, doi:10.1109/TCBB.2008.123