VARUN: Discovering Extensible Motifs under Saturation Constraints
PrePrint
ISSN: 1545-5963
Alberto Apostolico, University of Padova, Padova, Georgia Institute of Technology, Atlanta
Matteo Comin, University of Padova, Padova
Laxmi Parida, IBM T. J. Watson Research Center, Yorktown Heights
The discovery of motifs in biosequences is frequently torn between the rigidity of the model on the one hand and the abundance of candidates on the other. In particular, the number of candidate motifs that include wildcards or "don't cares" grows exponentially with the number of wildcards, and this only gets worse if a don't care is allowed to stretch up to some prescribed maximum length. In this paper, a notion of extensible motif in a sequence is introduced and studied, which tightly combines the structure of the motif pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. It is shown that a combination of appropriate saturation conditions and the monotonicity of probabilistic scores over regions of constant frequency affords significant parsimony in the generation and testing of candidate overrepresented motifs. A suite of software programs called Varun is described, implementing the discovery of extensible motifs of the type considered. The merits of the method are then documented by results obtained in a variety of experiments primarily targeting protein sequence families. Equally important, the sets of all surprising motifs returned in each experiment are extracted faster and come in much more manageable sizes than would be obtained in the absence of saturation constraints.
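To make the idea of a stretchable don't care concrete, the following is a minimal Python sketch, not the Varun implementation. It assumes a motif can be written as a list of solid blocks separated by gaps of between 1 and `max_gap` arbitrary characters, and it counts the start positions at which such a motif occurs. The function names `motif_to_regex` and `count_occurrences`, the toy sequence, and the regular-expression encoding are all illustrative assumptions; saturation conditions and probabilistic scoring are not modeled here.

```python
import re


def motif_to_regex(solid_blocks, max_gap):
    """Build a regex for solid blocks joined by 1..max_gap don't-care characters.

    Hypothetical helper: the encoding is an assumption for illustration only.
    """
    gap = ".{1,%d}" % max_gap
    return re.compile(gap.join(re.escape(block) for block in solid_blocks))


def count_occurrences(pattern, sequence):
    """Count the start positions where the motif matches (overlaps allowed)."""
    return sum(1 for i in range(len(sequence)) if pattern.match(sequence, i))


# Toy sequence (made up for this sketch, not taken from the paper).
sequence = "ACGTAAGGTACCCGT"

# The same solid blocks occur more often as the gap is allowed to stretch.
for max_gap in (1, 2, 3):
    pattern = motif_to_regex(["A", "GT"], max_gap)
    print(max_gap, count_occurrences(pattern, sequence))
```

Even on this toy input, allowing a longer gap increases the occurrence count of the same solid blocks, which hints at why the candidate space grows quickly when don't cares may stretch.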
Index Terms:
computational genomics, pattern discovery, data mining, protein sequence
Citation:
Alberto Apostolico, Matteo Comin, Laxmi Parida, "VARUN: Discovering Extensible Motifs under Saturation Constraints," IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12 Nov. 2008. IEEE Computer Society Digital Library. IEEE Computer Society, <http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.123>