Issue No. 04 - July-Aug. (2012 vol. 9)

ISSN: 1545-5963

pp: 1120-1127

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TCBB.2012.30

R. Marangoni, Dept. of Comput. Sci., Univ. of Pisa, Pisa, Italy

C. Felicioli, Noname Res., Pisa, Italy

ABSTRACT

Here, we propose BpMatch: an algorithm that, working on a suitably modified suffix-tree data structure, quickly and efficiently computes the coverage of a source sequence S on a target sequence T, taking into account direct and reverse segments, possibly overlapping. To use BpMatch, the operator defines a priori the minimum length l of a segment and the minimum number of occurrences minRep, so that only segments longer than l and having a number of occurrences greater than minRep are considered significant. BpMatch outputs the significant segments found and the computed segment-based distance. In the worst case, assuming the alphabet dimension d is a constant, the time required by BpMatch to calculate the coverage is O(l^{2}n). On average, by setting l ≥ 2 log_{d}(n), the time required to calculate the coverage is only O(n). Thanks to the minRep parameter, BpMatch can also be used to perform a self-covering: covering a sequence using segments coming from itself, while avoiding the trivial solution of a single segment coincident with the whole sequence. The result of the self-covering approach is a spectral representation of the repeats contained in the sequence. BpMatch is freely available at www.sourceforge.net/projects/bpmatch/.

INDEX TERMS

trees (mathematics), bioinformatics, data structures, genomics, molecular biophysics, molecular configurations, self covering, BpMatch, genomic sequence segmental analysis algorithm, suffix tree data structure, source sequence coverage, target sequence, direct segments, reverse segments, segment based distance, alphabet dimension, Bioinformatics, Genomics, Algorithm design and analysis, Image segmentation, Complexity theory, Evolution (biology), coverage index, Segmental analysis, genomic sequences, repeats, inverted repeats

CITATION

R. Marangoni, C. Felicioli, "BpMatch: An Efficient Algorithm for a Segmental Analysis of Genomic Sequences", *IEEE/ACM Transactions on Computational Biology and Bioinformatics*, vol. 9, no. 4, pp. 1120-1127, July-Aug. 2012, doi:10.1109/TCBB.2012.30
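
ILLUSTRATIVE SKETCH

The notion of coverage described in the abstract can be illustrated with a naive Python sketch. This is not the BpMatch algorithm: it uses no suffix tree, and the function names `kmer_counts` and `coverage` are hypothetical. It rests on several assumptions that the abstract does not confirm: a "reverse" segment is taken to be the plain string reversal (a reverse complement could be swapped in), occurrences are counted in the source sequence, only windows of exactly `min_len` characters are tested, and "at least `min_len`" is used as the length threshold. Palindromic windows are counted once as direct and once as reversed, which is a further simplification.

```python
from collections import Counter


def kmer_counts(s: str, k: int) -> Counter:
    """Count every length-k substring of s (overlapping occurrences included)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def coverage(source: str, target: str, min_len: int, min_rep: int = 0) -> float:
    """Fraction of target positions lying inside a significant window.

    Assumption: a length-min_len window of target is significant when it
    occurs more than min_rep times in source, counting both direct and
    reversed occurrences. Significant windows may overlap.
    """
    if min_len <= 0 or len(target) < min_len:
        return 0.0

    # Occurrences in the source read forwards plus the source read backwards.
    counts = kmer_counts(source, min_len) + kmer_counts(source[::-1], min_len)

    covered = [False] * len(target)
    for i in range(len(target) - min_len + 1):
        if counts[target[i:i + min_len]] > min_rep:
            # Mark every position of this window as covered.
            for j in range(i, i + min_len):
                covered[j] = True

    return sum(covered) / len(target)
```

Under these assumptions, calling `coverage(seq, seq, min_len, min_rep=1)` gives a rough analogue of self-covering: a window contributes only if it occurs more than once in the sequence, so a region that appears a single time does not cover itself. The sketch takes time proportional to `min_len` times the sequence length and is meant only to make the definitions concrete, not to reproduce the complexity bounds stated in the abstract.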