BpMatch: An Efficient Algorithm for a Segmental Analysis of Genomic Sequences
July-Aug. 2012 (vol. 9 no. 4)
pp. 1120-1127
R. Marangoni, Dept. of Comput. Sci., Univ. of Pisa, Pisa, Italy
C. Felicioli, Noname Res., Pisa, Italy
Here, we propose BpMatch, an algorithm that works on a suitably modified suffix-tree data structure to compute, quickly and efficiently, the coverage of a source sequence S on a target sequence T, taking into account direct and reverse segments, which may overlap. To use BpMatch, the operator defines a priori the minimum segment length l and the minimum number of occurrences minRep, so that only segments longer than l and occurring more than minRep times are considered significant. BpMatch outputs the significant segments found and the computed segment-based distance. In the worst case, assuming the alphabet dimension d is constant, the time required by BpMatch to calculate the coverage is O(l^2 n). On average, setting l ≥ 2 log_d(n), the time required to calculate the coverage is only O(n). Thanks to the minRep parameter, BpMatch can also be used to perform a self-covering, that is, to cover a sequence using segments taken from the sequence itself while avoiding the trivial solution of a single segment coinciding with the whole sequence. The result of the self-covering approach is a spectral representation of the repeats contained in the sequence. BpMatch is freely available at www.sourceforge.net/projects/bpmatch/.
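
To make the notion of coverage concrete, the following is a minimal sketch and not the BpMatch implementation: it replaces the modified suffix tree with a plain hash table of length-l windows, so it illustrates the quantity being computed rather than the stated running times. All function names are hypothetical. It assumes that "reverse" means reverse complement, that the thresholds l and minRep are inclusive, that occurrences are counted in the source S, that coverage is the fraction of positions of T lying in at least one significant segment, and that n is the sequence length. The segment-based distance is not computed here.

```python
import math
from collections import Counter

# Assumption: sequences are DNA over the alphabet A, C, G, T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def suggested_min_length(n, d=4):
    """Smallest integer l satisfying l >= 2 * log_d(n)."""
    return math.ceil(2 * math.log(n, d))


def coverage(source, target, min_len, min_rep=1):
    """Fraction of target positions covered by segments of source.

    A window of length min_len in target is significant when it occurs
    in source, directly or as a reverse complement, at least min_rep
    times. Any longer matching segment is a union of such windows, so
    marking windows of exactly min_len is enough to mark every segment
    of length >= min_len.
    """
    if min_len <= 0:
        raise ValueError("min_len must be positive")
    # Count every length-min_len window of the source once.
    windows = Counter(
        source[i:i + min_len] for i in range(len(source) - min_len + 1)
    )
    covered = [False] * len(target)
    for i in range(len(target) - min_len + 1):
        w = target[i:i + min_len]
        rc = reverse_complement(w)
        # Avoid double counting when a window equals its own reverse complement.
        hits = windows[w] + (windows[rc] if rc != w else 0)
        if hits >= min_rep:
            for j in range(i, i + min_len):
                covered[j] = True
    return sum(covered) / len(target) if target else 0.0


if __name__ == "__main__":
    print(coverage("ACGTACGT", "ACGTTTTT", 4))   # direct match on the first 4 bases
    print(coverage("AAAACCCC", "GGGGTTTT", 4))   # target is the reverse complement
    print(suggested_min_length(10**6))           # threshold for n = 10^6, d = 4
```

Calling coverage(seq, seq, l, min_rep=2) gives a rough analogue of self-covering under these assumptions: every window trivially occurs once in its own sequence, so requiring at least two occurrences keeps only repeated material.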

Index Terms:
trees (mathematics), bioinformatics, data structures, genomics, molecular biophysics, molecular configurations, self covering, BpMatch, genomic sequence segmental analysis algorithm, suffix tree data structure, source sequence coverage, target sequence, direct segments, reverse segments, segment based distance, alphabet dimension, Bioinformatics, Genomics, Algorithm design and analysis, Image segmentation, Complexity theory, Evolution (biology), coverage index, Segmental analysis, genomic sequences, repeats, inverted repeats
Citation:
R. Marangoni, C. Felicioli, "BpMatch: An Efficient Algorithm for a Segmental Analysis of Genomic Sequences," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 4, pp. 1120-1127, July-Aug. 2012, doi:10.1109/TCBB.2012.30