Efficient Maximal Repeat Finding Using the Burrows-Wheeler Transform and Wavelet Tree
March/April 2012 (vol. 9 no. 2)
pp. 421-429
M. Oğuzhan Külekci, National Research Institute of Electronics and Cryptology, Gebze
Jeffrey Scott Vitter, The University of Kansas, Lawrence
Bojian Xu, Eastern Washington University, Cheney
Finding repetitive structures in genomes and proteins is important for understanding their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale and local repetitive structures are better understood than large and complex interspersed ones. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Even with the best engineering efforts, the space usage of these methods is 19-50 times the text size, which prevents their use on massive data such as the whole human genome. We focus on finding all the maximal repeats from massive texts in a time- and space-efficient manner. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts and protein data, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage of our method is less than double the sequence size. This space efficiency does not come at the cost of speed. In fact, our method is orders of magnitude faster than the prior methods for processing massive texts such as the whole human genome, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8 GB internal memory (actual internal memory usage is less than 6 GB) to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.
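To make the terms in the abstract concrete, the following is a minimal brute-force sketch in Python, not the authors' algorithm. It assumes the usual definition of a maximal repeat (a substring that occurs at least twice and whose every one-character extension, to the left or to the right, occurs fewer times) and builds the Burrows-Wheeler Transform naively from sorted rotations. The function names, the "$" sentinel, and the quadratic enumeration are illustrative assumptions that only suit toy inputs.

```python
from collections import defaultdict


def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations.

    Assumes `sentinel` does not occur in `text` and sorts before every
    other character. Illustration only; real tools never build rotations.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def maximal_repeats(text):
    """Brute-force maximal repeats, mapping each one to its start positions.

    A repeated substring is kept when its occurrences have at least two
    distinct left neighbours and at least two distinct right neighbours
    (a text boundary counts as a neighbour of its own), which means no
    one-character extension keeps the same number of occurrences.
    """
    n = len(text)
    positions = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            positions[text[i:j]].append(i)

    result = {}
    for w, starts in positions.items():
        if len(starts) < 2:
            continue
        left = {text[i - 1] if i > 0 else None for i in starts}
        right = {text[i + len(w)] if i + len(w) < n else None for i in starts}
        if len(left) > 1 and len(right) > 1:
            result[w] = starts
    return result


def left_contexts_via_bwt(text, w, sentinel="$"):
    """Left neighbours of `w`, read from the BWT instead of the text.

    In suffix order, the suffixes that start with `w` are contiguous, and
    the BWT symbols at those ranks are exactly the characters preceding
    each occurrence. A wavelet tree over the BWT could report whether such
    a range holds two or more distinct symbols without scanning it; here
    the range is simply scanned.
    """
    s = text + sentinel
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    bwt_symbols = [s[i - 1] for i in suffix_array]  # s[-1] is the sentinel
    return {
        bwt_symbols[rank]
        for rank, start in enumerate(suffix_array)
        if s.startswith(w, start)
    }
```

On the toy string "banana", maximal_repeats reports "a" and "ana", and left_contexts_via_bwt("banana", "ana") returns the two distinct preceding characters "b" and "n" that make "ana" left-maximal.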

Index Terms:
Repeats, maximal repeats, Burrows-Wheeler transform, wavelet trees.
M. Oğuzhan Külekci, Jeffrey Scott Vitter, Bojian Xu, "Efficient Maximal Repeat Finding Using the Burrows-Wheeler Transform and Wavelet Tree," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 2, pp. 421-429, March-April 2012, doi:10.1109/TCBB.2011.127