A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences
March/April 2012 (vol. 9 no. 2)
pp. 345-357
Sascha Steinbiss, University of Hamburg, Hamburg
Stefan Kurtz, University of Hamburg, Hamburg
Today's genome analysis applications require sequence representations that allow fast access to their contents while also being memory-efficient enough to facilitate analyses of large-scale data. While a wide variety of sequence representations exist, the lack of a generic implementation of efficient sequence storage has led to a plethora of poorly reusable or programming language-specific implementations. We present a novel, space-efficient data structure (GtEncseq) for storing multiple biological sequences of variable alphabet size, with customizable character transformations, wildcard support, and an assortment of internal representations optimized for different distributions of wildcards and sequence lengths. For the human genome (3.1 gigabases, including 237 million wildcard characters) our representation requires only $2 + 8 \cdot 10^{-6}$ bits per character. Implemented in C, our portable software implementation provides a variety of methods for random and sequential access to characters and substrings (including different reading directions) using an object-oriented interface. In addition, it includes access to metadata like sequence descriptions or character distributions. The library can be extended for use from various scripting languages. GtEncseq is much more versatile than previous solutions, adding features that were previously unavailable. Benchmarks show that it is competitive with respect to space and time requirements.
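
To make the figure of roughly two bits per character more concrete, the following is a minimal Python sketch of one common way to reach it: packing each DNA base into two bits and recording runs of wildcards separately as position ranges. It is an illustration under stated assumptions, not the GtEncseq implementation or its API. The class `TwoBitSeq` and its methods `char_at` and `substring` are hypothetical names, the alphabet is fixed to A, C, G, T, and every other character is treated as a wildcard that decodes to `N`.

```python
from bisect import bisect_right

# Hypothetical illustration only: not the GtEncseq layout or interface.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


class TwoBitSeq:
    """DNA stored at 2 bits per base, with wildcard runs kept as ranges."""

    def __init__(self, seq, wildcard="N"):
        self.length = len(seq)
        self.wildcard = wildcard
        # Four bases fit into one byte.
        self.packed = bytearray((len(seq) + 3) // 4)
        # Parallel sorted lists: run k covers positions starts[k]..ends[k]-1.
        self.starts = []
        self.ends = []
        for i, ch in enumerate(seq.upper()):
            code = CODE.get(ch)
            if code is None:
                # Assumption: any non-ACGT character counts as a wildcard.
                # Its 2-bit slot stays 0; only the run boundaries are stored.
                if self.ends and self.ends[-1] == i:
                    self.ends[-1] = i + 1
                else:
                    self.starts.append(i)
                    self.ends.append(i + 1)
            else:
                self.packed[i >> 2] |= code << ((i & 3) * 2)

    def __len__(self):
        return self.length

    def char_at(self, i):
        """Random access: binary search the wildcard runs, then unpack."""
        if not 0 <= i < self.length:
            raise IndexError(i)
        k = bisect_right(self.starts, i) - 1
        if k >= 0 and i < self.ends[k]:
            return self.wildcard
        return BASES[(self.packed[i >> 2] >> ((i & 3) * 2)) & 3]

    def substring(self, start, end):
        """Decode positions start..end-1 into a string."""
        return "".join(self.char_at(i) for i in range(start, end))
```

For example, `TwoBitSeq("ACGTNNNNACGTTGCA")` packs its 16 characters into 4 bytes and stores a single wildcard range covering positions 4 to 7. In this sketch the overhead beyond two bits per character depends on the number of wildcard runs rather than the number of wildcard characters, so it stays small when wildcards cluster in a few long stretches.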

Index Terms:
Data storage representations, biology and genetics, software engineering, reusable libraries.
Sascha Steinbiss, Stefan Kurtz, "A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 2, pp. 345-357, March-April 2012, doi:10.1109/TCBB.2011.146