Issue No.01 - Jan.-Feb. (2013 vol.10)
pp: 213-218
Mark Howison, Center for Comput. & Visualization, Brown Univ., Providence, RI, USA
ABSTRACT
Compression has become a critical step in storing next-generation sequencing (NGS) data sets because of both the increasing size and decreasing costs of such data. Recent research into efficiently compressing sequence data has focused largely on improving compression ratios. Yet, the throughputs of current methods now lag far behind the I/O bandwidths of modern storage systems. As biologists move their analyses to high-performance systems with greater I/O bandwidth, low-throughput compression becomes a limiting factor. To address this gap, we present a new storage model called SeqDB, which offers high-throughput compression of sequence data with minimal sacrifice in compression ratio. It achieves this by combining the existing multithreaded Blosc compressor with a new data-parallel byte-packing scheme, called SeqPack, which interleaves sequence data and quality scores.
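The abstract does not specify the SeqPack byte layout, so the following is only a minimal sketch of the general idea of packing a base and its quality score into a single byte before handing the result to a general-purpose compressor. The five-symbol alphabet, the `score * 5 + base` encoding, the quality cap of 50, and the function names `pack_read` and `unpack_read` are all assumptions made for illustration, and the standard-library `zlib` stands in for Blosc.

```python
import zlib

# Assumptions for illustration only; not the published SeqPack layout.
BASES = "ACGTN"      # hypothetical 5-symbol alphabet
PHRED_OFFSET = 33    # Sanger FASTQ quality offset
MAX_QUAL = 50        # 50 * 5 + 4 = 254, so every pair fits in one byte


def pack_read(seq, qual):
    """Pack each (base, quality) pair of one read into a single byte."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    out = bytearray(len(seq))
    for i, (base, q) in enumerate(zip(seq, qual)):
        score = ord(q) - PHRED_OFFSET
        if not 0 <= score <= MAX_QUAL:
            raise ValueError("quality score out of range for this sketch")
        # BASES.index raises ValueError for symbols outside the alphabet.
        out[i] = score * len(BASES) + BASES.index(base)
    return bytes(out)


def unpack_read(packed):
    """Invert pack_read, returning (sequence, quality) strings."""
    seq, qual = [], []
    for byte in packed:
        score, base = divmod(byte, len(BASES))
        seq.append(BASES[base])
        qual.append(chr(score + PHRED_OFFSET))
    return "".join(seq), "".join(qual)


packed = pack_read("ACGTN", "IIII#")
compressed = zlib.compress(packed)  # stand-in for the Blosc compressor
restored = unpack_read(zlib.decompress(compressed))
```

Packing of this kind halves the number of bytes per base before compression, and because each read is packed independently, the loop could in principle be split across threads or vectorized.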
INDEX TERMS
Throughput, Arrays, Bandwidth, Libraries, Bioinformatics, Instruction sets, Genomics, FASTQ, Compression, data storage, next-generation sequencing
CITATION
Mark Howison, "High-Throughput Compression of FASTQ Data with SeqDB", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 1, pp. 213-218, Jan.-Feb. 2013, doi:10.1109/TCBB.2012.160