Issue No.06 - Nov.-Dec. (2012 vol.9)
pp: 1837-1842
E. Grassi , Dept. of Genetics, Biol. & Biochem., Mol. Biotechnol. Center, Turin, Italy
F. D. Gregorio , DNDG srl, Turin, Italy
I. Molineris , Dept. of Genetics, Biol. & Biochem., Mol. Biotechnol. Center, Turin, Italy
ABSTRACT
Storing data derived from deep sequencing experiments has become pivotal, and standard compression algorithms do not satisfactorily exploit the structure of these data. A number of reference-based compression algorithms have been developed, but they are less adequate for new species without fully sequenced genomes or for nongenomic data. We developed a tool that takes advantage of fastq characteristics and encodes them in a binary format optimized to be further compressed with standard tools (such as gzip or lzma). The algorithm is straightforward and does not need any external reference file; it scans the fastq only once and has a constant memory requirement. Moreover, we added the possibility to perform lossy compression, which discards some of the original information (IDs and/or qualities) but results in smaller files; it is also possible to define a quality cutoff below which the corresponding base calls are converted to N. We achieve compression ratios of 2.82 to 7.77 on various fastq files without losing information and of 5.37 to 8.77 when discarding IDs, which are often not used in common analysis pipelines. In this paper, we compare the algorithm's performance with that of known tools, usually obtaining higher compression levels.
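The lossy options and the single-pass, constant-memory behaviour described in the abstract can be illustrated with a minimal Python sketch. This is not KungFQ's implementation or its binary format, which the abstract does not specify. It assumes four-line fastq records, Phred+33 quality encoding, and plain text passed to the standard lzma module; the function names (`mask_low_quality`, `stream_records`, `compress_fastq`) are hypothetical.

```python
import io
import lzma


def mask_low_quality(seq, qual, cutoff, offset=33):
    """Replace base calls whose Phred score is below `cutoff` with N.

    Assumes Phred+33 quality characters (an assumption, not from the paper).
    """
    return "".join(
        base if ord(q) - offset >= cutoff else "N"
        for base, q in zip(seq, qual)
    )


def stream_records(handle):
    """Yield (read_id, seq, qual) one record at a time.

    Only the current record is held in memory, so memory use does not
    grow with file size. Assumes strict four-line fastq records.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def compress_fastq(src, dst, cutoff=0, keep_ids=True):
    """Scan `src` once, apply the lossy options, and write xz data to `dst`.

    Returns (raw_bytes, packed_bytes); raw_bytes / packed_bytes is the
    compression ratio relative to the original records.
    """
    comp = lzma.LZMACompressor()
    raw = packed = 0
    for read_id, seq, qual in stream_records(src):
        # Size of the original record, assuming a bare '+' separator line.
        raw += len(f"@{read_id}\n{seq}\n+\n{qual}\n")
        masked = mask_low_quality(seq, qual, cutoff)
        header = "@" + (read_id if keep_ids else "")
        chunk = comp.compress(f"{header}\n{masked}\n+\n{qual}\n".encode("ascii"))
        packed += len(chunk)
        dst.write(chunk)
    tail = comp.flush()
    dst.write(tail)
    packed += len(tail)
    return raw, packed
```

Calling `compress_fastq(open("reads.fastq"), open("reads.xz", "wb"), cutoff=20, keep_ids=False)` on a hypothetical file would mask base calls with quality below 20 and drop the IDs; dividing the two returned byte counts gives a ratio comparable in spirit, though not in value, to the figures quoted in the abstract.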
INDEX TERMS
Encoding, Bioinformatics, Genomics, Decoding, Standards, Compression algorithms, Algorithms for data and knowledge management, Biology and genetics
CITATION
E. Grassi, F. D. Gregorio, I. Molineris, "KungFQ: A Simple and Powerful Approach to Compress fastq Files", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.9, no. 6, pp. 1837-1842, Nov.-Dec. 2012, doi:10.1109/TCBB.2012.123