KungFQ: A Simple and Powerful Approach to Compress fastq Files
Nov.-Dec. 2012 (vol. 9 no. 6)
pp. 1837-1842
E. Grassi, Dept. of Genetics, Biol. & Biochem., Mol. Biotechnol. Center, Turin, Italy
F. D. Gregorio, DNDG srl, Turin, Italy
I. Molineris, Dept. of Genetics, Biol. & Biochem., Mol. Biotechnol. Center, Turin, Italy
Storing data derived from deep sequencing experiments has become pivotal, yet standard compression algorithms do not exploit the structure of these data satisfactorily. A number of reference-based compression algorithms have been developed, but they are less suitable for new species without fully sequenced genomes or for nongenomic data. We developed a tool that takes advantage of fastq characteristics and encodes them in a binary format optimized for further compression with standard tools (such as gzip or lzma). The algorithm is straightforward and does not need any external reference file; it scans the fastq file only once and has a constant memory requirement. Moreover, we added the option of lossy compression, which discards some of the original information (IDs and/or qualities) but yields smaller files; it is also possible to define a quality cutoff below which the corresponding base calls are converted to N. We achieve compression ratios of 2.82 to 7.77 on various fastq files without losing information, and 5.37 to 8.77 when discarding IDs, which are often not used in common analysis pipelines. In this paper, we compare the algorithm's performance with that of known tools, usually obtaining higher compression levels.
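As a rough illustration of the quality cutoff described in the abstract, the sketch below replaces base calls whose quality falls below a threshold with N. It is not taken from KungFQ itself: the function name mask_low_quality, the Phred+33 quality encoding, and the strict "below the cutoff" comparison are all assumptions made for this example.

```python
def mask_low_quality(seq: str, qual: str, cutoff: int, offset: int = 33) -> str:
    """Return seq with every base whose quality is below cutoff replaced by 'N'.

    Hypothetical helper, not KungFQ's API. Assumes qualities are ASCII-encoded
    Phred scores with the given offset (33 is an assumption, not from the paper).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    return "".join(
        # Keep the base when its decoded quality reaches the cutoff.
        base if ord(q) - offset >= cutoff else "N"
        for base, q in zip(seq, qual)
    )


if __name__ == "__main__":
    # 'I' decodes to quality 40 and '#' to quality 2 under Phred+33.
    print(mask_low_quality("ACGT", "II#I", cutoff=20))
```

Masking low-quality calls in this way lengthens runs of identical symbols, which is one plausible reason a lossy mode of this kind would help a downstream general-purpose compressor.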
Index Terms:
storage management, bioinformatics, data compression, nongenomic data, KungFQ, fastq file compression, reference-based compression algorithms, fastq characteristics, binary format, constant memory requirement, Encoding, Bioinformatics, Genomics, Decoding, Standards, Compression algorithms, algorithms for data and knowledge management, Biology and genetics
Citation:
E. Grassi, F. D. Gregorio, I. Molineris, "KungFQ: A Simple and Powerful Approach to Compress fastq Files," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 6, pp. 1837-1842, Nov.-Dec. 2012, doi:10.1109/TCBB.2012.123