2011 IEEE International Conference on Bioinformatics and Biomedicine
No-Reference Compression of Genomic Data Stored in FASTQ Format
Atlanta, Georgia USA
November 12-November 15
ISBN: 978-0-7695-4574-5
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/BIBM.2011.110
In this paper, we propose a system to compress Next Generation Sequencing (NGS) information stored in a FASTQ file. A FASTQ file contains text, DNA read, and quality information for millions or billions of reads. The proposed system first parses the FASTQ file into its component fields. In a partial first pass, it gathers statistics that are then used to choose, for each field, the representation that gives the best compression. Text data is further parsed into repeating and variable components, and entropy coding is used to compress the latter. Similarly, Markov encoding and repeat-finding methods are used for DNA read compression. Finally, we propose several run-length-based methods to encode quality data, choosing the method that gives the best performance for a given set of quality values. The compression system supports lossless and near-lossless compression, as well as modes that compress only the reads or only the reads and quality data. We compare its performance to the bzip2 text compression utility and to an existing benchmark algorithm, and observe that the proposed system outperforms both.
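As a rough illustration of the first and last steps described in the abstract, the following Python sketch splits FASTQ records into separate text, read, and quality streams and applies plain run-length encoding to a quality string. It is not the authors' implementation: the function names `split_fastq`, `rle_encode`, and `rle_decode` are hypothetical, the four-lines-per-record layout is an assumption, and the textbook run-length scheme shown here stands in for the several run-length variants the paper proposes, whose details the abstract does not give.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def split_fastq(lines: Iterable[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split FASTQ text into header, read, and quality streams.

    Assumes the common four-line record layout:
    '@header', bases, '+' separator, quality string.
    """
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")

    headers: List[str] = []
    reads: List[str] = []
    quals: List[str] = []
    for i in range(0, len(rows), 4):
        header, bases, plus, quality = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(bases) != len(quality):
            raise ValueError(f"read/quality length mismatch at line {i + 1}")
        headers.append(header[1:])  # drop the leading '@'
        reads.append(bases)
        quals.append(quality)
    return headers, reads, quals


def rle_encode(quality: str) -> List[Tuple[str, int]]:
    """Collapse each run of identical quality symbols into (symbol, count)."""
    return [(symbol, len(list(run))) for symbol, run in groupby(quality)]


def rle_decode(runs: List[Tuple[str, int]]) -> str:
    """Invert rle_encode, so the round trip is lossless."""
    return "".join(symbol * count for symbol, count in runs)
```

Keeping the three streams separate lets each one be handed to a different coder, which is the idea the abstract describes; run-length coding pays off on quality strings only when they contain long runs of repeated values.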
Index Terms:
FASTQ, Next generation sequencing, Genomic Data Compression
Citation:
Vishal Bhola, Ajit S. Bopardikar, Rangavittal Narayanan, Kyusang Lee, TaeJin Ahn, "No-Reference Compression of Genomic Data Stored in FASTQ Format," 2011 IEEE International Conference on Bioinformatics and Biomedicine, pp. 147-150, 2011.
