DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TCBB.2013.122
Sebastian Wandelt, Humboldt-University of Berlin, Berlin
Ulf Leser, Humboldt-University of Berlin, Berlin
In many applications, sets of similar texts or sequences are of high importance, e.g., revision histories of documents or collections of genomic sequences. Modern high-throughput sequencing technologies generate DNA sequences at an ever-increasing rate. While the experimental time and cost needed to produce DNA sequences keep decreasing, the computational requirements for analyzing and storing the sequences are increasing steeply. Compression is a key technology for dealing with this challenge. In this paper, we propose FRESCO, a general open-source framework for compressing large amounts of biological sequence data. Our basic compression algorithm is shown to be one to two orders of magnitude faster than comparable related work, while achieving similar compression ratios. We also propose several techniques to further increase compression ratios while retaining the advantage in speed. In addition, we propose a new way of boosting compression ratios further: second-order referential compression. This technique allows for compression ratios far beyond the state of the art, for instance, 4000:1 and higher for human genomes. We evaluate our algorithms on a large data set from three different species and on a collection of versions of Wikipedia pages. Our results show that real-time compression of highly similar sequences at high compression ratios is possible on modern hardware.
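The general idea of referential compression named in the abstract can be illustrated with a small sketch: a sequence is stored as a list of pointers into a reference sequence plus the raw characters that the reference does not cover. The code below is a toy illustration under stated assumptions, not FRESCO's actual algorithm or data format. The function names (`build_index`, `compress`, `decompress`), the k-mer hash index, the greedy longest-match strategy, and the default `k=4` are all hypothetical choices made for this example; second-order referential compression is not shown.

```python
def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def compress(target, reference, k=4):
    """Encode target as ("match", start, length) and ("raw", text) entries.

    Greedy toy scheme (an assumption of this sketch): at each position,
    take the longest match of at least k characters found via the k-mer
    index, otherwise emit the character as raw text.
    """
    index = build_index(reference, k)
    entries = []
    raw = []  # raw characters collected since the last match
    i = 0
    while i < len(target):
        best_start, best_len = -1, 0
        for start in index.get(target[i:i + k], []):
            # Extend the seed match as far as both sequences agree.
            length = k
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        if best_len >= k:
            if raw:
                entries.append(("raw", "".join(raw)))
                raw = []
            entries.append(("match", best_start, best_len))
            i += best_len
        else:
            raw.append(target[i])
            i += 1
    if raw:
        entries.append(("raw", "".join(raw)))
    return entries


def decompress(entries, reference):
    """Rebuild the original sequence from the entries and the reference."""
    parts = []
    for entry in entries:
        if entry[0] == "match":
            _, start, length = entry
            parts.append(reference[start:start + length])
        else:
            parts.append(entry[1])
    return "".join(parts)
```

With a reference such as `"ACGTACGTTTGACCA"` and a target that differs from it in a single base, most of the target is covered by a few match entries, and `decompress(compress(target, reference), reference)` returns the target unchanged. The more similar the sequences, the fewer and longer the entries, which is why collections of highly similar sequences compress so well against a shared reference.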
Bioinformatics (genome or protein) databases, Data compaction and compression, Information Storage
Sebastian Wandelt, Ulf Leser, "FRESCO: Referential Compression of Highly-Similar Sequences", IEEE/ACM Transactions on Computational Biology and Bioinformatics, doi:10.1109/TCBB.2013.122