Issue No.05 - Sept.-Oct. (2013 vol.10)
pp: 1275-1288
Sebastian Wandelt , Humboldt-University of Berlin, Berlin
Ulf Leser , Humboldt-University of Berlin, Berlin
ABSTRACT
In many applications, sets of similar texts or sequences are of high importance. Prominent examples are revision histories of documents or genomic sequences. Modern high-throughput sequencing technologies are able to generate DNA sequences at an ever-increasing rate. As the experimental time and cost necessary to produce DNA sequences decrease, the computational requirements for analyzing and storing the sequences are rising steeply. Compression is a key technology to deal with this challenge. Recently, referential compression schemes, which store only the differences between a to-be-compressed input and a known reference sequence, have attracted considerable interest in this field. In this paper, we propose FRESCO (Framework for REferential Sequence COmpression), a general open-source framework for compressing large amounts of biological sequence data. Our basic compression algorithm is shown to be one to two orders of magnitude faster than comparable related work, while achieving similar compression ratios. We also propose several techniques to further increase compression ratios while still retaining the advantage in speed: 1) selecting a good reference sequence; and 2) rewriting a reference sequence to allow for better compression. In addition, we propose a new way of further boosting the compression ratios by applying referential compression to already referentially compressed files (second-order compression). This technique allows for compression ratios well beyond the state of the art, for instance, 4,000:1 and higher for human genomes. We evaluate our algorithms on a large data set from three different species (more than 1,000 genomes, more than 3 TB) and on a collection of versions of Wikipedia pages. Our results show that real-time compression of highly similar sequences at high compression ratios is possible on modern hardware.
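To make the idea of referential compression concrete, here is a minimal Python sketch of the general technique: an input is encoded as a list of (position, length) matches into a reference, plus raw literals where no match exists. This is an illustration only, not FRESCO's actual algorithm or file format. The function names (`build_index`, `compress`, `decompress`), the k-mer hash index, the greedy longest-match strategy, and the default `k = 4` are all assumptions chosen for brevity.

```python
from typing import Dict, List, Tuple, Union

# An entry is either a (reference_position, length) match or a raw literal string.
Entry = Union[Tuple[int, int], str]


def build_index(ref: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index: Dict[str, List[int]] = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    return index


def compress(seq: str, ref: str, k: int = 4) -> List[Entry]:
    """Greedily encode seq as matches against ref plus literals (hypothetical scheme)."""
    index = build_index(ref, k)
    out: List[Entry] = []
    literal: List[str] = []
    i = 0
    while i < len(seq):
        best_pos, best_len = -1, 0
        # Try every reference position sharing the next k-mer; keep the longest match.
        for pos in index.get(seq[i:i + k], []):
            length = k
            while (i + length < len(seq) and pos + length < len(ref)
                   and seq[i + length] == ref[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            if literal:  # flush pending unmatched characters first
                out.append("".join(literal))
                literal = []
            out.append((best_pos, best_len))
            i += best_len
        else:
            literal.append(seq[i])  # no match of length k: store the raw character
            i += 1
    if literal:
        out.append("".join(literal))
    return out


def decompress(entries: List[Entry], ref: str) -> str:
    """Rebuild the original sequence from the entries and the same reference."""
    parts: List[str] = []
    for entry in entries:
        if isinstance(entry, str):
            parts.append(entry)
        else:
            pos, length = entry
            parts.append(ref[pos:pos + length])
    return "".join(parts)


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"
    sample = "ACGTACGATTGACCA"  # differs from the reference in one base
    encoded = compress(sample, reference)
    assert decompress(encoded, reference) == sample
    print(encoded)
```

The more similar the input is to the reference, the fewer and longer the match entries become, which is why the choice of reference matters for the compression ratio. In this sketch, the entry list could itself be serialized and compressed again, which is loosely the intuition behind second-order compression, although the paper's actual scheme is not reproduced here.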
INDEX TERMS
Genomics, Compression algorithms, Encoding, Bioinformatics, Image coding, Sequential analysis, Computational biology, compression heuristics, Sequences, referential compression, second-order compression
CITATION
Sebastian Wandelt, Ulf Leser, "FRESCO: Referential Compression of Highly Similar Sequences", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.10, no. 5, pp. 1275-1288, Sept.-Oct. 2013, doi:10.1109/TCBB.2013.122