Iterative Dictionary Construction for Compression of Large DNA Data Sets
January/February 2012 (vol. 9 no. 1)
pp. 137-149
Shanika Kuruppu, The University of Melbourne, Parkville
Bryan Beresford-Smith, National ICT Australia, Parkville
Thomas Conway, National ICT Australia, Parkville
Justin Zobel, University of Melbourne, Parkville
Genomic repositories increasingly include individual as well as reference sequences, which tend to share long identical and near-identical strings of nucleotides. However, because most compression algorithms process their input sequentially and the volumes of data are large, these long-range repetitions go undetected. An order-insensitive, disk-based dictionary construction method can detect this repeated content and use it to compress collections of sequences. We explore a dictionary construction method that improves repeat identification in large DNA data sets. Our adaptation of an existing disk-based method, Comrad, identifies exact repeated content in collections of sequences with similarities within and across the set of input sequences. Compressing the data over multiple passes is expensive, but it allows Comrad to handle large data sets within reasonable time and space. Comrad also allows random access to individual sequences and subsequences without decompressing the whole data set. Comrad has no competitor in terms of the size of data sets that it can compress (extending to many hundreds of gigabytes) and, even for smaller data sets, its results are competitive with alternatives; for example, 39 S. cerevisiae genomes compressed to 0.25 bits per base.
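
To make the idea of multi-pass dictionary construction concrete, here is a minimal in-memory sketch. It is not Comrad's algorithm, which the abstract describes only as disk-based and order-insensitive; instead it uses a simplified Re-Pair-style scheme in which each pass counts adjacent symbol pairs across the whole collection and replaces the most frequent pair with a new dictionary symbol. The function names (build_dictionary, replace_pair, expand) and the parameters min_count and max_passes are assumptions made for illustration.

```python
from collections import Counter


def replace_pair(seq, pair, symbol):
    """Replace non-overlapping occurrences of `pair` in `seq`, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def build_dictionary(sequences, min_count=2, max_passes=50):
    """Toy multi-pass dictionary construction over a collection of sequences.

    Illustrative simplification (Re-Pair style), NOT the Comrad procedure.
    Pair counts are pooled over all sequences, so repeats shared across
    sequences are found as well as repeats within one sequence.
    """
    seqs = [list(s) for s in sequences]
    rules = {}  # dictionary symbol -> (left symbol, right symbol)
    for _ in range(max_passes):
        counts = Counter()
        for seq in seqs:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < min_count:
            break
        symbol = f"R{len(rules)}"  # hypothetical naming for new symbols
        rules[symbol] = pair
        seqs = [replace_pair(seq, pair, symbol) for seq in seqs]
    return seqs, rules


def expand(symbols, rules):
    """Decode one compressed sequence using only the shared dictionary."""
    out = []
    stack = list(reversed(symbols))
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            stack.append(right)
            stack.append(left)
        else:
            out.append(s)
    return "".join(out)


sequences = ["ACGTACGTAC", "ACGTACGTTT"]
compressed, rules = build_dictionary(sequences)
# Decode only the second sequence; the first is never touched.
second = expand(compressed[1], rules)
```

Because every sequence is encoded against one shared dictionary, a single sequence can be decoded without expanding the others, which loosely mirrors the random-access property claimed above. A realistic implementation would also have to encode the dictionary and the symbol streams compactly, which this sketch omits.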

Index Terms:
Dictionary construction, compression, DNA, large data sets.
Shanika Kuruppu, Bryan Beresford-Smith, Thomas Conway, Justin Zobel, "Iterative Dictionary Construction for Compression of Large DNA Data Sets," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 1, pp. 137-149, Jan.-Feb. 2012, doi:10.1109/TCBB.2011.82