This Article 
 Bibliographic References 
 Add to: 
Compression of Annotated Nucleotide Sequences
July-September 2007 (vol. 4 no. 3)
pp. 447-457
This article introduces an algorithm for the lossless compression of DNA files, which contain annotation text besides the nucleotide sequence. First a grammar is specifically designed to capture the regularities of the annotation text. A revertible transformation uses the grammar rules in order to equivalently represent the original file as a collection of parsed segments and a sequence of decisions made by the grammar parser. This decomposition enables the efficient use of state-of-the-art encoders for processing the parsed segments. The output size of the decision-making process of the grammar is optimized by extending the states to account for high-order Markovian dependencies. The practical implementation of the algorithm achieves a significant improvement when compared to the general-purpose methods currently used for DNA files.

[1] A. Apostolico and A.S. Fraenkel, “Robust Transmission of Unbounded Strings Using Fibonacci Representations,” IEEE Trans. Information Theory, vol. 33, no. 2, pp. 238-245, 1987.
[2] A. Bookstein and S.T. Klein, “Compression, Information Theory and Grammars: A Unified Approach,” ACM Trans. Information Systems, vol. 8, no. 1, pp. 27-49, 1990.
[3] R.D. Cameron, “Source Encoding Using Syntactic Information Source Models,” IEEE Trans. Information Theory, vol. 34, no. 4, pp.843-850, 1988.
[4] X. Chen, S. Kwong, and M. Li, “A Compression Algorithm for DNA Sequences,” IEEE Eng. in Medicine and Biology, pp. 61-66, July/Aug. 2001.
[5] X. Chen, M. Li, B. Ma, and J. Tromp, “DNACompress: Fast and Effective DNA Sequence Compression,” Bioinformatics, vol. 18, pp.1696-1698, 2002.
[6] J. Cheney, “Compressing XML with Multiplexed Hierarchical PPM Models,” Proc. Data Compression Conf. '01, pp. 163-172, 2001.
[7] J.G. Cleary and I.H. Witten, “Data Compression Using Adaptive Coding and Partial String Matching,” IEEE Trans. Comm., vol. 32, no. 4, pp. 396-402, 1984.
[8] G.V. Cormack and R.N. Horspool, “Data Compression Using Dynamic Markov Modelling,” Computer J., vol. 30, no. 6, pp. 541-550, 1987.
[9] S. Grumbach and F. Tahi, “Compression of DNA Sequences,” Proc. Data Compression Conf. '93, pp. 340-350, 1993.
[10] S. Grumbach and F. Tahi, “A New Challenge for Compression Algorithms: Genetic Sequences,” J. Information Processing and Management, vol. 30, no. 6, pp. 875-886, 1994.
[11] J.C. Kieffer and E.-H. Yang, “Grammar-Based Codes: A New Class of Universal Lossless Source Codes,” IEEE Trans. Information Theory, vol. 46, no. 3, pp. 737-754, 2000.
[12] G. Korodi and I. Tabus, “An Efficient Normalized Maximum Likelihood Algorithm for DNA Sequence Compression,” ACM Trans. Information Systems, vol. 23, no. 1, pp. 3-34, 2005.
[13] J.M. Lake, “Prediction by Grammatical Match,” Proc. Data Compression Conf. '00, pp. 153-162, 2000.
[14] K. Lanctot, M. Li, and E. Yang, “Estimating DNA Sequence Entropy,” Proc. 11th Ann. ACM-SIAM Symp. Discrete Algorithms, pp. 409-418, 2000.
[15] H. Liefke and D. Suciu, “XMill: An Efficient Compressor for XML Data,” Proc. Special Interest Group on Management of Data '00, pp.153-164, 2000.
[16] D. Loewenstern and P. Yianilos, “Significantly Lower Entropy Estimates for Natural DNA Sequences,” Proc. Data Compression Conf. '97, pp. 151-160, 1997.
[17] E. Marsh and N. Sager, “Analysis and Processing of Compact Text,” Proc. Ninth Conf. Computational Linguistics, J. Horecký, ed., vol. 1, pp. 201-206, 1982.
[18] T. Matsumoto, K. Sadakane, and H. Imai, “Biological Sequence Compression Algorithms,” Genome Informatics, vol. 11, pp. 43-52, 2000.
[19] A. Moffat, “Implementing the PPM Data Compression Scheme,” IEEE Trans. Comm., vol. 38, no. 11, pp. 1917-1921, 1990.
[20] C.G. Nevill-Manning and I.H. Witten, “Compression and Explanation Using Hierarchical Grammars,” Computer J., vol. 40, nos.2/3, pp. 103-113, 1997.
[21] J. Rissanen and G. Langdon, “Arithmetic Coding,” IBM J. Research and Development, vol. 23, no. 2, pp. 149-162, 1979.
[22] É. Rivals, J.P. Delahaye, M. Dauchet, and O. Delgrange, “A Guaranteed Compression Scheme for Repetitive DNA Sequences,” Technical Report IT-285, LIFL Lille I Univ., 1995.
[23] D. Shkarin, “PPM: One Step to Practicality,” Proc. IEEE Data Compression Conf. '02, pp. 202-211, 2002.
[24] L. Stein, “Genome Annotation: From Sequence to Biology,” Nature Reviews Genetics, vol. 2, no. 7, pp. 493-503, 2001.
[25] I. Tabus, G. Korodi, and J. Rissanen, “DNA Sequence Compression Using the Normalized Maximum Likelihood Model for Discrete Regression,” Proc. Data Compression Conf. '03, pp. 253-262, 2003.
[26] J. Tarhio, “On Compression of Parse Trees,” Proc. Eighth Symp. String Processing and Information Retrieval, pp. 205-211, 2001.
[27] H.E. Williams and J. Zobel, “Compression of Nucleotide Databases for Fast Searching,” Computer Applications in the Biosciences, vol. 13, no. 5, pp. 549-554, 1997.
[28] Nat'l Center for Biotechnology Information, http:/www., 2007.
[29] The DDBJ/EMBL/GenBank Feature Table: Definition, FTindex.html, 2007.

Index Terms:
4 [Data]: Coding and Information Theory | Data compaction and compression, F.4 [Theory of Computation]: Mathematical Logic and Formal Languages | Formal languages, G.3 [Mathematics of Computing]: Probability and Statistics | Markov processes, J.3 [Computer Applications]: Life and Medical Sciences | Biology and genetics, Annotation, Compression, Formal Grammars, Nucleotide sequences
Gergely Korodi, Ioan Tabus, "Compression of Annotated Nucleotide Sequences," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 447-457, July-Sept. 2007, doi:10.1109/tcbb.2007.1017
Usage of this product signifies your acceptance of the Terms of Use.