Data Compression Conference (2007)

Snowbird, Utah

Mar. 27, 2007 to Mar. 29, 2007

ISSN: 1068-0314

ISBN: 0-7695-2791-4

pp: 43-52

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/DCC.2007.7

Minh Duc Cao, Monash University, Australia

Trevor I. Dix , Monash University, Australia

Lloyd Allison , Monash University, Australia

Chris Mears , Monash University, Australia

ABSTRACT

This paper introduces a novel algorithm for biological sequence compression that makes use of both statistical properties and repetition within sequences. A panel of experts is maintained to estimate the probability distribution of the next symbol in the sequence to be encoded. Expert probabilities are combined to obtain the final distribution. The resulting information sequence provides insight for further study of the biological sequence. Each symbol is then encoded by arithmetic coding. Experiments show that our algorithm outperforms existing compressors on typical DNA and protein sequence datasets while maintaining a practical running time.
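
As a rough illustration of the idea described in the abstract (several experts each predict the next symbol, their predictions are blended, and the blended probability determines the cost of coding the symbol), a minimal Python sketch follows. It is not the authors' algorithm: the `OrderKExpert` class, the add-one smoothing, the `decay` parameter, and the weighting of experts by their recent performance are all assumptions made for this example. The arithmetic coder itself is omitted; the sketch only computes the per-symbol information content, -log2(p), which is the code length an ideal arithmetic coder would approach.

```python
import math

ALPHABET = "ACGT"  # assumed DNA alphabet for this illustration


class OrderKExpert:
    """Hypothetical expert: adaptive order-k context model, add-one smoothing."""

    def __init__(self, k):
        self.k = k
        self.counts = {}  # context string -> {symbol: count}

    def _context(self, history):
        # The last k symbols seen so far (empty context when k == 0).
        return history[-self.k:] if self.k else ""

    def predict(self, history):
        seen = self.counts.get(self._context(history), {})
        total = sum(seen.values()) + len(ALPHABET)
        return {s: (seen.get(s, 0) + 1) / total for s in ALPHABET}

    def update(self, history, symbol):
        seen = self.counts.setdefault(self._context(history), {})
        seen[symbol] = seen.get(symbol, 0) + 1


def information_sequence(seq, experts, decay=0.9):
    """Return the cost in bits of each symbol under the blended prediction.

    Assumption: each expert is weighted by the (exponentially decayed)
    product of the probabilities it assigned to the symbols seen so far.
    """
    log_w = [0.0] * len(experts)  # log-weights; all experts start equal
    info = []
    for i, sym in enumerate(seq):
        history = seq[:i]
        preds = [e.predict(history) for e in experts]

        # Normalise the weights in a numerically stable way.
        top = max(log_w)
        w = [math.exp(lw - top) for lw in log_w]
        z = sum(w)

        # Blend the expert distributions into one final distribution.
        blend = {
            s: sum(wi * p[s] for wi, p in zip(w, preds)) / z
            for s in ALPHABET
        }
        info.append(-math.log2(blend[sym]))

        # Reward experts that predicted the symbol well, then let them learn.
        for j, (expert, p) in enumerate(zip(experts, preds)):
            log_w[j] = decay * log_w[j] + math.log(p[sym])
            expert.update(history, sym)
    return info


if __name__ == "__main__":
    sequence = "ACGT" * 50
    bits = information_sequence(sequence, [OrderKExpert(0), OrderKExpert(2)])
    print(f"average cost: {sum(bits) / len(bits):.3f} bits per symbol")
```

In this sketch, the list returned by `information_sequence` plays the role of the "information sequence" mentioned in the abstract: low values mark positions the experts found predictable, such as repeated regions, and high values mark surprising ones.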

CITATION

Minh Duc Cao,
Trevor I. Dix,
Lloyd Allison,
Chris Mears,
"A Simple Statistical Algorithm for Biological Sequence Compression",

*Data Compression Conference*, pp. 43-52, 2007, doi:10.1109/DCC.2007.7