Data Compression Conference (2008)
Mar. 25, 2008 to Mar. 27, 2008
ISSN: 1068-0314
ISBN: 978-0-7695-3121-2
pp: 162-171
ABSTRACT
Semistatic word-based byte-oriented compression codes are known to be attractive alternatives for compressing natural language texts. With compression ratios around 30%, they allow direct pattern searching on the compressed text up to 8 times faster than on its uncompressed version. In this paper we reveal that these compressors have even more benefits. We show that most state-of-the-art compressors, such as the block-wise bzip2, those from the Ziv-Lempel family, and the predictive ppm-based ones, can benefit from compressing not the original text but its compressed representation obtained by a word-based byte-oriented statistical compressor.

In particular, our experimental results show that using Dense-Code-based compression as a preprocessing step for classical compressors like bzip2, gzip, or ppmdi yields several important benefits. For example, the ppm family is known for achieving the best compression ratios. With Dense coding as a preprocessing step, ppmdi achieves even better compression ratios (the best we know of on natural language) and much faster compression and decompression than ppmdi alone.

Text indexing also profits from our preprocessing step. A compressed self-index achieves much better space and time performance when preceded by a semistatic word-based compression step. We show, for example, that the AF-FMindex coupled with Tagged Huffman coding is an attractive alternative index for natural language texts.
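The preprocessing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes End-Tagged Dense Code as the word-based byte-oriented code, a simplified tokenizer that treats runs of word characters and runs of separators as symbols, and bzip2 (via the standard `bz2` module) as the second-stage compressor. The function names are hypothetical, and the vocabulary is returned separately rather than stored in the compressed output.

```python
import bz2
import re
from collections import Counter


def etdc_encode(rank):
    """Encode a 0-based frequency rank as an End-Tagged Dense Code codeword.

    The last byte has its high bit set (value >= 128); earlier bytes are < 128.
    """
    out = [128 + rank % 128]
    rank = rank // 128 - 1
    while rank >= 0:
        out.append(rank % 128)
        rank = rank // 128 - 1
    return bytes(reversed(out))


def etdc_decode(data):
    """Decode a byte sequence of concatenated codewords back into ranks."""
    ranks = []
    value = 0
    for b in data:
        if b < 128:
            # Continuation byte: accumulate and shift.
            value = (value + b + 1) * 128
        else:
            # Tagged byte: closes the current codeword.
            ranks.append(value + b - 128)
            value = 0
    return ranks


def dense_preprocess(text):
    """Semistatic pass: rank symbols by frequency, then encode the text."""
    # Simplified tokenizer (an assumption): word runs and separator runs.
    tokens = re.findall(r"\w+|\W+", text)
    vocab = [tok for tok, _ in Counter(tokens).most_common()]
    rank_of = {tok: i for i, tok in enumerate(vocab)}
    encoded = b"".join(etdc_encode(rank_of[t]) for t in tokens)
    return vocab, encoded


def boosted_compress(text):
    """Apply the dense code first, then bzip2 on the resulting byte stream."""
    vocab, encoded = dense_preprocess(text)
    return vocab, bz2.compress(encoded)


def boosted_decompress(vocab, blob):
    """Invert both stages and rebuild the original text."""
    return "".join(vocab[r] for r in etdc_decode(bz2.decompress(blob)))
```

Calling `boosted_compress(text)` returns the vocabulary and the bzip2-compressed dense-coded stream, and `boosted_decompress(vocab, blob)` restores the text. Any size comparison against `bz2.compress(text.encode())` alone is only meaningful on large natural language collections and should also count the space needed for the vocabulary.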
INDEX TERMS
Text compression, compression boosting, indexing
CITATION
Antonio Fariña, José R. Paramá, Gonzalo Navarro, "Word-Based Statistical Compressors as Natural Language Compression Boosters", Data Compression Conference, vol. 00, pp. 162-171, 2008, doi:10.1109/DCC.2008.14