Universal Text Preprocessing for Data Compression
May 2005 (vol. 54 no. 5)
pp. 497-507
Jürgen Abel, IEEE
Several complementary preprocessing algorithms for text files are presented; they are applied before the compression scheme itself. The algorithms need no external dictionary and are language independent. The compression gain and the associated cost in speed are compared for the BWT, PPM, and LZ compression schemes. The average overall compression gain is in the range of 3 to 5 percent for the text files of the Calgary Corpus and between 2 and 9 percent for the text files of the large Canterbury Corpus.
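
To make "compression gain" concrete, here is a minimal sketch of how such a gain could be measured for one reversible preprocessing step. Everything in it is an assumption for illustration and is not taken from the paper: the toy transform (`fold_case`, which flags capitals with an escape byte), the escape value `ESC`, and the definition of gain as the percentage reduction in compressed size. Python's standard library offers a BWT-based compressor (`bz2`) and an LZ77-based one (`zlib`) but no PPM implementation, so only those two are shown.

```python
import bz2
import zlib

ESC = 0x01  # hypothetical escape byte; assumed to be absent from the input text


def fold_case(data: bytes) -> bytes:
    """Toy preprocessing: replace each ASCII capital with ESC + its lowercase form."""
    if ESC in data:
        raise ValueError("input already contains the escape byte")
    out = bytearray()
    for b in data:
        if 0x41 <= b <= 0x5A:  # 'A'..'Z'
            out.append(ESC)
            out.append(b + 0x20)  # lowercase counterpart
        else:
            out.append(b)
    return bytes(out)


def unfold_case(data: bytes) -> bytes:
    """Inverse of fold_case: ESC + lowercase letter becomes the capital again."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b == ESC:
            out.append(next(it) - 0x20)
        else:
            out.append(b)
    return bytes(out)


def gain_percent(data: bytes, compress) -> float:
    """Percentage reduction in compressed size due to preprocessing (assumed definition)."""
    base = len(compress(data))
    pre = len(compress(fold_case(data)))
    return 100.0 * (base - pre) / base


if __name__ == "__main__":
    sample = b"The quick brown fox. the Quick Brown Fox jumps.\n" * 200
    assert unfold_case(fold_case(sample)) == sample  # the transform is lossless
    for name, compress in (("bz2", bz2.compress), ("zlib", zlib.compress)):
        print(f"{name}: gain {gain_percent(sample, compress):+.2f}%")
```

The gain can be negative: a transform that enlarges the input only pays off if the compressor models the transformed text better, which is why the abstract reports gains per compression scheme and per corpus.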

Index Terms:
Algorithms, data compression, BWT, LZ, PPM, preprocessing, text compression.
Citation:
Jürgen Abel, William Teahan, "Universal Text Preprocessing for Data Compression," IEEE Transactions on Computers, vol. 54, no. 5, pp. 497-507, May 2005, doi:10.1109/TC.2005.85