Text Compression for Dynamic Document Databases
March-April 1997 (vol. 9 no. 2)
pp. 302-313

Abstract—For compression of text databases, semi-static word-based methods provide good performance in terms of both speed and disk space, but two problems arise. First, the memory requirements for the compression model during decoding can be unacceptably high. Second, the need to handle document insertions means that the collection must be periodically recompressed if compression efficiency is to be maintained on dynamic collections. Here we show that with careful management the impact of both of these drawbacks can be kept small. Experiments with a word-based model and over 500 Mb of text show that excellent compression rates can be retained even in the presence of severe memory limitations on the decoder, and after significant expansion in the amount of stored text.
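
As a rough illustration of the trade-off the abstract describes, the sketch below estimates the compressed size of a text under a semi-static word-based Huffman model whose lexicon is capped at a fixed number of words. It is a minimal sketch under stated assumptions, not the method of the paper: the function names (`huffman_code_lengths`, `estimate_bits`), the `max_words` cap, the tokenisation, and the escape scheme (an escape code followed by one length byte and 8 bits per character) are all hypothetical choices made for illustration, and the non-word text between words is ignored.

```python
import heapq
import re
from collections import Counter
from itertools import count

# Hypothetical marker for "word not in the decoder's lexicon".
ESCAPE = object()


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        # A single symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(f, next(tiebreak), [sym]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), syms1 + syms2))
    return lengths


def estimate_bits(text, max_words):
    """Estimate coded size in bits when the model holds at most max_words words.

    First pass: count word frequencies (the semi-static part).
    Second pass: charge each word its Huffman code length; words outside
    the capped lexicon pay for an escape code plus a plain spelling
    (assumed here: one length byte and 8 bits per character).
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    counts = Counter(words)
    kept = dict(counts.most_common(max_words))
    escaped = len(words) - sum(kept.values())

    model = dict(kept)
    if escaped:
        model[ESCAPE] = escaped
    lengths = huffman_code_lengths(model)

    bits = sum(lengths[w] * c for w, c in kept.items())
    for w, c in counts.items():
        if w not in kept:
            bits += c * (lengths[ESCAPE] + 8 * (len(w) + 1))
    return bits


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    for cap in (10, 2):
        print(cap, estimate_bits(sample, cap))
```

Lowering `max_words` shrinks the model the decoder must hold in memory but raises the estimated size, because more words fall through to the escape path. The same escape path is what lets a fixed model cope with words that first appear in documents inserted after the model was built.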

Index Terms:
Document databases, text compression, dynamic databases, word-based compression, Huffman coding.
Citation:
Alistair Moffat, Justin Zobel, Neil Sharman, "Text Compression for Dynamic Document Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 9, no. 2, pp. 302-313, March-April 1997, doi:10.1109/69.591454