Block-Oriented Compression Techniques for Large Statistical Databases
March-April 1997 (vol. 9 no. 2)
pp. 314-328

Abstract—Disk I/O has long been a performance bottleneck for very large databases. Database compression can be used to reduce disk I/O bandwidth requirements for large data transfers. In this paper, we explore the compression of large statistical databases and propose techniques for organizing the compressed data such that standard database operations such as retrievals, inserts, deletes and modifications are supported. We examine the applicability and performance of three methods. Two of these are adaptations of existing methods, but the third, called Tuple Differential Coding (TDC) [16], is a new method that allows conventional access mechanisms to be used with the compressed data to provide efficient access. We demonstrate how the performance of queries that involve large data transfers can be improved with these database compression techniques.
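The abstract does not spell out how TDC works. Going only by the title of [16] (records represented as differences between sorted domain ordinals), one plausible reading can be sketched as follows. Everything here is an illustrative assumption rather than the authors' algorithm: the function names, the mixed-radix mapping from a tuple to a single integer, and the fixed `block_size` are all hypothetical.

```python
from typing import List, Sequence, Tuple

Block = Tuple[int, List[int]]  # (first ordinal in block, successive differences)


def tuple_to_ordinal(row: Sequence[int], domain_sizes: Sequence[int]) -> int:
    """Map a tuple of attribute ordinals to one integer (mixed-radix number)."""
    n = 0
    for value, size in zip(row, domain_sizes):
        if not 0 <= value < size:
            raise ValueError("attribute ordinal out of range")
        n = n * size + value
    return n


def ordinal_to_tuple(n: int, domain_sizes: Sequence[int]) -> Tuple[int, ...]:
    """Invert tuple_to_ordinal."""
    values = []
    for size in reversed(domain_sizes):
        n, v = divmod(n, size)
        values.append(v)
    return tuple(reversed(values))


def encode_blocks(rows: Sequence[Sequence[int]],
                  domain_sizes: Sequence[int],
                  block_size: int) -> List[Block]:
    """Sort tuple ordinals, then store each block as a head plus differences."""
    ordinals = sorted(tuple_to_ordinal(r, domain_sizes) for r in rows)
    blocks = []
    for start in range(0, len(ordinals), block_size):
        chunk = ordinals[start:start + block_size]
        diffs = [b - a for a, b in zip(chunk, chunk[1:])]
        blocks.append((chunk[0], diffs))
    return blocks


def decode_block(block: Block, domain_sizes: Sequence[int]) -> List[Tuple[int, ...]]:
    """Rebuild the tuples of one block without touching any other block."""
    head, diffs = block
    ordinals = [head]
    for d in diffs:
        ordinals.append(ordinals[-1] + d)
    return [ordinal_to_tuple(n, domain_sizes) for n in ordinals]


if __name__ == "__main__":
    sizes = [4, 3, 5]  # hypothetical attribute domain sizes
    rows = [(3, 2, 4), (0, 0, 1), (0, 0, 3), (1, 2, 0)]
    for block in encode_blocks(rows, sizes, block_size=2):
        print(block, decode_block(block, sizes))
```

In this sketch each block starts with an uncompressed ordinal and can be decoded on its own, so an ordinary index over block heads could locate a tuple without decompressing the whole file. That is one way to read the abstract's claim about conventional access mechanisms, not a statement of how the paper implements it.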

[1] P. Alsberg, "Space and Time Savings through Large Database Compression and Dynamic Restructuring," Proc. IEEE, vol. 63, pp. 1,114-1,122, Aug. 1975.
[2] M.A. Bassiouni, "Data Compression in Scientific and Statistical Databases," IEEE Trans. Software Eng., vol. 11, no. 10, pp. 1,047-1,058, Oct. 1985.
[3] M.A. Bassiouni and K. Hazboun, "Utilization of Character Reference Locality for Efficient Storage of Databases," Proc. Second Int'l Workshop Statistical Database Management, pp. 338-344, Sept. 1983.
[4] D.S. Batory, "On Searching Transposed Files," ACM Trans. Database Systems, vol. 4, no. 4, pp. 531-544, Dec. 1979.
[5] D.S. Batory, "Index Coding: A Compression Technique for Large Statistical Databases," Proc. Second Int'l Workshop Statistical Database Management, pp. 306-314, Sept. 1983.
[6] T.C. Bell, J.G. Cleary, and I.H. Witten, Text Compression. Englewood Cliffs, N.J.: Prentice Hall, 1990.
[7] G.V. Cormack, "Data Compression on a Database System," Comm. ACM, vol. 28, no. 12, pp. 1,336-1,342, Dec. 1985.
[8] S.J. Eggers and A. Shoshani, "Efficient Access of Compressed Data," Proc. Sixth Int'l Conf. Very Large Data Bases, pp. 205-211, 1980.
[9] S.J. Eggers, F. Olken, and A. Shoshani, "A Compression Technique for Large Statistical Databases," Proc. Seventh Int'l Conf. Very Large Data Bases, pp. 424-434, 1981.
[10] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W.H. Freeman, 1979.
[11] S.W. Golomb, "Run-Length Encoding," IEEE Trans. Information Theory, vol. IT-12, no. 3, pp. 399-401, 1966.
[12] R. Katz, G. Gibson, and D. Patterson, "Disk System Architectures for High Performance Computing," Proc. IEEE, vol. 77, no. 12, pp. 1,842-1,858, Dec. 1989.
[13] G.G. Langdon, "An Introduction to Arithmetic Coding," IBM J. Research and Development, vol. 28, no. 2, pp. 135-149, 1984.
[14] J.Z. Li, D. Rotem, and H.K.T. Wong, "A New Compression Method with Fast Searching on Large Databases," Proc. 13th Int'l Conf. Very Large Data Bases, pp. 311-318, 1987.
[15] W.K. Ng and C.V. Ravishankar, "Attribute Enumerative Coding: A Compression Technique for Tuple Data Structures," Proc. Fourth Data Compression Conf., p. 461, Snowbird, Utah, Mar. 29-31, 1994.
[16] W.K. Ng and C.V. Ravishankar, "Data Compression System and Method Representing Records as Differences between Sorted Domain Ordinals Representing Field Values," U.S. Patent No. 5,603,022, Feb. 1997.
[17] W.K. Ng and C.V. Ravishankar, "A Physical Storage Model for Efficient Statistical Query Processing," Proc. Seventh IEEE Int'l Working Conf. Statistical and Scientific Databases, pp. 97-106, Charlottesville, Va., Sept. 28-30, 1994.
[18] W.K. Ng and C.V. Ravishankar, "Relational Database Compression Using Augmented Vector Quantization," Proc. 11th IEEE Int'l Conf. Data Eng., pp. 540-549, Taipei, Taiwan, Mar. 6-10, 1995.
[19] "Census of Population and Housing, 1990: Public Use Microdata Samples U.S.," machine readable data files prepared by the Bureau of the Census. Washington, D.C., 1992.
[20] J.J. Rissanen and G.G. Langdon, "Universal Modeling and Coding," IEEE Trans. Information Theory, vol. 27, no. 1, pp. 12-23, 1981.
[21] F. Rubin, "Experiments in Text File Compression," Comm. ACM, vol. 19, no. 11, pp. 617-623, Nov. 1976.
[22] D.G. Severance, "A Practitioner's Guide to Database Compression: Tutorial," Information Systems, vol. 8, no. 1, pp. 51-62, 1983.
[23] C.E. Shannon, "A Mathematical Theory of Communication," Bell System Technical J., vol. 27, no. 3, pp. 379-423, 1948.
[24] A. Shoshani and H.K.T. Wong, "Statistical and Scientific Database Issues," IEEE Trans. Software Eng., Oct. 1985.
[25] H. Tanaka and A. Leon-Garcia, "Efficient Run-Length Encodings," IEEE Trans. Information Theory, vol. 28, no. 6, pp. 880-890, June 1982.
[26] T.A. Welch, "A Technique for High Performance Data Compression," Computer, vol. 17, no. 6, pp. 8-19, June 1984.
[27] G. Wiederhold, Database Design, Computer Science Series, second ed. New York: McGraw Hill, 1983.
[28] R.N. Williams, Adaptive Data Compression. Boston: Kluwer Academic, 1991.
[29] H.K.T. Wong, H.F. Liu, F. Olken, D. Rotem, and L. Wong, "Bit Transposed Files," Proc. 11th Int'l Conf. Very Large Data Bases, pp. 448-457, 1985.
[30] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Trans. Information Theory, vol. 23, no. 3, pp. 337-343, 1977.
[31] J. Ziv and A. Lempel, "Compression of Individual Sequences via Variable-Rate Coding," IEEE Trans. Information Theory, vol. 24, no. 5, pp. 530-536, 1978.

Index Terms:
Database compression, data compression, physical organization, statistical database.
Citation:
Wee Keong Ng, Chinya V. Ravishankar, "Block-Oriented Compression Techniques for Large Statistical Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 9, no. 2, pp. 314-328, March-April 1997, doi:10.1109/69.591455