Data Compression Conference (DCC '96)
Compressing semi-structured text using hierarchical phrase identifications
Snowbird, UT
March 31-April 03
ISBN: 0-8186-7358-3
C.G. Nevill-Manning, Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand
I.H. Witten, Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand
D.R. Olsen, Jr., Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand
J.A. Storer, Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand
M. Cohn, Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand
This paper takes a compression scheme that infers a hierarchical grammar from its input and investigates its application to semi-structured text. Although a huge range and variety of data falls within the ambit of "semi-structured", we focus on one particular, and very large, example of such text. Consequently, the work is a case study of the application of grammar-based compression to a large-scale problem. We begin by identifying some characteristics of semi-structured text that have special relevance to data compression. We then give a brief account of a particular large textual database and describe a compression scheme that exploits its structure. In addition to providing compression, the system gives some insight into the structure of the database. Finally, we show how the hierarchical grammar can be generalized, first manually and then automatically, to yield further improvements in compression performance.
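The abstract does not spell out how the hierarchical grammar is inferred, so the following is only an illustrative sketch of the general idea, not the paper's algorithm. It swaps in a simple offline technique, repeatedly replacing the most frequent adjacent pair of symbols with a new rule, to show how repeated phrases in semi-structured text can end up as nested grammar rules. The names `infer_grammar` and `expand`, the `R0`, `R1`, ... rule labels, and the sample record text are all assumptions made for this example.

```python
from collections import Counter


def infer_grammar(tokens):
    """Build a hierarchical grammar by repeated pair replacement.

    Illustrative only: this is a generic pair-replacement sketch, not the
    scheme described in the paper. Assumes no input token already looks
    like a rule label ("R0", "R1", ...).
    """
    seq = list(tokens)
    rules = {}
    next_id = 0
    while True:
        # Count every adjacent pair in the current sequence.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so nothing more to factor out
        name = f"R{next_id}"
        next_id += 1
        rules[name] = pair
        # Rewrite the sequence left to right, replacing non-overlapping
        # occurrences of the chosen pair with the new rule label.
        out = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Recursively expand a symbol back into its original tokens."""
    if symbol in rules:
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)
    return [symbol]


# Hypothetical semi-structured records with repeated field markers.
tokens = "name : smith ; name : jones ; name : smith".split()
top, rules = infer_grammar(tokens)
restored = [t for s in top for t in expand(s, rules)]
```

Because later rules can refer to earlier ones, the result is a hierarchy of phrases rather than a flat dictionary. Expanding the top-level sequence reproduces the input exactly, so the grammar is a lossless representation that a compressor could then encode.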
Index Terms:
word processing; data compression; database management systems; data structures; grammars; large-scale systems; hierarchical phrase identifications; semistructured text compression; hierarchical grammar; grammar based compression; large-scale problem; large textual database; database structure; compression performance
Citation:
C.G. Nevill-Manning, I.H. Witten, D.R. Olsen, Jr., J.A. Storer, M. Cohn, "Compressing semi-structured text using hierarchical phrase identifications," Data Compression Conference (DCC '96), p. 63, 1996