Data Compression Conference (DCC '95)
Context models in the MDL framework
Snowbird, Utah
March 28-March 30
ISBN: 0-8186-7012-6
E.S. Ristad, Dept. of Comput. Sci., Princeton Univ., NJ, USA
R.G. Thomas, III, Dept. of Comput. Sci., Princeton Univ., NJ, USA
Current approaches to speech and handwriting recognition demand a strong language model with a small number of states and an even smaller number of parameters. We introduce four new techniques for statistical language models: multicontextual modeling, nonmonotonic contexts, implicit context growth, and the divergence heuristic. Together these techniques result in language models that have few states, even fewer parameters, and low message entropies. For example, our techniques achieve a message entropy of 2.16 bits/char on the Brown corpus using only 19,374 contexts and 54,621 parameters. Multicontextual modeling and nonmonotonic contexts are generalizations of the traditional context model. Implicit context growth ensures that the state transition probabilities of a variable-length Markov process are estimated accurately. This technique is generally applicable to any variable-length Markov process whose state transition probabilities are estimated from string frequencies. In our case, each state in the Markov process represents a context, and implicit context growth conditions the shorter contexts on the fact that the longer contexts did not occur. In a traditional unicontext model, this technique reduces the message entropy of typical English text by 0.1 bits/char. The divergence heuristic is a heuristic estimation algorithm based on Rissanen's (1978, 1983) minimum description length (MDL) principle and universal data compression algorithm.
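The idea of conditioning a shorter context on the fact that a longer context did not predict the symbol can be illustrated with a small sketch. The abstract does not give the authors' estimator, so the code below is not their algorithm: it assumes a generic adaptive back-off context model with a distinct-count escape estimate and PPM-style symbol exclusion as a loose analogue. The function name `message_entropy` and the parameters `max_order`, `use_exclusion`, and `alphabet_size` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def message_entropy(text, max_order=2, use_exclusion=True, alphabet_size=256):
    """Average code length in bits/char under an adaptive back-off context model.

    Illustrative only: the escape estimate and the exclusion rule are
    assumptions, not the estimator described in the paper.
    """
    if not text:
        return 0.0
    counts = defaultdict(Counter)  # context string -> counts of following symbols
    total_bits = 0.0
    for i, sym in enumerate(text):
        excluded = set()  # symbols already ruled out by longer contexts
        prob = 1.0
        coded = False
        # Try the longest available context first, then back off.
        for order in range(min(max_order, i), -1, -1):
            ctx = text[i - order:i]
            active = {s: c for s, c in counts[ctx].items() if s not in excluded}
            total = sum(active.values())
            distinct = len(active)
            if total == 0:
                continue  # nothing usable in this context, back off at no cost
            if sym in active:
                prob *= active[sym] / (total + distinct)
                coded = True
                break
            prob *= distinct / (total + distinct)  # escape to a shorter context
            if use_exclusion:
                # The shorter context is now conditioned on the fact that
                # none of these symbols occurred.
                excluded.update(active)
        if not coded:
            # Uniform fallback over the symbols not yet ruled out.
            prob *= 1.0 / (alphabet_size - len(excluded))
        total_bits += -math.log2(prob)
        # Update the counts of every context order after coding the symbol.
        for order in range(min(max_order, i) + 1):
            counts[text[i - order:i]][sym] += 1
    return total_bits / len(text)


if __name__ == "__main__":
    sample = "abababab" * 10
    print(message_entropy(sample, use_exclusion=True))
    print(message_entropy(sample, use_exclusion=False))
```

Running the function with and without `use_exclusion` on the same text gives a rough feel for how much probability mass is otherwise spent on symbols that a longer context has already ruled out. The figures it produces are not comparable to the 2.16 bits/char reported above, which comes from the authors' own models on the Brown corpus.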
Index Terms:
data compression; speech recognition; handwriting recognition; Markov processes; probability; heuristic programming; context-sensitive languages; speech coding; entropy codes; statistical analysis; multicontextual modeling; statistical language models; nonmonotonic contexts; implicit context growth; divergence heuristic; low message entropies; Brown corpus; state transition probabilities; variable-length Markov process; string frequencies; English text; heuristic estimation algorithm; minimum description length; universal data compression algorithm; code length
Citation:
E.S. Ristad, R.G. Thomas, III, "Context models in the MDL framework," Data Compression Conference (DCC '95), p. 62, 1995