Inducing Features of Random Fields
April 1997 (vol. 19 no. 4)
pp. 380-393

Abstract—We present a technique for constructing random fields from a set of training samples. The learning paradigm builds increasingly complex fields by allowing potential functions, or features, that are supported by increasingly large subgraphs. Each feature has a weight that is trained by minimizing the Kullback-Leibler divergence between the model and the empirical distribution of the training data. A greedy algorithm determines how features are incrementally added to the field and an iterative scaling algorithm is used to estimate the optimal values of the weights. The random field models and techniques introduced in this paper differ from those common to much of the computer vision literature in that the underlying random fields are non-Markovian and have a large number of parameters that must be estimated. Relations to other learning approaches, including decision trees, are given. As a demonstration of the method, we describe its application to the problem of automatic word classification in natural language processing.
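The loop the abstract describes (greedily add the feature that most reduces the Kullback-Leibler divergence, then re-estimate all weights) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy configuration space of three binary variables, the four candidate features, and the names SPACE, CANDIDATES, EMPIRICAL, and induce. The weight update is also a simplification: it uses the closed-form optimal step for a single binary feature, applied cyclically, rather than the iterative scaling algorithm of the paper, and it is valid only for 0/1-valued features.

```python
import itertools
import math

# Toy configuration space (assumption): all binary strings of length 3.
SPACE = list(itertools.product([0, 1], repeat=3))

# Hypothetical pool of binary candidate features.
CANDIDATES = {
    "x0==x1": lambda x: int(x[0] == x[1]),
    "x2==1": lambda x: int(x[2] == 1),
    "x0==1": lambda x: int(x[0] == 1),
    "x1==x2": lambda x: int(x[1] == x[2]),
}


def normalize(scores):
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical empirical distribution, built so that two candidates explain it.
EMPIRICAL = normalize(
    [(3 if x[0] == x[1] else 1) * (2 if x[2] == 1 else 1) for x in SPACE]
)


def model_probs(features, weights):
    """Gibbs distribution p(x) proportional to exp(sum_i w_i * f_i(x))."""
    return normalize(
        [math.exp(sum(w * f(x) for f, w in zip(features, weights))) for x in SPACE]
    )


def expectation(probs, f):
    return sum(p * f(x) for p, x in zip(probs, SPACE))


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) over SPACE."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def best_step(e, m):
    """Weight change that makes a binary feature's model mean m equal e."""
    return math.log(e * (1 - m) / (m * (1 - e)))


def gain(e, m):
    """Divergence reduction from adding one binary feature, other weights fixed."""
    return e * math.log(e / m) + (1 - e) * math.log((1 - e) / (1 - m))


def induce(empirical, candidates, n_features=2, sweeps=20):
    names, weights = [], []
    for _ in range(n_features):
        feats = [candidates[n] for n in names]
        probs = model_probs(feats, weights)
        # Greedy step: score each unused candidate by its gain.
        scored = []
        for name, f in candidates.items():
            e = expectation(empirical, f)
            if name not in names and 0 < e < 1:
                scored.append((gain(e, expectation(probs, f)), name))
        names.append(max(scored)[1])
        weights.append(0.0)
        feats = [candidates[n] for n in names]
        # Re-estimate every weight by cyclic closed-form updates.
        for _ in range(sweeps):
            for i, f in enumerate(feats):
                probs = model_probs(feats, weights)
                weights[i] += best_step(
                    expectation(empirical, f), expectation(probs, f)
                )
    return names, weights
```

Calling induce(EMPIRICAL, CANDIDATES) returns the selected feature names with their fitted weights; passing those back through model_probs and kl shows how far the induced field remains from the empirical distribution.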

Index Terms:
Random field, Kullback-Leibler divergence, iterative scaling, maximum entropy, EM algorithm, statistical learning, clustering, word morphology, natural language processing.
Stephen Della Pietra, Vincent Della Pietra, John Lafferty, "Inducing Features of Random Fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 4, pp. 380-393, April 1997, doi:10.1109/34.588021