Maximum Margin Bayesian Network Classifiers
March 2012 (vol. 34 no. 3)
pp. 521-532
M. Wohlmayr, Dept. of Electr. Eng., Graz Univ. of Technol., Graz, Austria
F. Pernkopf, Dept. of Electr. Eng., Graz Univ. of Technol., Graz, Austria
S. Tschiatschek, Dept. of Electr. Eng., Graz Univ. of Technol., Graz, Austria
We present a maximum margin parameter learning algorithm for Bayesian network classifiers using a conjugate gradient (CG) method for optimization. In contrast to previous approaches, we maintain the normalization constraints on the parameters of the Bayesian network during optimization, i.e., the probabilistic interpretation of the model is not lost. This enables us to handle missing features in discriminatively optimized Bayesian networks. In experiments, we compare the classification performance of maximum margin parameter learning to conditional likelihood and maximum likelihood learning approaches. Discriminative parameter learning significantly outperforms generative maximum likelihood estimation for naive Bayes and tree augmented naive Bayes structures on all considered data sets. Furthermore, maximizing the margin dominates the conditional likelihood approach in terms of classification performance in most cases. We provide results for a recently proposed maximum margin optimization approach based on convex relaxation [1]. While the classification results are highly similar, our CG-based optimization is computationally up to orders of magnitude faster. Margin-optimized Bayesian network classifiers achieve classification performance comparable to support vector machines (SVMs) using fewer parameters. Moreover, we show that unanticipated missing feature values during classification can be easily processed by discriminatively optimized Bayesian network classifiers, a case where discriminative classifiers usually require mechanisms to complete unknown feature values in the data first.
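The approach the abstract describes can be sketched on a toy naive Bayes classifier. This is an illustrative simplification, not the paper's implementation: all names are made up, plain finite-difference gradient descent stands in for the conjugate-gradient optimizer, and a hinge on the per-sample margins stands in for the paper's smoothed margin criterion. It does show the two key points of the abstract: parameters stay normalized (via a softmax reparameterization), so the probabilistic interpretation is kept, and missing features at classification time are handled by simply marginalizing them out.

```python
import numpy as np

def log_softmax(w, axis=-1):
    # Numerically stable log-softmax; keeps each conditional distribution
    # normalized, so the model retains its probabilistic interpretation.
    w = w - w.max(axis=axis, keepdims=True)
    return w - np.log(np.exp(w).sum(axis=axis, keepdims=True))

def log_joint(theta, x):
    # log P(c, x) for every class c of a naive Bayes model. A negative
    # feature value marks a missing feature: its factor is skipped,
    # i.e., the feature is marginalized out.
    wc, wf = theta
    logp = log_softmax(wc).copy()
    for i, v in enumerate(x):
        if v >= 0:
            logp = logp + log_softmax(wf[i], axis=1)[:, v]
    return logp

def margin(theta, x, c):
    # Log-odds of the true class against the strongest competing class.
    lj = log_joint(theta, x)
    return lj[c] - np.delete(lj, c).max()

def loss(theta, X, y, gamma=1.0):
    # Hinge on the per-sample margins (a crude stand-in for the paper's
    # smoothed margin objective).
    return sum(max(0.0, gamma - margin(theta, x, c)) for x, c in zip(X, y))

def train(X, y, C, V, steps=300, lr=0.1, eps=1e-5):
    # Finite-difference gradient descent on the unconstrained logits; the
    # paper instead uses conjugate gradients, which this toy loop only
    # approximates in spirit.
    wc = np.zeros(C)                      # class "prior" logits
    wf = np.zeros((X.shape[1], C, V))     # per-feature conditional logits
    for _ in range(steps):
        for arr in (wc, wf):
            g = np.zeros_like(arr)
            for idx in np.ndindex(arr.shape):
                old = arr[idx]
                arr[idx] = old + eps
                lp = loss((wc, wf), X, y)
                arr[idx] = old - eps
                lm = loss((wc, wf), X, y)
                arr[idx] = old
                g[idx] = (lp - lm) / (2 * eps)
            arr -= lr * g
    return wc, wf

def predict(theta, x):
    return int(np.argmax(log_joint(theta, x)))

# Toy data: two binary features; the class is determined by the first one.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
theta = train(X, y, C=2, V=2)
```

After training, `predict(theta, [1, -1])` classifies a sample whose second feature is missing by using only the class prior and the first feature's factor, which is exactly the marginalization the abstract highlights as an advantage over purely discriminative classifiers.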

References
[1] Y. Guo, D. Wilkinson, and D. Schuurmans, “Maximum Margin Bayesian Networks,” Proc. Int'l Conf. Uncertainty in Artificial Intelligence, pp. 233-242, 2005.
[2] V. Vapnik, Statistical Learning Theory. Wiley & Sons, 1998.
[3] B. Taskar, C. Guestrin, and D. Koller, “Max-Margin Markov Networks,” Proc. Advances in Neural Information Processing Systems, 2003.
[4] H. Wettig, P. Grünwald, T. Roos, P. Myllymäki, and H. Tirri, “When Discriminative Learning of Bayesian Network Parameters Is Easy,” Proc. Int'l Joint Conf. Artificial Intelligence, pp. 491-496, 2003.
[5] T. Roos, H. Wettig, P. Grünwald, P. Myllymäki, and H. Tirri, “On Discriminative Bayesian Network Classifiers and Logistic Regression,” Machine Learning, vol. 59, pp. 267-296, 2005.
[6] F. Sha and L. Saul, “Comparison of Large Margin Training to Other Discriminative Methods for Phonetic Recognition by Hidden Markov Models,” Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 313-316, 2007.
[7] G. Heigold, T. Deselaers, R. Schlüter, and H. Ney, “Modified MMI/MPE: A Direct Evaluation of the Margin in Speech Recognition,” Proc. Int'l Conf. Machine Learning, pp. 384-391, 2008.
[8] R. Collobert, F. Sinz, J. Weston, and L. Bottou, “Trading Convexity for Scalability,” Proc. Int'l Conf. Machine Learning, pp. 201-208, 2006.
[9] C. Bishop, Neural Networks for Pattern Recognition. Oxford Univ. Press, 1995.
[10] R. Greiner, X. Su, S. Shen, and W. Zhou, “Structural Extension to Logistic Regression: Discriminative Parameter Learning of Belief Net Classifiers,” Machine Learning, vol. 59, pp. 297-322, 2005.
[11] P.S. Gopalakrishnan, D. Kanevsky, A. Nádas, and D. Nahamoo, “An Inequality for Rational Functions with Applications to Some Statistical Estimation Problems,” IEEE Trans. Information Theory, vol. 37, no. 1, pp. 107-113, Jan. 1991.
[12] F. Pernkopf and M. Wohlmayr, “On Discriminative Parameter Learning of Bayesian Network Classifiers,” Proc. European Conf. Machine Learning, pp. 221-237, 2009.
[13] P. Woodland and D. Povey, “Large Scale Discriminative Training of Hidden Markov Models for Speech Recognition,” Computer Speech and Language, vol. 16, pp. 25-47, 2002.
[14] R. Schlüter, W. Macherey, B. Müller, and H. Ney, “Comparison of Discriminative Training Criteria and Optimization Methods for Speech Recognition,” Speech Comm., vol. 34, pp. 287-310, 2001.
[15] F. Pernkopf and M. Wohlmayr, “Maximum Margin Bayesian Network Classifiers,” technical report, Inst. Signal Processing and Speech Comm., Graz Univ. of Technology, 2010.
[16] F. Pernkopf and M. Wohlmayr, “Large Margin Learning of Bayesian Classifiers Based on Gaussian Mixture Models,” Proc. European Conf. Machine Learning, pp. 50-66, 2010.
[17] L. Lamel, R. Kassel, and S. Seneff, “Speech Database Development: Design and Analysis of the Acoustic-Phonetic Corpus,” Proc. US Defense Advanced Research Projects Agency Speech Recognition Workshop, 1986.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning Applied to Document Recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[19] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[20] F. Pernkopf and J. Bilmes, “Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers,” J. Machine Learning Research, vol. 11, pp. 2323-2360, 2010.
[21] N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian Network Classifiers,” Machine Learning, vol. 29, pp. 131-163, 1997.
[22] P. Domingos and M. Pazzani, “On the Optimality of the Simple Bayesian Classifier under Zero-One Loss,” Machine Learning, vol. 29, nos. 2/3, pp. 103-130, 1997.
[23] J. Bilmes, “Dynamic Bayesian Multinets,” Proc. 16th Int'l Conf. Uncertainty in Artificial Intelligence, pp. 38-45, 2000.
[24] R. Cowell, A. Dawid, S. Lauritzen, and D. Spiegelhalter, Probabilistic Networks and Expert Systems. Springer Verlag, 1999.
[25] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[26] S. Acid, L. de Campos, and J. Castellano, “Learning Bayesian Network Classifiers: Searching in a Space of Partially Directed Acyclic Graphs,” Machine Learning, vol. 59, pp. 213-235, 2005.
[27] B. Schölkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[28] P. Huber, “Robust Estimation of a Location Parameter,” Annals of Math. Statistics, vol. 35, no. 1, pp. 73-101, 1964.
[29] O. Chapelle, “Training a Support Vector Machine in the Primal,” Neural Computation, vol. 19, no. 5, pp. 1155-1178, 2007.
[30] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery, Numerical Recipes in C. Cambridge Univ. Press, 1992.
[31] T. Cover and J. Thomas, Elements of Information Theory. John Wiley & Sons, 1991.
[32] E. Keogh and M. Pazzani, “Learning Augmented Bayesian Classifiers: A Comparison of Distribution-Based and Classification-Based Approaches,” Proc. Workshop Artificial Intelligence and Statistics, pp. 225-230, 1999.
[33] F. Pernkopf, “Bayesian Network Classifiers versus Selective $k$-NN Classifier,” Pattern Recognition, vol. 38, no. 3, pp. 1-10, 2005.
[34] D. Grossman and P. Domingos, “Learning Bayesian Network Classifiers by Maximizing Conditional Likelihood,” Proc. Int'l Conf. Machine Learning, pp. 361-368, 2004.
[35] P. Bartlett, M. Jordan, and J. McAuliffe, “Convexity, Classification, and Risk Bounds,” J. Am. Statistical Assoc., vol. 101, no. 473, pp. 138-156, 2006.
[36] F. Pernkopf and M. Wohlmayr, “Stochastic Margin-Based Structure Learning of Bayesian Network Classifiers,” technical report, Laboratory of Signal Processing and Speech Comm., Graz Univ. of Technology, 2011.
[37] F. Pernkopf and J. Bilmes, “Order-Based Discriminative Structure Learning for Bayesian Network Classifiers,” Proc. Int'l Symp. Artificial Intelligence and Math., 2008.
[38] U. Fayyad and K. Irani, “Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning,” Proc. Int'l Joint Conf. Artificial Intelligence, pp. 1022-1027, 1993.
[39] F. Pernkopf, T. Van Pham, and J. Bilmes, “Broad Phonetic Classification Using Discriminative Bayesian Networks,” Speech Comm., vol. 143, no. 1, pp. 123-138, 2008.
[40] A. Wächter and L. Biegler, “On the Implementation of an Interior-Point Filter Line-Search Algorithm for Large-Scale Nonlinear Programming,” Math. Programming, vol. 106, pp. 25-57, 2006.
[41] L. Biegler and V. Zavala, “Large-Scale Nonlinear Programming Using IPOPT: An Integrating Framework for Enterprise-Wide Dynamic Optimization,” Computers & Chemical Eng., vol. 33, no. 3, pp. 575-582, 2009.
[42] P. Amestoy, I. Duff, J.-Y. L'Excellent, and J. Koster, “MUMPS: A General Purpose Distributed Memory Sparse Solver,” Proc. Fifth Int'l Workshop Applied Parallel Computing, pp. 122-131, 2000.
[43] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge Univ. Press, Mar. 2004.

Index Terms:
pattern classification, belief networks, conjugate gradient methods, convex programming, feature extraction, learning (artificial intelligence), maximum likelihood estimation, margin-optimized Bayesian network classifiers, maximum margin parameter learning, conjugate gradient method, normalization constraints, probabilistic interpretation, missing feature handling, conditional likelihood learning, maximum likelihood learning, discriminative parameter learning, convex relaxation, CG-based optimization, Bayesian methods, optimization, random variables, training, algorithm design and analysis, Bayesian network classifier, discriminative learning, discriminative classifiers, large margin training, missing features
M. Wohlmayr, F. Pernkopf, S. Tschiatschek, "Maximum Margin Bayesian Network Classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 521-532, March 2012, doi:10.1109/TPAMI.2011.149