Bregman Divergences and Surrogates for Learning
November 2009 (vol. 31 no. 11)
pp. 2048-2059
Richard Nock, Université Antilles-Guyane, CEREGMIA-UFR Droit et Sciences Economiques, France
Frank Nielsen, Ecole Polytechnique, France
Bartlett et al. (2006) recently proved that a ground condition for surrogates, classification calibration, ties their consistent minimization to that of the classification risk, and left the algorithmic questions about their minimization as an important problem. In this paper, we address this problem for a wide set that lies at the intersection of classification-calibrated surrogates and those of Murata et al. (2004). This set coincides with the set of surrogates satisfying three common assumptions. Equivalent expressions for the members, some of them well known, follow for convex and concave surrogates, which are frequently used in the induction of linear separators and decision trees. Most notably, they share remarkable algorithmic features: for each of these two types of classifiers, we give a minimization algorithm that provably converges to the minimum of any such surrogate. Although these algorithms seem different, we show that they are offshoots of the same “master” algorithm. This provides a new, broad, and unified account of different popular algorithms, including additive regression with the squared loss, the logistic loss, and the top-down induction performed in CART and C4.5. Moreover, we show that the induction enjoys the most popular boosting features, regardless of the surrogate. Experiments are provided on 40 readily available domains.
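
The abstract does not define a Bregman divergence. As background, the standard textbook definition for a differentiable convex generator phi is D_phi(x || y) = phi(x) - phi(y) - (x - y) * phi'(y). The minimal Python sketch below evaluates it for two scalar generators. The function names are hypothetical and chosen for illustration only; nothing here is taken from the paper's algorithms. The two generators are the ones usually associated with the squared loss and the logistic loss named in the abstract, but how the paper itself connects them is not stated on this page.

```python
import math


def bregman_divergence(phi, grad_phi, x, y):
    """Standard scalar Bregman divergence D_phi(x || y).

    phi is assumed convex and differentiable; grad_phi is its derivative.
    """
    return phi(x) - phi(y) - (x - y) * grad_phi(y)


# Generator 1 (illustrative): phi(x) = x^2, whose divergence is (x - y)^2.
def square(x):
    return x * x


def square_grad(x):
    return 2.0 * x


# Generator 2 (illustrative): binary negative entropy on the open interval (0, 1),
# whose divergence is the binary Kullback-Leibler divergence.
def neg_entropy(p):
    return p * math.log(p) + (1.0 - p) * math.log(1.0 - p)


def neg_entropy_grad(p):
    return math.log(p / (1.0 - p))


def binary_kl(p, q):
    """Binary KL divergence, written out directly for comparison."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


if __name__ == "__main__":
    print(bregman_divergence(square, square_grad, 3.0, 1.0))
    print(bregman_divergence(neg_entropy, neg_entropy_grad, 0.3, 0.6))
    print(binary_kl(0.3, 0.6))
```

In this sketch the generator is passed in as a pair of callables, so swapping one convex function for another changes the divergence without touching the evaluation code.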

[1] P. Bartlett, M. Jordan, and J.D. McAuliffe, “Convexity, Classification, and Risk Bounds,” J. Am. Statistical Assoc., vol. 101, pp. 138-156, 2006.
[2] P. Bartlett and M. Traskin, “Adaboost is Consistent,” Proc. Neural Information Processing Systems Conf., 2006.
[3] M.J. Kearns and Y. Mansour, “On the Boosting Ability of Top-Down Decision Tree Learning Algorithms,” J. Computer and System Sciences, vol. 58, pp. 109-128, 1999.
[4] R.E. Schapire and Y. Singer, “Improved Boosting Algorithms Using Confidence-Rated Predictions,” Proc. Conf. Computational Learning Theory, pp. 80-91, 1998.
[5] J. Friedman, T. Hastie, and R. Tibshirani, “Additive Logistic Regression: A Statistical View of Boosting,” Annals of Statistics, vol. 28, pp. 337-374, 2000.
[6] V. Vapnik, Statistical Learning Theory. John Wiley, 1998.
[7] N. Murata, T. Takenouchi, T. Kanamori, and S. Eguchi, “Information Geometry of U-Boost and Bregman Divergence,” Neural Computation, vol. 16, pp. 1437-1481, 2004.
[8] P. Grünwald and P. Dawid, “Game Theory, Maximum Entropy, Minimum Discrepancy and Robust Bayesian Decision Theory,” Annals of Statistics, vol. 32, pp. 1367-1433, 2004.
[9] M. Collins, R. Schapire, and Y. Singer, “Logistic Regression, Adaboost and Bregman Distances,” Proc. Conf. Computational Learning Theory, pp. 158-169, 2000.
[10] R.E. Schapire and Y. Singer, “Improved Boosting Algorithms Using Confidence-Rated Predictions,” Machine Learning, vol. 37, pp. 297-336, 1999.
[11] A. Azran and R. Meir, “Data Dependent Risk Bounds for Hierarchical Mixture of Experts Classifiers,” Proc. Conf. Computational Learning Theory, pp. 427-441, 2004.
[12] A. Banerjee, X. Guo, and H. Wang, “On the Optimality of Conditional Expectation As a Bregman Predictor,” IEEE Trans. Information Theory, vol. 51, pp. 2664-2669, 2005.
[13] C. Gentile and M. Warmuth, “Linear Hinge Loss and Average Margin,” Proc. 1998 Conf. Advances in Neural Information Processing Systems, pp. 225-231, 1998.
[14] D. Helmbold, J. Kivinen, and M. Warmuth, “Relative Loss Bounds for Single Neurons,” IEEE Trans. Neural Networks, vol. 10, no. 6, pp. 1291-1304, Nov. 1999.
[15] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, “Clustering with Bregman Divergences,” J. Machine Learning Research, vol. 6, no. 6, pp. 1705-1749, Nov. 2005.
[16] R. Nock and F. Nielsen, “A ℝeal Generalization of Discrete AdaBoost,” Artificial Intelligence, vol. 171, pp. 25-41, 2007.
[17] R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee, “Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods,” Annals of Statistics, vol. 26, pp. 1651-1686, 1998.
[18] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. Wadsworth, 1984.
[19] J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[20] K. Matsushita, “Decision Rule, Based on Distance, for the Classification Problem,” Annals of the Inst. of Statistical Math., vol. 8, pp. 67-77, 1956.
[21] M. Warmuth, J. Liao, and G. Rätsch, “Totally Corrective Boosting Algorithms that Maximize the Margin,” Proc. Int'l Conf. Machine Learning, pp. 1001-1008, 2006.
[22] J. Kivinen and M. Warmuth, “Boosting As Entropy Projection,” Proc. Conf. Computational Learning Theory, pp. 134-144, 1999.
[23] R. Nock and F. Nielsen, “On Domain-Partitioning Induction Criteria: Worst-Case Bounds for the Worst-Case Based,” Theoretical Computer Science, vol. 321, pp. 371-382, 2004.
[24] C. Henry, R. Nock, and F. Nielsen, “ℝeal Boosting a la Carte with an Application to Boosting Oblique Decision Trees,” Proc. 21st Int'l Joint Conf. Artificial Intelligence, pp. 842-847, 2007.
[25] C.L. Blake, E. Keogh, and C.J. Merz, “UCI Repository of Machine Learning Databases,” http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.

Index Terms:
Ensemble learning, boosting, Bregman divergences, linear separators, decision trees.
Citation:
Richard Nock, Frank Nielsen, "Bregman Divergences and Surrogates for Learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 2048-2059, Nov. 2009, doi:10.1109/TPAMI.2008.225