Optimal Partitioning for Classification and Regression Trees
April 1991 (vol. 13, no. 4), pp. 340-354

An iterative algorithm is presented that finds a locally optimal partition for an arbitrary loss function, in time linear in N for each iteration. The algorithm is a K-means-like clustering algorithm that uses as its distance measure a generalization of Kullback's information divergence. Moreover, it is proven that the globally optimal partition must satisfy a nearest neighbor condition using divergence as the distance measure. These results generalize similar results of L. Breiman et al. (1984) to an arbitrary number of classes or regression variables and to an arbitrary number of bins. Experimental results on a text-to-speech example are provided, and additional applications of the algorithm, including the design of variable combinations, surrogate splits, composite nodes, and decision graphs, are suggested.
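The K-means-like iteration described in the abstract can be sketched in a few lines. The sketch below is an illustrative reconstruction, not the paper's code: the function name `partition`, the contiguous-block initialization, and the toy data shapes are all assumptions. It alternates the two steps the abstract names for the classification case: compute each bin's centroid as the weighted average of its members' class distributions, then reassign each feature value to its divergence-nearest centroid.

```python
import math

def kl(p, q):
    """Kullback divergence D(p || q); infinite if q lacks mass where p has it."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return float("inf")
            d += pi * math.log(pi / qi)
    return d

def partition(class_dists, weights, k, iters=50):
    """Group feature values into k bins, K-means style, under divergence.

    class_dists[x] is p(class | X = x) and weights[x] is p(X = x).
    Each bin's centroid is the weighted average distribution of its
    members; each value is then reassigned to the nearest centroid,
    with divergence as the distance measure.
    """
    n, dim = len(class_dists), len(class_dists[0])
    assign = [x * k // n for x in range(n)]          # contiguous initial bins
    for _ in range(iters):
        cents = []
        for b in range(k):
            members = [x for x in range(n) if assign[x] == b]
            w = sum(weights[x] for x in members)
            if w > 0.0:
                cents.append([sum(weights[x] * class_dists[x][j]
                                  for x in members) / w for j in range(dim)])
            else:
                cents.append([0.0] * dim)            # empty bin: never nearest
        new = [min(range(k), key=lambda b: kl(class_dists[x], cents[b]))
               for x in range(n)]
        if new == assign:                            # locally optimal partition
            return assign
        assign = new
    return assign
```

Each iteration touches every feature value once per bin, giving the per-iteration cost linear in the number of values; convergence to a local (not global) optimum follows because both steps can only decrease the total divergence.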

[1] J. E. G. Henrichon and K. S. Fu, "A nonparametric partitioning procedure for pattern classification," IEEE Trans. Comput., vol. C-18, pp. 604-624, May 1969.
[2] W. S. Meisel and D. A. Michalopoulos, "A partitioning algorithm with application in pattern classification and the optimization of decision trees," IEEE Trans. Comput., vol. C-22, pp. 93-103, Jan. 1973.
[3] H. J. Payne and W. S. Meisel, "An algorithm for constructing optimal binary decision trees," IEEE Trans. Comput., vol. C-26, pp. 905-916, Sept. 1977.
[4] P. H. Swain and H. Hauska, "The decision tree classifier: Design and potential," IEEE Trans. Geosci. Electron., vol. GE-15, pp. 142-147, July 1977.
[5] I. K. Sethi and B. Chatterjee, "Efficient decision tree design for discrete variable pattern recognition problems," Pattern Recog., vol. 9, pp. 197-206, 1977.
[6] I. K. Sethi and G. P. R. Sarvarayudu, "Hierarchical classifier design using mutual information," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-4, pp. 441-445, July 1982.
[7] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees (The Wadsworth Statistics/Probability Series). Belmont, CA: Wadsworth, 1984.
[8] J. M. Lucassen and R. L. Mercer, "An information theoretic approach to the automatic determination of phonemic baseforms," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, San Diego, CA, IEEE, Mar. 1984, pp. 42.5.1-42.5.4.
[9] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "A tree-based statistical language model for natural language speech recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 1001-1008, July 1989.
[10] P. A. Chou, "Applications of information theory to pattern recognition and the design of decision trees and trellises," Ph.D. dissertation, Stanford Univ., Stanford, CA, June 1988.
[11] Y. Brandman, "Spectral lower-bound techniques for logic circuits," Comput. Syst. Lab., Stanford, CA, Tech. Rep. CSL-TR-87-325, Mar. 1987.
[12] R. W. Payne and D. A. Preece, "Identification keys and diagnostic tables: A review," J. Roy. Stat. Soc. A, vol. 143, pp. 253-292, 1980.
[13] E. B. Hunt, J. Marin, and P. T. Stone, Experiments in Induction. New York: Academic, 1966.
[14] J. R. Quinlan, "Induction over large data bases," Heuristic Programming Project, Stanford Univ., Tech. Rep. HPP-79-14, 1979.
[15] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
[16] J. R. Quinlan, "The effect of noise on concept learning," in Machine Learning: An Artificial Intelligence Approach, vol. II, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Eds. Los Altos, CA: Kaufmann, 1986, ch. 6, pp. 149-166.
[17] J. Cheng, U. M. Fayyad, K. B. Irani, and Z. Qian, "Improved decision trees: A generalized version of ID3," in Proc. Fifth Int. Conf. Machine Learning, Ann Arbor, MI, June 1988, pp. 100-107.
[18] J. R. Quinlan and R. L. Rivest, "Inferring decision trees using the minimum description length principle," Inform. Computat., vol. 80, pp. 227-248, 1989.
[19] P. Clark and T. Niblett, "The CN2 induction algorithm," Machine Learning, vol. 3, pp. 261-283, 1989.
[20] J. Mingers, "Empirical comparison of selection measures for decision tree induction," Machine Learning, vol. 3, pp. 319-342, 1989.
[21] M. Montalbano, "Tables, flow charts, and program logic," IBM Syst. J., pp. 51-63, Sept. 1962.
[22] J. Egler, "A procedure for converting logic table conditions into an efficient sequence of test instructions," Commun. ACM, vol. 6, pp. 510-514, Sept. 1963.
[23] S. L. Pollack, "Conversion of limited-entry decision tables to computer programs," Commun. ACM, vol. 8, pp. 677-682, Nov. 1965.
[24] L. T. Reinwald and R. M. Soland, "Conversion of limited-entry decision tables to optimal computer programs II: Minimum storage requirement," J. ACM, vol. 14, pp. 742-755, Oct. 1967.
[25] D. E. Knuth, "Optimum binary search trees," Acta Inform., vol. 1, pp. 14-25, 1971.
[26] K. Shwayder, "Conversion of limited-entry decision tables to computer programs: A proposed modification to Pollack's algorithm," Commun. ACM, vol. 14, pp. 69-73, Feb. 1971.
[27] A. Bayes, "A dynamic programming algorithm to optimise decision table code," Australian Comput. J., vol. 5, pp. 77-79, May 1973.
[28] S. Ganapathy and V. Rajaraman, "Information theory applied to the conversion of decision tables to computer programs," Commun. ACM, vol. 16, pp. 532-539, Sept. 1973.
[29] H. Schumacher and K. C. Sevcik, "The synthetic approach to decision table conversion," Commun. ACM, vol. 19, pp. 343-351, June 1976.
[30] A. Martelli and U. Montanari, "Optimizing decision trees through heuristically guided search," Commun. ACM, vol. 21, pp. 1025-1039, Dec. 1978.
[31] C. R. P. Hartmann, P. K. Varshney, K. G. Mehrotra, and C. L. Gerberich, "Application of information theory to the construction of efficient decision trees," IEEE Trans. Inform. Theory, vol. IT-28, pp. 565-577, July 1982.
[32] J. A. Morgan and J. A. Sonquist, "Problems in the analysis of survey data, and a proposal," J. Amer. Statist. Assoc., vol. 58, pp. 415-434, 1963.
[33] A. Fielding, "Binary segmentation: The automatic interaction detector and related techniques for exploring data structure," in Exploring Data Structures, vol. I, C. A. O'Muircheartaigh and C. Payne, Eds. London: Wiley, 1977, ch. 8, pp. 221-257.
[34] A. Buzo, A. H. Gray, Jr., R. M. Gray, and J. D. Markel, "Speech coding based upon vector quantization," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, pp. 562-574, Oct. 1980.
[35] D. Y. Wong, B. H. Juang, and A. H. Gray, Jr., "An 800 bit/s vector quantization LPC vocoder," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-30, pp. 770-780, Oct. 1982.
[36] P. A. Chou, T. Lookabaugh, and R. M. Gray, "Optimal pruning with applications to tree structured source coding and modeling," IEEE Trans. Inform. Theory, vol. 35, pp. 299-315, Mar. 1989.
[37] L. Hyafil and R. L. Rivest, "Constructing optimal binary decision trees is NP-complete," Inform. Processing Lett., vol. 5, pp. 15-17, May 1976.
[38] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA: Freeman, 1979.
[39] W. D. Fisher, "On grouping for maximum homogeneity," J. Amer. Statist. Assoc., vol. 53, pp. 789-798, Dec. 1958.
[40] G. H. Ball and D. J. Hall, "A clustering technique for summarizing multivariate data," Behavioral Sci., vol. 12, pp. 153-155, Mar. 1967.
[41] E. W. Forgey, "Cluster analysis of multivariate data: Efficiency versus interpretability of classifications," Biometrics, vol. 21, no. 3, p. 768, 1965.
[42] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, vol. 1. Berkeley, CA: University of California Press, 1967, pp. 281-297.
[43] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[44] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[45] S. P. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inform. Theory, vol. IT-28, pp. 129-136, Mar. 1982; previously an unpublished Bell Laboratories Tech. Note, 1957.
[46] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. COM-28, pp. 84-95, Jan. 1980.
[47] D. Burshtein, V. D. Pietra, D. Kanevsky, and A. Nádas, "A splitting theorem for tree construction," IBM, Yorktown Heights, NY, Tech. Rep. RC 14754 (#66136), July 1989.
[48] D. Burshtein, V. D. Pietra, D. Kanevsky, and A. Nádas, "Minimum impurity partitions," Ann. Stat., Aug. 1989, submitted for publication.
[49] P. Chou, "Using decision trees for noiseless compression," in Proc. Int. Symp. Inform. Theory, IEEE, San Diego, CA, Jan. 1990, abstract only.
[50] S. Kullback, Information Theory and Statistics. New York: Wiley, 1959; republished by Dover, 1968.
[51] B. Efron, "Regression and ANOVA with zero-one data: Measures of residual variation," J. Amer. Statist. Assoc., vol. 73, pp. 113-121, Mar. 1978.
[52] T. M. Cover, "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition," IEEE Trans. Electron. Comput., vol. EC-14, pp. 326-334, 1965.
[53] T. J. Sejnowski and C. R. Rosenberg, "Parallel networks that learn to pronounce English text," Complex Syst., vol. 1, pp. 144-168, 1987.
[54] R. A. Becker, J. M. Chambers, and A. R. Wilks, The New S Language. Pacific Grove, CA: Wadsworth & Brooks, 1988.
[55] M. H. Becker, L. A. Clark, and D. Pregibon, "Tree-based models," in Statistical Software in S. Pacific Grove, CA: Wadsworth, 1989.
[56] R. M. Gray, "Applications of information theory to pattern recognition and the design of decision tree classifiers," proposal to NSF Division of Information Science and Technology, IST-8509860, Dec. 1985.
[57] S. M. Weiss, R. S. Galen, and P. V. Tadepalli, "Optimizing the predictive value of diagnostic decision rules," in Proc. Nat. Conf. Artificial Intelligence, AAAI, Seattle, WA, 1987, pp. 521-526.
[58] P. A. Chou, T. Lookabaugh, and R. M. Gray, "Entropy-constrained vector quantization," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 31-42, Jan. 1989.
[59] W. Equitz, "Fast algorithms for vector quantization picture coding," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, IEEE, Dallas, TX, Apr. 1987, pp. 18.1.1-18.1.4.
[60] J. E. Shore and R. M. Gray, "Minimum cross-entropy pattern classification and cluster analysis," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-4, pp. 11-17, Jan. 1982.
[61] R. M. Gray, A. H. Gray, Jr., G. Rebolledo, and J. E. Shore, "Rate-distortion speech coding with a minimum discrimination information distortion measure," IEEE Trans. Inform. Theory, vol. IT-27, pp. 708-721, Nov. 1981.
[62] R. M. Gray, A. Buzo, A. H. Gray, and Y. Matsuyama, "Distortion measures for speech processing," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, pp. 367-376, Aug. 1980.
[63] J. D. Markel and A. H. Gray, Linear Prediction of Speech (Communication and Cybernetics). New York: Springer-Verlag, 1976.
[64] A. Nádas, "On Turing's formula for word probabilities," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, pp. 1414-1416, Dec. 1985.
[65] J. Rissanen, "Complexity of strings in the class of Markov processes," IEEE Trans. Inform. Theory, vol. IT-32, pp. 526-532, July 1986.

Index Terms:
speech recognition; partitioning; regression trees; iterative algorithm; clustering algorithm; Kullback's information divergence; text-to-speech; surrogate splits; composite nodes; decision graphs; decision theory; iterative methods; trees (mathematics)
P.A. Chou, "Optimal Partitioning for Classification and Regression Trees," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 4, pp. 340-354, April 1991, doi:10.1109/34.88569