Data Categorization Using Decision Trellises
September/October 1999 (vol. 11, no. 5)
pp. 697-712

Abstract—We introduce a probabilistic graphical model for supervised learning on databases with categorical attributes. The proposed belief network contains hidden variables that play a role similar to nodes in decision trees, and each of their states corresponds either to a class label or to a single attribute test. Unlike decision trees, however, the attribute to be tested is selected probabilistically, so the model can be used to assess the probability that a tuple belongs to a given class, conditioned on the predictive attributes. Unfolding the network along the hidden-state dimension yields a trellis structure whose signal flow resembles that of second-order connectionist networks. The network encodes context-specific probabilistic independencies to reduce parametric complexity. We present a custom-tailored inference algorithm and derive a learning procedure based on the expectation-maximization (EM) algorithm. We propose decision trellises as an alternative to decision trees for tuple categorization in databases, an important step in building data mining systems. We report preliminary experiments on standard machine learning databases, comparing the classification accuracy of decision trellises with that of decision trees induced by C4.5. In particular, we show that the proposed model offers significant advantages for sparse databases in which many predictive attributes are missing.
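To make the notion of "probabilistic attribute selection" concrete, the following minimal Python sketch shows a one-level soft decision: a hidden selector variable chooses which categorical attribute to test, class probabilities are averaged over the possible tests, and attributes with missing values are marginalized out. This is an illustrative assumption made for this summary only; it is not the paper's trellis architecture, its inference algorithm, or its EM training procedure, and all names and parameters (select_prior, class_given_value, attr_prior) are hypothetical.

```python
import numpy as np

# Illustrative sketch only: a single hidden selector S chooses which categorical
# attribute to test; class probabilities are averaged over the possible tests.
# NOT the decision-trellis inference or EM procedure described in the paper.

rng = np.random.default_rng(0)
n_attrs, n_values, n_classes = 3, 4, 2

# P(S = a): probability of testing attribute a (probabilistic attribute selection).
select_prior = rng.dirichlet(np.ones(n_attrs))

# P(C = c | S = a, X_a = v): class distribution given the outcome of the selected test.
class_given_value = rng.dirichlet(np.ones(n_classes), size=(n_attrs, n_values))

# P(X_a = v): marginals used to integrate out attributes whose values are missing.
attr_prior = rng.dirichlet(np.ones(n_values), size=n_attrs)

def class_posterior(x):
    """P(C | x) for a tuple x, where x[a] is the categorical value of attribute a
    (an int in 0..n_values-1) or None if that attribute is missing."""
    post = np.zeros(n_classes)
    for a in range(n_attrs):
        if x[a] is None:
            # Missing attribute: sum over its possible values.
            contrib = attr_prior[a] @ class_given_value[a]
        else:
            contrib = class_given_value[a, x[a]]
        post += select_prior[a] * contrib
    return post  # sums to 1, since each mixture component is a proper distribution

print(class_posterior([2, None, 0]))
```

Because every hidden state contributes a proper class distribution, tuples with many missing attributes still yield a well-defined posterior, which is the behavior the abstract highlights for sparse databases.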

Index Terms:
Belief networks, classification, connectionist models, context-specific independence, data mining, decision trees, machine learning.
Citation:
Paolo Frasconi, Marco Gori, Giovanni Soda, "Data Categorization Using Decision Trellises," IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 5, pp. 697-712, Sept.-Oct. 1999, doi:10.1109/69.806931