Algorithms for Finding Attribute Value Group for Binary Segmentation of Categorical Databases
November/December 2002 (vol. 14 no. 6)
pp. 1269-1279

Abstract—We consider the problem of finding a set of attribute values that gives a high-quality binary segmentation of a database. The quality of a segmentation is defined by an objective function suited to the user's goal, such as "mean squared error," "mutual information," or "χ²," each of which is defined in terms of the distribution of a given target attribute. Our aim is to find value groups on a given conditional domain that split the database into two segments, optimizing the value of the objective function. Although the problem is intractable for general objective functions, feasible algorithms exist for finding high-quality binary segmentations when the objective function is convex, and we prove that the typical criteria mentioned above are all convex. We propose two practical algorithms, based on computational geometry techniques, which find much better value groups than conventional heuristics do.
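Convexity of the criterion is what makes the search feasible. For the special case of a binary (two-class) target, a classical result from Breiman et al. [4] illustrates why: when the impurity criterion is convex, the optimal two-way grouping of categorical values is always a prefix split after sorting the values by their target rate, so only linearly many candidate groups need be examined instead of all 2^k subsets. The sketch below shows that special case only (the paper's geometric algorithms handle more general targets); the function names and the Gini impurity are illustrative choices, not taken from the paper.

```python
from collections import defaultdict

def best_binary_split(pairs, impurity):
    """Find the best two-way grouping of categorical values.

    pairs    : iterable of (value, label) with label in {0, 1}
    impurity : convex criterion on (n1, n) = (positives, total);
               lower is better.

    For a binary target and a convex impurity, the optimal group is a
    prefix of the values sorted by positive rate (Breiman et al. [4]).
    """
    pos, cnt = defaultdict(int), defaultdict(int)
    for v, y in pairs:
        cnt[v] += 1
        pos[v] += y
    # Sort attribute values by their target rate.
    order = sorted(cnt, key=lambda v: pos[v] / cnt[v])
    n1, n = sum(pos.values()), sum(cnt.values())
    best, best_group = float("inf"), None
    l1 = l = 0
    # Scan proper, nonempty prefix splits only.
    for i, v in enumerate(order[:-1]):
        l1 += pos[v]
        l += cnt[v]
        # Size-weighted impurity of the two segments.
        score = (l / n) * impurity(l1, l) \
              + ((n - l) / n) * impurity(n1 - l1, n - l)
        if score < best:
            best, best_group = score, set(order[: i + 1])
    return best_group, best

def gini(n1, n):
    """Gini impurity, a convex criterion: 2p(1-p) with p = n1/n."""
    p = n1 / n
    return 2 * p * (1 - p)
```

For example, with values whose positive rates are 0.0 ("blue"), 0.5 ("green"), and 1.0 ("red"), only the two contiguous splits {blue} | {green, red} and {blue, green} | {red} are candidates; the sort guarantees no non-contiguous group can do better under a convex criterion.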

[1] T. Asano and T. Tokuyama, “Topological Walk Revisited,” Proc. Sixth CCCG, 1994.
[2] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth, “Learnability and the Vapnik-Chervonenkis Dimension,” J. ACM, vol. 36, pp. 929-965, 1989.
[3] L. Breiman, “Bagging Predictors,” Machine Learning, vol. 24, pp. 123-140, 1996.
[4] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. Wadsworth, 1984.
[5] NASA Ames Research Center, “Introduction to IND Version 2.1,” GA23-2475-02, 1992.
[6] C.K. Cowan, “Model Based Synthesis of Sensor Location,” Proc. 1988 IEEE Int'l Conf. Robotics and Automation, pp. 900-905, 1988.
[7] Y. Freund and R.E. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” J. Computer and System Sciences, vol. 55, pp. 119-139, 1997.
[8] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, “Constructing Efficient Decision Trees by Using Optimized Numeric Association Rules,” Proc. 22nd Int'l Conf. Very Large Databases, Dec. 1996.
[9] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, “Data Mining Using Two-Dimensional Optimized Association Rules: Scheme, Algorithms, and Visualization,” Proc. 1996 ACM-SIGMOD Int'l Conf. Management of Data, pp. 13-23, June 1996.
[10] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, “Finding Optimal Intervals Using Computational Geometry,” Proc. Int'l Symp. Algorithm and Computing '96, pp. 55-64, 1996.
[11] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, “Mining Optimized Association Rules for Numeric Attributes,” Proc. 1996 ACM Symp. Principles of Database Systems, pp. 182-191, 1996.
[12] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, “Mining Optimized Association Rules for Numeric Attributes,” J. Computer and System Sciences, 1999.
[13] J. Gehrke, R. Ramakrishnan, and V. Ganti, “RainForest—A Framework for Fast Decision Tree Construction of Large Datasets,” Proc. 24th Int'l Conf. Very Large Data Bases, pp. 416-427, 1998.
[14] S. Hasegawa, H. Imai, and M. Ishiguro, “ε-Approximations of k-Label Spaces,” Theoretical Computer Science, vol. 137, pp. 145-157, 1995.
[15] D. Haussler and E. Welzl, “Epsilon-Nets and Simplex Range Queries,” Discrete and Computational Geometry, vol. 2, pp. 127-151, 1987.
[16] M. Mehta, R. Agrawal, and J. Rissanen, “SLIQ: A Fast Scalable Classifier for Data Mining,” Proc. Fifth Int'l Conf. Extending Database Technology, pp. 18-32, 1996.
[17] Y. Morimoto, T. Fukuda, S. Morishita, and T. Tokuyama, “Implementation and Evaluation of Decision Trees with Range and Region Splitting,” Constraints, vol. 2, no. 3/4, pp. 401-427, Dec. 1997.
[18] Y. Morimoto, H. Ishii, and S. Morishita, “Efficient Construction of Regression Trees with Range and Region Splitting,” Proc. 23rd Int'l Conf. Very Large Databases, pp. 166-175, 1997.
[19] P.M. Murphy and M.J. Pazzani, “Id2-of-3: Constructive Induction of m-of-n Concepts for Discriminators in Decision Trees,” Proc. Eighth Int'l Workshop Machine Learning, pp. 183-187, 1991.
[20] Z. Pawlak, J.W. Grzymala-Busse, R. Slowinski, and W. Ziarko, “Rough Sets,” Comm. ACM, vol. 38, no. 11, pp. 89–95, 1995.
[21] F.P. Preparata and M.I. Shamos, Computational Geometry. Springer-Verlag, 1985.
[22] J.R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, Calif.: Morgan Kaufmann, 1992.
[23] J.R. Quinlan, “Bagging, Boosting, and C4.5,” Proc. 13th Nat'l Conf. Artificial Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conf., pp. 725-730, 1996.
[24] R. Rastogi and K. Shim, “PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning,” Proc. 24th Int'l Conf. Very Large Data Bases, pp. 404-415, 1998.
[25] J. Shafer, R. Agrawal, and M. Mehta, “SPRINT: A Scalable Parallel Classifier for Data Mining,” Proc. 22nd Int'l Conf. Very Large Databases, Sept. 1996.
[26] K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, “Computing Optimized Rectilinear Regions for Association Rules,” Proc. KDD'97, pp. 96-103, Aug. 1997.

Index Terms:
Value groups, binary segmentation, categorical test, decision tree, data reduction, data mining.
Citation:
Yasuhiko Morimoto, Takeshi Fukuda, Takeshi Tokuyama, "Algorithms for Finding Attribute Value Group for Binary Segmentation of Categorical Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 6, pp. 1269-1279, Nov.-Dec. 2002, doi:10.1109/TKDE.2002.1047767