A Discretization Algorithm Based on a Heterogeneity Criterion
September 2005 (vol. 17 no. 9)
pp. 1166-1173
Xiaoyan Liu and Huaiqing Wang
Discretization, as a preprocessing step for data mining, is the process of converting the continuous attributes of a data set into discrete ones so that they can be treated as nominal features by machine learning algorithms. The various discretization methods that use entropy-based criteria form a large class of algorithms. However, as a measure of class homogeneity, entropy cannot always accurately reflect the degree of class homogeneity of an interval. Therefore, in this paper, we propose a new measure of the class heterogeneity of intervals from the viewpoint of the class probabilities themselves. Based on this definition of heterogeneity, we present a new criterion for evaluating a discretization scheme and analyze its properties theoretically. A heuristic method is also proposed to find an approximately optimal discretization scheme. Finally, our method is compared, in terms of predictive error rate and tree size, with Ent-MDLC, a representative entropy-based discretization method well known for its good performance. Our method is shown to produce better results than Ent-MDLC, although the improvement is not significant; it can be a good alternative to entropy-based discretization methods.
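For readers unfamiliar with supervised discretization, the sketch below illustrates the general top-down, impurity-driven approach that entropy-based methods such as Ent-MDLC follow: candidate cut points are evaluated with a class-impurity score and the best ones are applied greedily. This is not the heterogeneity criterion or heuristic proposed in the paper; class entropy is used only as a stand-in score, and the function names (impurity, best_cut, discretize) and the max_intervals parameter are illustrative assumptions.

```python
# Minimal sketch (not the paper's algorithm) of supervised, top-down
# discretization of one continuous attribute: greedily pick the cut point
# that most reduces a class-impurity score (class entropy here).
from collections import Counter
from math import log2

def impurity(labels):
    """Class entropy of a list of class labels (0 for an empty interval)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (cut, weighted_impurity) for the best binary split, or None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:                      # cuts only between distinct values
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if best is None or score < best[1]:
            best = ((lo + hi) / 2.0, score)
    return best

def discretize(values, labels, max_intervals=4):
    """Greedy top-down discretization; returns sorted cut points."""
    cuts = []
    intervals = [list(zip(values, labels))]
    while len(intervals) < max_intervals:
        # split the interval whose best cut yields the lowest weighted impurity
        candidates = []
        for k, iv in enumerate(intervals):
            c = best_cut([v for v, _ in iv], [y for _, y in iv])
            if c is not None:
                candidates.append((c[1], k, c[0]))
        if not candidates:
            break
        _, k, cut = min(candidates)
        iv = intervals.pop(k)
        intervals.insert(k, [p for p in iv if p[0] > cut])
        intervals.insert(k, [p for p in iv if p[0] <= cut])
        cuts.append(cut)
    return sorted(cuts)

if __name__ == "__main__":
    x = [1.0, 1.2, 2.8, 3.0, 3.1, 5.5, 5.9, 6.2]
    y = ["a", "a", "b", "b", "b", "a", "a", "a"]
    print(discretize(x, y, max_intervals=3))   # e.g. [2.0, 4.3]
```

The paper's method would, in effect, replace the entropy score with its heterogeneity measure and use its own heuristic search and stopping criterion in place of the fixed interval budget used here.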


Index Terms:
Data mining, data preparation, discretization, entropy, heterogeneity.
Citation:
Xiaoyan Liu, Huaiqing Wang, "A Discretization Algorithm Based on a Heterogeneity Criterion," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 9, pp. 1166-1173, Sept. 2005, doi:10.1109/TKDE.2005.135