Efficient C4.5
March/April 2002 (vol. 14 no. 2)
pp. 438-444

Abstract:
We present an analytic evaluation of the runtime behavior of the C4.5 algorithm, which highlights some efficiency improvements. Based on this evaluation, we have implemented a more efficient version of the algorithm, called EC4.5. EC4.5 improves on C4.5 by adopting the best of three strategies for computing the information gain of continuous attributes. All three strategies adopt a binary search for the threshold in the whole training set, starting from the local threshold computed at a node. The first strategy computes the local threshold using the algorithm of C4.5, which sorts cases by means of quicksort. The second strategy also uses the algorithm of C4.5, but adopts counting sort instead. The third strategy computes the local threshold using a main-memory version of the RainForest algorithm, which does not require sorting. Our implementation computes the same decision trees as C4.5, with a performance gain of up to five times.
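To make the threshold computation in the abstract concrete, the following is a minimal Python sketch, not taken from the paper: the function names local_threshold and refine_threshold, the midpoint cut points, and the toy data are all illustrative assumptions. It shows the two steps the abstract describes for one continuous attribute: computing a local threshold at a node by sorting the node's cases and scanning candidate cut points for the best information gain (the quicksort-based first strategy), then refining that threshold by binary search over the sorted values of the whole training set.

import math
from bisect import bisect_right
from collections import Counter

def entropy(counts, n):
    # Shannon entropy of a class distribution given as a count mapping over n cases.
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

def local_threshold(values, labels):
    # Local threshold at a node: sort the node's cases by attribute value
    # (C4.5 does this with quicksort), then scan the cut points between
    # distinct adjacent values for the one maximizing information gain.
    cases = sorted(zip(values, labels))
    n = len(cases)
    base = entropy(Counter(labels), n)          # class entropy before splitting
    left, right = Counter(), Counter(labels)    # class counts below / at-or-above the cut
    best_gain, best_cut = -1.0, None
    for i in range(n - 1):
        v, y = cases[i]
        left[y] += 1
        right[y] -= 1
        if v == cases[i + 1][0]:                # no cut point between equal values
            continue
        nl, nr = i + 1, n - i - 1
        gain = base - (nl / n) * entropy(left, nl) - (nr / n) * entropy(right, nr)
        if gain > best_gain:
            best_gain = gain
            best_cut = (v + cases[i + 1][0]) / 2  # midpoint cut: a simplification
    return best_cut, best_gain

def refine_threshold(sorted_all_values, cut):
    # Binary search of the whole (sorted) training set for the largest
    # attribute value not exceeding the local cut, so that the reported
    # threshold is a value actually occurring in the data.
    i = bisect_right(sorted_all_values, cut)
    return sorted_all_values[i - 1] if i > 0 else sorted_all_values[0]

# Toy usage: one continuous attribute with a binary class.
vals = [1.0, 2.0, 2.0, 3.5, 4.0, 5.5]
labs = ['a', 'a', 'a', 'b', 'b', 'b']
cut, gain = local_threshold(vals, labs)
print(cut, gain, refine_threshold(sorted(vals), cut))   # -> 2.75 1.0 2.0

Under these assumptions, the counting-sort and RainForest strategies would differ only in how the sorted order (or the per-class value histograms) consumed by local_threshold is obtained, which is where the paper's efficiency gains come from.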

Index Terms:
C4.5, decision trees, inductive learning, supervised learning, data mining
Citation:
S. Ruggieri, "Efficient C4.5," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 2, pp. 438-444, March-April 2002, doi:10.1109/69.991727