Issue No.01 - Jan. (2014 vol.26)
pp: 108-119
Leszek Rutkowski, Czestochowa University of Technology, Czestochowa
Maciej Jaworski, Czestochowa University of Technology, Czestochowa
Lena Pietruczuk, Czestochowa University of Technology, Czestochowa
Piotr Duda, Czestochowa University of Technology, Czestochowa
ABSTRACT
Since the Hoeffding tree algorithm was proposed in the literature, decision trees have become one of the most popular tools for mining data streams. The key step in constructing a decision tree is determining the best attribute on which to split the considered node. Several methods for solving this problem have been presented so far. However, they are either incorrectly justified mathematically (e.g., in the Hoeffding tree algorithm) or time-consuming (e.g., in the McDiarmid tree algorithm). In this paper, we propose a new method that significantly outperforms the McDiarmid tree algorithm and has a solid mathematical basis. Our method ensures, with a high probability set by the user, that the best attribute chosen in the considered node using a finite data sample is the same as it would be in the case of the whole data stream.
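The split decision described in the abstract can be illustrated with a minimal sketch. It assumes information gain as the split measure (as listed in the index terms) and treats the threshold `epsilon` as a value supplied by the caller; the bound derived in the paper is not reproduced here. The function names `entropy`, `information_gain`, and `choose_split` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning the sample on `attr`."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr]].append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def choose_split(rows, labels, attrs, epsilon):
    """Return the best attribute if its gain exceeds the runner-up's gain
    by more than `epsilon`; otherwise return None (keep collecting data).

    `epsilon` is an assumption of this sketch: it stands in for whatever
    statistical bound the chosen algorithm provides for the current sample
    size and confidence level. At least two attributes are required.
    """
    gains = sorted(((information_gain(rows, labels, a), a) for a in attrs),
                   reverse=True)
    (best_gain, best_attr), (second_gain, _) = gains[0], gains[1]
    return best_attr if best_gain - second_gain > epsilon else None


# Toy sample: attribute "a" determines the class, attribute "b" is noise.
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = [0, 0, 1, 1]
decision = choose_split(rows, labels, ["a", "b"], epsilon=0.5)
```

In a streaming setting the node would call `choose_split` again as more examples arrive, with `epsilon` shrinking as the sample grows, and would split only once a non-`None` attribute is returned.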
INDEX TERMS
Decision trees, Entropy, Training, Data mining, Impurities, Indexes, Random variables, Gaussian approximation, Data stream, decision trees, information gain
CITATION
Leszek Rutkowski, Maciej Jaworski, Lena Pietruczuk, Piotr Duda, "Decision Trees for Mining Data Streams Based on the Gaussian Approximation", IEEE Transactions on Knowledge & Data Engineering, vol.26, no. 1, pp. 108-119, Jan. 2014, doi:10.1109/TKDE.2013.34