Issue No.01 - January (2011 vol.23)
pp: 22-36
Hanady Abdulsalam, Kuwait University, Kuwait
David B. Skillicorn, Queen's University, Kingston
Patrick Martin, Queen's University, Kingston
ABSTRACT
We consider the problem of data stream classification, where the data arrive in a conceptually infinite stream, and the opportunity to examine each record is brief. We introduce a stream classification algorithm that is online, running in amortized O(1) time, able to handle intermittent arrival of labeled records, and able to adjust its parameters to respond to changing class boundaries (“concept drift”) in the data stream. In addition, when blocks of labeled data are short, the algorithm is able to judge internally whether the quality of models updated from them is good enough for deployment on unlabeled records, or whether further labeled records are required. Unlike most proposed stream-classification algorithms, multiple target classes can be handled. Experimental results on real and synthetic data show that accuracy is comparable to a conventional classification algorithm that sees all of the data at once and is able to make multiple passes over it.
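The stream setting described in the abstract can be illustrated with a minimal sketch. This is not the paper's streaming random forest: it swaps in a toy nearest-centroid learner, and every name and threshold (`ToyStreamClassifier`, `decay`, `window`, `min_accuracy`) is a hypothetical choice made for illustration. It only shows the general shape of the problem: labeled and unlabeled records arrive interleaved, each record costs a bounded amount of work that does not grow with the length of the stream, more than two classes are allowed, and the model answers unlabeled records only once an internal quality check passes.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Sequence


class ToyStreamClassifier:
    """Stand-in learner: one exponentially weighted centroid per class.

    NOT the paper's algorithm. All names and defaults are assumptions.
    """

    def __init__(self, decay: float = 0.1, window: int = 50,
                 min_accuracy: float = 0.8) -> None:
        self.decay = decay                # how fast centroids follow drift
        self.window = window              # labeled records used to judge quality
        self.min_accuracy = min_accuracy  # quality required before deployment
        self.centroids: Dict[str, List[float]] = {}
        # Outcomes of test-then-train checks on recent labeled records.
        self.recent_hits: Deque[bool] = deque(maxlen=window)

    def _nearest(self, x: Sequence[float]) -> Optional[str]:
        """Return the class whose centroid is closest to x, if any exist."""
        if not self.centroids:
            return None

        def sq_dist(label: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))

        return min(self.centroids, key=sq_dist)

    def ready(self) -> bool:
        """Internal quality check: accurate enough on recent labeled data?"""
        if len(self.recent_hits) < self.window:
            return False  # too few labeled records seen so far
        return sum(self.recent_hits) / len(self.recent_hits) >= self.min_accuracy

    def observe(self, x: Sequence[float],
                label: Optional[str] = None) -> Optional[str]:
        """Process one record; work per call is independent of stream length."""
        guess = self._nearest(x)
        if label is None:
            # Unlabeled record: answer only if the model passed its own check.
            return guess if self.ready() else None
        # Labeled record: test first, then train on it.
        self.recent_hits.append(guess == label)
        old = self.centroids.get(label)
        if old is None:
            self.centroids[label] = list(x)
        else:
            # Exponential weighting lets the centroid track moving boundaries.
            self.centroids[label] = [
                (1 - self.decay) * a + self.decay * b for a, b in zip(old, x)
            ]
        return None
```

A call such as `clf.observe([8, 9])` returns `None` until enough labeled records have been seen and the windowed accuracy reaches `min_accuracy`; after that it returns a class label. The windowed accuracy check is only one plausible way to decide whether a model is fit for deployment and is not taken from the paper.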
INDEX TERMS
Data stream mining, data stream classification, decision tree ensembles, random forests.
CITATION
Hanady Abdulsalam, David B. Skillicorn, Patrick Martin, "Classification Using Streaming Random Forests", IEEE Transactions on Knowledge & Data Engineering, vol. 23, no. 1, pp. 22-36, January 2011, doi:10.1109/TKDE.2010.36