A Framework for On-Demand Classification of Evolving Data Streams
May 2006 (vol. 18 no. 5)
pp. 577-589
Current models of the classification problem do not effectively handle bursts of particular classes arriving at different times. In fact, the current model simply concentrates on methods for one-pass classification modeling of very large data sets. Our model treats data stream classification as a dynamic problem in which simultaneous training and test streams are used to classify data sets. This model reflects real-life situations effectively, since it is desirable to classify test streams in real time over an evolving training and test stream. The aim is to create a classification system in which the training model can adapt quickly to changes in the underlying data stream. To achieve this goal, we propose an on-demand classification process that dynamically selects the appropriate window of past training data to build the classifier. The empirical results indicate that the system maintains high classification accuracy on an evolving data stream while providing an efficient solution to the classification task.
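
The idea of selecting a window of past training data on demand can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract gives no implementation details, so the sketch assumes a plain nearest-centroid classifier over the most recent h labeled points in place of the microclusters and geometric time frame named in the index terms. The names centroids, predict, accuracy, select_horizon, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict

def centroids(window):
    """Mean feature vector per class over a window of (features, label) pairs."""
    sums, counts = {}, defaultdict(int)
    for x, y in window:
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(cents, x):
    """Label of the class centroid closest to x (Euclidean distance)."""
    return min(cents, key=lambda y: math.dist(cents[y], x))

def accuracy(cents, labeled):
    """Fraction of labeled points that the centroid model classifies correctly."""
    return sum(predict(cents, x) == y for x, y in labeled) / len(labeled)

def select_horizon(train, holdout, horizons):
    """Assumed selection rule: keep the horizon (number of most recent
    training points) whose model scores best on a recent labeled holdout.
    Ties go to the horizon listed first."""
    return max(horizons, key=lambda h: accuracy(centroids(train[-h:]), holdout))

# Toy evolving stream: the two classes swap positions partway through.
old = [([0.0], "a"), ([10.0], "b")] * 10
recent = [([10.0], "a"), ([0.0], "b")] * 3
train = old + recent
holdout = [([9.0], "a"), ([1.0], "b")]  # labeled points drawn after the change

best = select_horizon(train, holdout, horizons=[4, len(train)])
print(best)
```

In this toy stream the short window wins because the full history is dominated by the outdated concept. On a stable stream a longer window would typically be preferred, since it averages over more points.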

Index Terms:
Stream classification, geometric time frame, microclustering, nearest neighbor.
Citation:
Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu, "A Framework for On-Demand Classification of Evolving Data Streams," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 5, pp. 577-589, May 2006, doi:10.1109/TKDE.2006.69