Issue No.06 - June (2011 vol.23)
pp: 859-874
Jing Gao, University of Illinois at Urbana-Champaign, Urbana
Latifur Khan, University of Texas at Dallas, Richardson
Jiawei Han, University of Illinois at Urbana-Champaign, Urbana
Mohammad M. Masud, University of Texas at Dallas, Richardson
ABSTRACT
Most existing data stream classification techniques ignore one important aspect of stream data: the arrival of a novel class. We address this issue and propose a data stream classification technique that integrates a novel class detection mechanism into traditional classifiers, enabling automatic detection of novel classes before the true labels of the novel class instances arrive. The novel class detection problem becomes more challenging in the presence of concept-drift, when the underlying data distributions evolve in streams. To determine whether an instance belongs to a novel class, the classification model sometimes needs to wait for more test instances so that it can discover similarities among them. A maximum allowable wait time T_c is therefore imposed as a time constraint on classifying a test instance. Furthermore, most existing stream classification approaches assume that the true label of a data point can be accessed immediately after the data point is classified. In reality, a time delay T_l is involved in obtaining the true label of a data point, since manual labeling is time-consuming. We show how to make fast and correct classification decisions under these constraints and apply them to real benchmark data. Comparison with state-of-the-art stream classification techniques demonstrates the superiority of our approach.
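The two time constraints can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes time is counted in instance arrivals, and the names `DeferredBuffer` and `label_available` are hypothetical helpers introduced only for this illustration. The sketch shows a buffer that defers the decision on a test instance for at most T_c arrivals, plus a check for whether the true label of an instance has become available after a delay of T_l.

```python
from collections import deque


class DeferredBuffer:
    """Holds test instances whose decision is deferred for at most t_c arrivals.

    Hypothetical helper: time is assumed to be measured in instance arrivals.
    """

    def __init__(self, t_c):
        self.t_c = t_c
        self.pending = deque()  # (arrival_time, instance), oldest first

    def add(self, now, instance):
        """Defer the decision on an instance that arrived at time `now`."""
        self.pending.append((now, instance))

    def expired(self, now):
        """Pop and return instances that have waited t_c and must be classified now."""
        due = []
        while self.pending and now - self.pending[0][0] >= self.t_c:
            due.append(self.pending.popleft()[1])
        return due


def label_available(arrival_time, now, t_l):
    """True once the labeling delay t_l has elapsed for an instance."""
    return now - arrival_time >= t_l


if __name__ == "__main__":
    buf = DeferredBuffer(t_c=3)
    for t in range(6):
        buf.add(t, f"x{t}")
        print(t, buf.expired(t), label_available(0, t, t_l=5))
```

In this sketch an instance that arrives at time t is released for a forced decision at time t + T_c at the latest, while its true label can be used for training only from time t + T_l onward.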
INDEX TERMS
Data streams, concept-drift, novel class, ensemble classification, K-means clustering, k-nearest neighbor classification, silhouette coefficient.
CITATION
Jing Gao, Latifur Khan, Jiawei Han, Mohammad M. Masud, "Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints," IEEE Transactions on Knowledge & Data Engineering, vol. 23, no. 6, pp. 859-874, June 2011, doi:10.1109/TKDE.2010.61