Issue No. 05, May 2009 (vol. 21)
pp. 652-665
Hung-Leng Chen, National Taiwan University, Taipei
Ming-Syan Chen, National Taiwan University, Taipei
Su-Chen Lin, National Taiwan University, Taipei
ABSTRACT
Although the problem of clustering numerical time-evolving data is well explored, clustering categorical time-evolving data remains a challenging issue. In this paper, we propose a generalized clustering framework that utilizes existing clustering algorithms and adopts a sliding-window technique to detect whether a concept drift occurs in the incoming sliding window. The framework is composed of two algorithms: the Drifting Concept Detecting (DCD) algorithm, which detects changes in cluster distributions between the current sliding window and the last clustering result, and the Cluster Relationship Analysis (CRA) algorithm, which analyzes the relationship between clustering results at different times. In DCD, the concept is said to drift if a large number of outliers are found in the current sliding window, or if a large number of clusters vary in their ratio of data points. A sliding window in which drift is detected is re-clustered to capture the recent concept. In CRA, a visualization method is devised to facilitate the observation of the evolving clustering results. The framework is validated on real and synthetic data sets and is shown not only to detect drifting concepts accurately but also to attain clustering results of better quality.
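The two drift conditions named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' DCD algorithm: the abstract does not say how outliers are identified or how "a large number" is quantified. The function names, the `OUTLIER` label convention, and the three threshold parameters with their default values are therefore all assumptions made for illustration. The sketch takes as input the cluster labels of the last clustering result and the labels assigned to the points of the current window.

```python
from collections import Counter

# Assumed convention: points that fit no existing cluster are labeled None.
OUTLIER = None


def cluster_ratios(labels, cluster_ids):
    """Fraction of all points in `labels` that fall into each cluster id."""
    counts = Counter(label for label in labels if label is not OUTLIER)
    total = len(labels)
    return {cid: counts.get(cid, 0) / total for cid in cluster_ids}


def concept_drifts(prev_labels, curr_labels,
                   outlier_threshold=0.2,         # hypothetical value
                   ratio_change_threshold=0.1,    # hypothetical value
                   varied_cluster_threshold=0.5):  # hypothetical value
    """Return True if the current window appears to carry a drifted concept.

    Condition 1: the fraction of outliers in the current window is large.
    Condition 2: a large share of clusters changed their ratio of points.
    """
    if not prev_labels or not curr_labels:
        raise ValueError("both label sequences must be non-empty")

    # Condition 1: too many points of the window fit no existing cluster.
    outliers = sum(1 for label in curr_labels if label is OUTLIER)
    if outliers / len(curr_labels) > outlier_threshold:
        return True

    # Condition 2: compare per-cluster ratios against the last result.
    cluster_ids = {label for label in prev_labels if label is not OUTLIER}
    if not cluster_ids:
        return False
    prev = cluster_ratios(prev_labels, cluster_ids)
    curr = cluster_ratios(curr_labels, cluster_ids)
    varied = sum(1 for cid in cluster_ids
                 if abs(prev[cid] - curr[cid]) > ratio_change_threshold)
    return varied / len(cluster_ids) > varied_cluster_threshold


if __name__ == "__main__":
    last_result = ["a"] * 50 + ["b"] * 50
    stable_window = ["a"] * 48 + ["b"] * 47 + [OUTLIER] * 5
    shifted_window = ["a"] * 90 + ["b"] * 10
    print(concept_drifts(last_result, stable_window))
    print(concept_drifts(last_result, shifted_window))
```

In a full pipeline, a window flagged this way would be passed back to the underlying clustering algorithm for re-clustering, while an unflagged window would only update the existing clusters. How the update is performed is outside the scope of this sketch.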
INDEX TERMS
Clustering, classification, and association rules; Data mining; Mining methods and algorithms
CITATION
Hung-Leng Chen, Ming-Syan Chen, Su-Chen Lin, "Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data", IEEE Transactions on Knowledge & Data Engineering, vol.21, no. 5, pp. 652-665, May 2009, doi:10.1109/TKDE.2008.192