Adaptive Clustering for Multiple Evolving Streams
September 2006 (vol. 18 no. 9)
pp. 1166-1180
In a data stream environment, the patterns generated at different time instances differ because the data evolve. As time progresses, the behavior and the members of clusters usually change. Hence, clustering continuous data streams allows us to observe changes in group behavior. To support flexible clustering requirements, we devise in this paper a Clustering on Demand framework, abbreviated as the COD framework, to dynamically cluster multiple data streams. While providing a general framework for clustering multiple data streams, the COD framework has two advantageous features, namely, one data scan for online statistics collection and compact multiresolution approximations, which are designed to address, respectively, the time and the space constraints of a data stream environment. The COD framework consists of two phases: the online maintenance phase and the offline clustering phase. The online maintenance phase provides an efficient mechanism to maintain summary hierarchies of data streams at multiple resolutions, in time linear in both the number of streams and the number of data points in each stream. For the offline phase, an adaptive clustering algorithm is devised to retrieve approximations of the desired substreams from the summary hierarchies according to clustering queries. We propose two summarization techniques, based on wavelet and regression analyses, to construct the summary hierarchies. The regression-based summary hierarchy approximates the data stream more precisely and provides better clustering results than the wavelet-based one, at the cost of slightly longer running time and twice the storage space. An adaptive version of the COD framework is designed to select between a wavelet-based model and a regression-based model when building the summary hierarchy. With the adaptive COD, we can obtain clustering results of almost the same quality as the regression-based COD while using much less storage space for the summary hierarchy. As shown in the complexity analyses and validated by our empirical studies, the COD framework performs very efficiently in the data stream environment while producing clustering results of very high quality.
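
To make the idea of a one-scan, multiresolution summary hierarchy concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes plain pairwise averaging (the approximation part of a Haar-style wavelet decomposition) and a Euclidean distance computed on the coarse approximations. The names `HaarSummary`, `append`, `approximation`, and `distance` are hypothetical and chosen only for illustration. For simplicity the sketch keeps every level in memory; a compact summary would discard the fine levels as they age.

```python
from typing import List


class HaarSummary:
    """One-pass multiresolution summary of a single stream (illustrative only).

    Level 0 holds the raw values; level j holds averages over
    consecutive blocks of 2**j points.
    """

    def __init__(self) -> None:
        self.levels: List[List[float]] = [[]]

    def append(self, value: float) -> None:
        """Online step: add one point and propagate completed pairs upward."""
        self.levels[0].append(float(value))
        level = 0
        # Whenever a level reaches an even length, its last two entries
        # form a complete pair that can be averaged into the next level.
        while len(self.levels[level]) % 2 == 0:
            left, right = self.levels[level][-2:]
            if level + 1 == len(self.levels):
                self.levels.append([])
            self.levels[level + 1].append((left + right) / 2.0)
            level += 1

    def approximation(self, level: int) -> List[float]:
        """Offline step: read the stream at the requested resolution."""
        return list(self.levels[level])


def distance(a: HaarSummary, b: HaarSummary, level: int) -> float:
    """Euclidean distance between two streams at a given resolution (assumed metric)."""
    xa, xb = a.approximation(level), b.approximation(level)
    return sum((p - q) ** 2 for p, q in zip(xa, xb)) ** 0.5


# Two toy streams of eight points each, summarized in a single scan.
s1, s2 = HaarSummary(), HaarSummary()
for t in range(1, 9):
    s1.append(t)        # 1, 2, ..., 8
    s2.append(t + 1)    # 2, 3, ..., 9

coarse = s1.approximation(2)   # averages over blocks of four points
gap = distance(s1, s2, 2)      # distance measured on the coarse level
```

Each incoming point triggers at most one averaging step per level, and the number of averages ever computed is less than the number of points, so the total maintenance cost in this sketch is linear in the stream length. A clustering routine could then run on the coarse approximations of whichever substreams a query selects.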

Index Terms:
Data mining, clustering of multiple data streams, time-series clustering.
Citation:
Bi-Ru Dai, Jen-Wei Huang, Mi-Yen Yeh, Ming-Syan Chen, "Adaptive Clustering for Multiple Evolving Streams," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 9, pp. 1166-1180, Sept. 2006, doi:10.1109/TKDE.2006.137