Issue No. 05, May 2008 (vol. 20), pp. 615-627
ABSTRACT
This paper presents and analyzes an incremental system for clustering streaming time series. The Online Divisive-Agglomerative Clustering (ODAC) system continuously maintains a tree-like hierarchy of clusters that evolves with the data, using a top-down strategy. The splitting criterion is a correlation-based dissimilarity measure among time series: each node is split by its farthest pair of streams, which defines the diameter of the cluster. In stationary environments, expanding the structure leads to a decrease in the diameters of the clusters. To react to changes in the correlation structure between time series, the system uses a merge operator that agglomerates two sibling clusters. Both the split and merge operators are triggered in response to changes in the diameters of existing clusters. The system is designed to process thousands of data streams that flow at a high rate. Its main features are update time and memory consumption that do not depend on the number of examples in the stream; moreover, the time and memory required to process an example decrease whenever the cluster structure expands. Experimental results on artificial and real data assess the processing qualities of the system, suggesting competitive performance on clustering streaming time series and exploring its ability to deal with concept drift.
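To make the splitting criterion concrete, the following is a minimal Python sketch (not the authors' implementation; all class and method names are illustrative) of how a correlation-based dissimilarity and a cluster diameter can be maintained incrementally from per-pair sufficient statistics, assuming the dissimilarity takes the common form sqrt((1 - corr) / 2) derived from the Pearson correlation.

```python
# Illustrative sketch only: incremental correlation-based dissimilarity
# and cluster diameter from sufficient statistics of each stream pair.
import math
from itertools import combinations

class PairStats:
    """Sufficient statistics for one pair of streams (x, y)."""
    def __init__(self):
        self.n = 0
        self.sx = self.sy = 0.0    # running sums
        self.sxx = self.syy = 0.0  # running sums of squares
        self.sxy = 0.0             # running sum of products

    def update(self, x, y):
        self.n += 1
        self.sx += x;  self.sy += y
        self.sxx += x * x;  self.syy += y * y
        self.sxy += x * y

    def dissimilarity(self):
        """Distance in [0, 1]: sqrt((1 - corr) / 2), corr = Pearson correlation."""
        if self.n < 2:
            return 0.0
        cov = self.sxy - self.sx * self.sy / self.n
        vx = self.sxx - self.sx ** 2 / self.n
        vy = self.syy - self.sy ** 2 / self.n
        if vx <= 0.0 or vy <= 0.0:
            return 0.0
        corr = cov / math.sqrt(vx * vy)
        return math.sqrt((1.0 - corr) / 2.0)

class LeafCluster:
    """A leaf holding a set of streams; candidate for splitting by its diameter."""
    def __init__(self, stream_ids):
        self.streams = list(stream_ids)
        self.pairs = {p: PairStats() for p in combinations(self.streams, 2)}

    def observe(self, values):
        """values: dict mapping stream_id -> observation at the current time step."""
        for (a, b), stats in self.pairs.items():
            stats.update(values[a], values[b])

    def diameter(self):
        """Return the farthest pair and its dissimilarity; that pair defines a split."""
        return max(self.pairs.items(), key=lambda kv: kv[1].dissimilarity())
```

Because only sums, sums of squares, and sums of products are kept per pair of streams in a leaf, the memory and per-example update cost do not grow with the number of examples seen, which is consistent with the property stated in the abstract; when a leaf is split, each child tracks fewer pairs, so the cost of processing an example decreases as the structure expands.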
INDEX TERMS
Data mining, Clustering, Correlation and regression analysis, Industrial control, Real time
CITATION
Pedro Pereira Rodrigues, João Gama, and João Pedro Pedroso, "Hierarchical Clustering of Time-Series Data Streams," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 615-627, May 2008, doi:10.1109/TKDE.2007.190727.