Cluster Kernels: Resource-Aware Kernel Density Estimators over Streaming Data
July 2008 (vol. 20 no. 7)
pp. 880-893
A variety of real-world applications rely heavily on an adequate analysis of transient data streams. Due to the rigid processing requirements of data streams, common analysis techniques known from data mining are not directly applicable. A fundamental building block of many data mining and analysis approaches is density estimation. It provides a well-defined estimate of a continuous data distribution, which makes its adaptation to data streams desirable. A convenient method for density estimation uses kernels. The computational complexity of kernel density estimation, however, makes its direct application to data streams infeasible. In this paper, we tackle this problem and propose our Cluster Kernel approach, which provides continuously computed kernel density estimators over streaming data. Not only do Cluster Kernels meet the rigid processing requirements of data streams, they also allocate only a constant amount of memory, which can be adapted dynamically to changing system resources. For this purpose, we develop an intelligent merge scheme for Cluster Kernels and use continuously collected local statistics to resample already processed data. We focus on Cluster Kernels for one-dimensional data streams but also address the multidimensional case. We validate the efficacy of Cluster Kernels on a variety of real-world data streams in an extensive experimental study.
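
To make the constant-memory idea concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes a one-dimensional stream, a fixed bandwidth, and a Gaussian kernel. The names `Cluster`, `StreamingKDE`, `budget`, and `bandwidth` are hypothetical. Merging the two adjacent summaries with the smallest gap is a simple stand-in for the paper's merge scheme, and replacing each summary by one moment-matched Gaussian is a stand-in for resampling from local statistics.

```python
import math
from dataclasses import dataclass


@dataclass
class Cluster:
    """Local statistics for a group of already processed stream elements."""
    count: int    # number of elements summarized
    mean: float   # mean of those elements
    m2: float     # sum of squared deviations from the mean

    @property
    def variance(self) -> float:
        return self.m2 / self.count if self.count > 0 else 0.0


def merge(a: Cluster, b: Cluster) -> Cluster:
    """Combine two summaries exactly, without revisiting the raw data."""
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / n
    return Cluster(n, mean, m2)


class StreamingKDE:
    """Density estimate over a stream using at most `budget` summaries."""

    def __init__(self, budget: int, bandwidth: float):
        self.budget = budget          # assumed fixed memory limit
        self.bandwidth = bandwidth    # assumed fixed kernel bandwidth
        self.clusters = []            # kept sorted by mean

    def insert(self, x: float) -> None:
        self.clusters.append(Cluster(1, x, 0.0))
        self.clusters.sort(key=lambda c: c.mean)
        if len(self.clusters) > self.budget:
            # Simplified merge rule: fuse the adjacent pair with the
            # smallest distance between means.
            i = min(
                range(len(self.clusters) - 1),
                key=lambda j: self.clusters[j + 1].mean - self.clusters[j].mean,
            )
            self.clusters[i:i + 2] = [merge(self.clusters[i], self.clusters[i + 1])]

    def density(self, x: float) -> float:
        total = sum(c.count for c in self.clusters)
        if total == 0:
            return 0.0
        result = 0.0
        for c in self.clusters:
            # One Gaussian per summary; its variance adds the spread of the
            # summarized elements to the squared bandwidth.
            s = math.sqrt(self.bandwidth ** 2 + c.variance)
            z = (x - c.mean) / s
            result += c.count * math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))
        return result / total
```

With `budget` set to 20, for example, the estimator holds at most 20 summaries no matter how many elements arrive, and lowering `budget` at run time would only require additional merges. A plain kernel density estimator, by contrast, keeps every element and evaluates one kernel per element.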

Index Terms:
Nonparametric statistics, Statistical computing, Statistical databases, Data mining
Christoph Heinz, Bernhard Seeger, "Cluster Kernels: Resource-Aware Kernel Density Estimators over Streaming Data," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 7, pp. 880-893, July 2008, doi:10.1109/TKDE.2008.21