Clustering Data Streams: Theory and Practice
May/June 2003 (vol. 15 no. 3)
pp. 515-528

Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams.

[1] D. Achlioptas and F. McSherry, “Fast Computation of Low-Rank Approximations,” Proc. Ann. ACM Symp. Theory of Computing, pp. 611-618, 2001.
[2] P.K. Agarwal and C. Procopiuc, “Approximation Algorithms for Projective Clustering,” Proc. ACM Symp. Discrete Algorithms, pp. 538-547, 2000.
[3] R. Agrawal et al., "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 1998, pp. 94-105.
[4] N. Alon, Y. Matias, and M. Szegedy, “The Space Complexity of Approximating the Frequency Moments,” Proc. ACM Symp. Theory of Computing (STOC), pp. 20-29, 1996.
[5] M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander, “OPTICS: Ordering Points To Identify the Clustering Structure,” Proc. 1999 ACM Special Interest Group on Management of Data, pp. 49–60, 1999.
[6] S. Arora, P. Raghavan, and S. Rao, “Approximation Schemes for Euclidean $k$-Medians and Related Problems,” Proc. 30th ACM STOC, pp. 106-113, 1998.
[7] V. Arya, N. Garg, R. Khandekar, K. Munagala, and V. Pandit, “Local Search Heuristic for k-Median and Facility Location Problems,” Proc. Ann. ACM Symp. Theory of Computing, pp. 21-29, 2001.
[8] B. Babcock, M. Datar, and R. Motwani, “Sampling from a Moving Window over Streaming Data,” Proc. ACM Symp. Discrete Algorithms, 2002.
[9] Y. Bartal, M. Charikar, and D. Raz, “Approximating Min-Sum $k$-Clustering in Metric Spaces,” Proc. Ann. ACM Symp. Theory of Computing, 2001.
[10] A. Borodin, R. Ostrovsky, and Y. Rabani, “Subquadratic Approximation Algorithms for Clustering Problems in High Dimensional Spaces,” Proc. Ann. ACM Symp. Theory of Computing, pp. 435-444, May 1999.
[11] P.S. Bradley, U.M. Fayyad, and C. Reina, “Scaling Clustering Algorithms to Large Databases,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 9-15, 1998.
[12] M. Charikar, S. Chaudhuri, R. Motwani, and V.R. Narasayya, “Towards Estimation Error Guarantees for Distinct Values,” Proc. 19th Symp. Principles of Database Systems, pp. 268-279, 2000.
[13] M. Charikar, C. Chekuri, T. Feder, and R. Motwani, “Incremental Clustering and Dynamic Information Retrieval,” Proc. Ann. ACM Symp. Theory of Computing, pp. 626-635, 1997.
[14] M. Charikar and S. Guha, “Improved Combinatorial Algorithms for the Facility Location and k-Median Problems,” Proc. ACM Symp. Foundations of Computer Science, pp. 378-388, 1999.
[15] M. Charikar, S. Guha, É. Tardos, and D.B. Shmoys, “A Constant Factor Approximation Algorithm for the k-Median Problem,” Proc. Ann. ACM Symp. Theory of Computing, 1999.
[16] F. Chudak, “Improved Approximation Algorithms for Uncapacitated Facility Location,” Proc. Conf. Integer Programming and Combinatorial Optimization, pp. 180-194, 1998.
[17] M. Datar, A. Gionis, P. Indyk, and R. Motwani, “Maintaining Stream Statistics over Sliding Windows,” Proc. ACM Symp. Discrete Algorithms, 2002.
[18] P. Drineas et al., "Clustering in Large Graphs and Matrices," Proc. Symp. Discrete Algorithms, SIAM, Philadelphia, 1999, pp. 291-299.
[19] M. Ester, H. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 226-231, 1996.
[20] F. Farnstrom, J. Lewis, and C. Elkan, “True Scalability for Clustering Algorithms,” SIGKDD Explorations, 2000.
[21] T. Feder and D.H. Greene, “Optimal Algorithms for Approximate Clustering,” Proc. Ann. ACM Symp. Theory of Computing, pp. 434-444, 1988.
[22] J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan, “An Approximate $L_1$-Difference Algorithm for Massive Data Streams,” Proc. 40th Ann. Symp. Foundations of Computer Science, pp. 501-511, 1999.
[23] P. Flajolet and G.N. Martin, “Probabilistic Counting Algorithms for Database Applications,” J. Computer and System Sciences, vol. 31, pp. 182-209, 1985.
[24] A. Frieze, R. Kannan, and S. Vempala, “Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations,” Proc. ACM Symp. Foundations of Computer Science, 1998.
[25] V. Ganti, J. Gehrke, and R. Ramakrishnan, “DEMON: Mining and Monitoring Evolving Data,” Knowledge and Data Eng., vol. 13, no. 1, pp. 50-63, 2001.
[26] P. Gibbons and Y. Matias, “Synopsis Structures for Massive Data Sets,” DIMACS, Series in Discrete Math. and Theoretical Computer Science, A, 1999.
[27] A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. Strauss, “Fast, Small-Space Algorithms for Approximate Histogram Maintenance,” Proc. ACM Symp. Theory of Computing (STOC), 2002.
[28] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss, “How to Summarize the Universe: Dynamic Maintenance of Quantiles,” Proc. Int'l Conf. Very Large Data Bases, 2002.
[29] A. Gilbert, S. Guha, P. Indyk, S. Muthukrishnan, and M. Strauss, “Near-Optimal Sparse Fourier Representations via Sampling,” Proc. ACM Symp. Theory of Computing (STOC), pp. 152-161, 2002.
[30] M. Greenwald and S. Khanna, “Space-Efficient Online Computation of Quantile Summaries,” Proc. SIGMOD, 2001.
[31] S. Guha and S. Khuller, “Greedy Strikes Back: Improved Facility Location Algorithms,” Proc. Ninth ACM-SIAM Symp. Discrete Algorithms, 1998.
[32] S. Guha and N. Koudas, “Approximating a Data Stream for Querying and Estimation: Algorithms and Performance Evaluation,” Proc. Int'l Conf. Data Eng., 2002.
[33] S. Guha, N. Koudas, and K. Shim, “Data Streams and Histograms,” Proc. Symp. Theory of Computing, pp. 471-475, 2001.
[34] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering Data Streams,” Proc. 41st Ann. Symp. Foundations of Computer Science, 2000.
[35] S. Guha, R. Rastogi, and K. Shim, “CURE: An Efficient Clustering Algorithm for Large Databases,” Proc. ACM SIGMOD, pp. 73-84, June 1998.
[36] P.J. Haas, J.F. Naughton, S. Seshadri, and L. Stokes, “Sampling-Based Estimation of the Number of Distinct Values of an Attribute,” Proc. 21st Int'l Conf. Very Large Databases, pp. 311-322, 1995.
[37] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, 2001.
[38] M. Henzinger, P. Raghavan, and S. Rajagopalan, “Computing on Data Streams,” Digital Equipment Corp., Technical Report TR-1998-011, Aug. 1998.
[39] A. Hinneburg and D. Keim, “An Efficient Approach to Clustering Large Multimedia Databases with Noise,” Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining, 1998.
[40] A. Hinneburg and D.A. Keim, "Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering," Proc. 25th Int'l Conf. Very Large Databases, Morgan Kaufmann, 1999, pp. 506-517.
[41] D. Hochbaum and D.B. Shmoys, “A Best Possible Heuristic for the k-Center Problem,” Math. of Operations Research, vol. 10, no. 2, pp. 180-184, 1985.
[42] P. Indyk, “Sublinear Time Algorithms for Metric Space Problems,” Proc. 31st Symp. Theory of Computing, 1999.
[43] P. Indyk, “A Sublinear-Time Approximation Scheme for Clustering in Metric Spaces,” Proc. 40th Symp. Foundations of Computer Science, 1999.
[44] P. Indyk, “Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation,” Proc. 41st Symp. Foundations of Computer Science, pp. 189-197, 2000.
[45] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice Hall, 1988.
[46] K. Jain, M. Mahdian, and A. Saberi, “A New Greedy Approach for Facility Location Problem,” Proc. Ann. ACM Symp. Theory of Computing, 2002.
[47] K. Jain and V. Vazirani, “Primal-Dual Approximation Algorithms for Metric Facility Location and k-Median Problems,” Proc. ACM Symp. Foundations of Computer Science, 1999.
[48] R. Kannan, S. Vempala, and A. Vetta, “On Clusterings: Good, Bad and Spectral,” Proc. ACM Symp. Foundations of Computer Science, pp. 367-377, 2000.
[49] O. Kariv and S.L. Hakimi, “An Algorithmic Approach to Network Location Problems, Part II: $p$-Medians,” SIAM J. Applied Math., pp. 539-560, 1979.
[50] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data. An Introduction to Cluster Analysis. New York: Wiley, 1990.
[51] L.E. Kavraki, J.C. Latombe, R. Motwani, and P. Raghavan, “Randomized Query Processing in Robot Path Planning,” J. Computer and System Sciences, vol. 57, pp. 50-60, 1998.
[52] S. Kolliopoulos and S. Rao, “A Nearly Linear-Time Approximation Scheme for the Euclidean $k$-Median Problem,” Proc. Seventh Ann. European Symp. Algorithms, J. Nesetril, ed., pp. 362-371, July 1999.
[53] J.H. Lin and J.S. Vitter, “Approximation Algorithms for Geometric Median Problems,” Information Processing Letters, vol. 44, pp. 245-249, 1992.
[54] J.H. Lin and J.S. Vitter, “$\epsilon$-Approximations with Minimum Packing Constraint Violations,” Proc. Ann. ACM Symp. Theory of Computing, 1992.
[55] O.L. Mangasarian, “Mathematical Programming in Data Mining,” Data Mining and Knowledge Discovery, vol. 42, no. 1, pp. 183-201, 1997.
[56] G.S. Manku, S. Rajagopalan, and B. Lindsay, “Approximate Medians and Other Quantiles in One Pass and with Limited Memory,” Proc. ACM SIGMOD, 1998.
[57] G.S. Manku, S. Rajagopalan, and B. Lindsay, “Random Sampling Techniques for Space Efficient Computation Of Large Data Sets,” Proc. SIGMOD, June 1999.
[58] D. Marchette, “A Statistical Method for Profiling Network Traffic,” Proc. Workshop Intrusion Detection and Network Monitoring, 1999.
[59] R. Mettu and C.G. Plaxton, “The Online Median Problem,” Proc. ACM Symp. Foundations of Computer Science, 2000.
[60] R. Mettu and C.G. Plaxton, “Optimal Time Bounds for Approximate Clustering,” Proc. Conf. Uncertainty in Artificial Intelligence, 2002.
[61] A. Meyerson, “Online Facility Location,” Proc. ACM Symp. Foundations of Computer Science, 2001.
[62] Discrete Location Theory, P. Mirchandani and R. Francis, eds. New York: John Wiley and Sons, Inc. 1990.
[63] N. Mishra, D. Oblinger, and L. Pitt, “Sublinear Time Approximate Clustering,” Proc. ACM Symp. Discrete Algorithms, 2001.
[64] J. Munro and M. Paterson, “Selection and Sorting with Limited Storage,” Theoretical Computer Science, pp. 315-323, 1980.
[65] K. Nauta and F. Lieble, “Offline Network Intrusion Detection: Looking for Footprints,” SAS White Paper, 2000.
[66] R.T. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proc. 20th Int'l Conf. Very Large Databases, Morgan Kaufmann, 1994, pp. 144-155.
[67] L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani, “Streaming-Data Algorithms for High-Quality Clustering,” Proc. Int'l Conf. Data Eng., 2002.
[68] R. Ostrovsky and Y. Rabani, “Polynomial Time Approximation Schemes for Geometric k-Clustering,” Proc. ACM Symp. Foundations of Computer Science, 2000.
[69] C.M. Procopiuc et al., "A Monte Carlo Algorithm for Fast Projective Clustering," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2002, pp. 418-427.
[70] G. Sheikholeslami, S. Chatterjee, and A. Zhang, “WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases,” Proc. Very Large Data Bases Conf., pp. 428-439, Aug. 1998.
[71] D.B. Shmoys, É. Tardos, and K. Aardal, “Approximation Algorithms for Facility Location Problems (Extended Abstract),” Proc. 29th ACM STOC, pp. 265-274, 1997.
[72] N. Thaper, S. Guha, P. Indyk, and N. Koudas, “Multidimensional Dynamic Histograms,” Proc. ACM SIGMOD Int'l Conf. Management of Data, 2002.
[73] M. Thorup, “Quick k-Median, k-Center, and Facility Location for Sparse Graphs,” Proc. Int'l Colloquium on Automata, Languages, and Programming, pp. 249-260, 2001.
[74] V. Vazirani, Approximation Algorithms. Springer Verlag, 2001.
[75] W. Wang, J. Yang, and R.R. Muntz, "Sting: A Statistical Information Grid Approach to Spatial Data Mining," Proc. 23rd Int'l Conf. Very Large Databases, Morgan Kaufmann, 1997, pp. 186-195.
[76] T. Zhang, R. Ramakrishnan, and M. Livny, "Birch: An Efficient Data Clustering Method for Very Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 1996, pp. 103-114.

Index Terms:
Clustering, data streams, approximation algorithms.
Citation:
Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, Liadan O'Callaghan, "Clustering Data Streams: Theory and Practice," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 515-528, May-June 2003, doi:10.1109/TKDE.2003.1198387