
Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, Liadan O'Callaghan, "Clustering Data Streams: Theory and Practice," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 515-528, May/June 2003.
BibTeX
@article{10.1109/TKDE.2003.1198387,
  author = {Sudipto Guha and Adam Meyerson and Nina Mishra and Rajeev Motwani and Liadan O'Callaghan},
  title = {Clustering Data Streams: Theory and Practice},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  volume = {15},
  number = {3},
  issn = {1041-4347},
  year = {2003},
  pages = {515-528},
  doi = {http://doi.ieeecomputersociety.org/10.1109/TKDE.2003.1198387},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
}
Keywords: Clustering, data streams, approximation algorithms.
Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams.
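The abstract states the constraints (a single pass, little memory) but not the algorithm itself, so the following is only an illustrative sketch of how those constraints can be met, not the authors' method. It reads the stream in fixed-size chunks, reduces each chunk to at most k weighted centers, and re-clusters the retained centers whenever they accumulate. The function names, the `chunk_size` parameter, and the use of weighted k-means (Lloyd iterations) as the clustering subroutine are all assumptions made for illustration.

```python
import random


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def weighted_kmeans(points, weights, k, iters=10, seed=0):
    """Weighted Lloyd iterations (an assumed subroutine).

    Returns (centers, center_weights); empty clusters are dropped.
    """
    if not points:
        return [], []
    rng = random.Random(seed)
    centers = rng.sample(points, min(k, len(points)))
    dim = len(points[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in centers]
        mass = [0.0] * len(centers)
        for p, w in zip(points, weights):
            i = nearest(p, centers)
            mass[i] += w
            for d in range(dim):
                sums[i][d] += w * p[d]
        # Move each non-empty center to the weighted mean of its points.
        centers = [
            tuple(s / mass[i] for s in sums[i]) if mass[i] > 0 else centers[i]
            for i in range(len(centers))
        ]
    # Final assignment pass so the weights match the returned centers.
    mass = [0.0] * len(centers)
    for p, w in zip(points, weights):
        mass[nearest(p, centers)] += w
    kept = [(c, m) for c, m in zip(centers, mass) if m > 0]
    return [c for c, _ in kept], [m for _, m in kept]


def stream_cluster(stream, k, chunk_size=100):
    """Single pass over stream, holding at most about 2 * chunk_size points."""
    summary_pts, summary_wts = [], []
    chunk = []
    for p in stream:
        chunk.append(tuple(p))
        if len(chunk) == chunk_size:
            # Reduce the full chunk to at most k weighted centers.
            c, w = weighted_kmeans(chunk, [1.0] * len(chunk), k)
            summary_pts += c
            summary_wts += w
            chunk = []
            # If the retained centers have piled up, re-cluster them too.
            if len(summary_pts) >= chunk_size:
                summary_pts, summary_wts = weighted_kmeans(
                    summary_pts, summary_wts, k
                )
    # Combine the retained centers with any leftover points.
    pts = summary_pts + chunk
    wts = summary_wts + [1.0] * len(chunk)
    return weighted_kmeans(pts, wts, k)


if __name__ == "__main__":
    rng = random.Random(1)
    data = ((rng.gauss(10 * (i % 2), 1), rng.gauss(10 * (i % 2), 1))
            for i in range(1000))
    centers, weights = stream_cluster(data, k=2, chunk_size=100)
    print(centers, weights)
```

Because each center carries the number of original points it represents, the total weight is preserved through every reduction step, which is what lets later rounds treat a center as a stand-in for the points it absorbed.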