
Graham Cormode, Mayur Datar, Piotr Indyk, S. Muthukrishnan, "Comparing Data Streams Using Hamming Norms (How to Zero In)," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 529-540, May/June 2003.
BibTeX
@article{10.1109/TKDE.2003.1198388,
  author    = {Graham Cormode and Mayur Datar and Piotr Indyk and S. Muthukrishnan},
  title     = {Comparing Data Streams Using Hamming Norms (How to Zero In)},
  journal   = {IEEE Transactions on Knowledge and Data Engineering},
  volume    = {15},
  number    = {3},
  issn      = {1041-4347},
  year      = {2003},
  pages     = {529--540},
  doi       = {http://doi.ieeecomputersociety.org/10.1109/TKDE.2003.1198388},
  publisher = {IEEE Computer Society},
  address   = {Los Alamitos, CA, USA},
}
Keywords: Data stream analysis, approximate query processing, data structures and algorithms, data reduction.
Abstract—Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases and instead must be processed “on the fly” as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams and, hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalizes ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the “l_0
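To make the two uses of the Hamming norm concrete, the following is a minimal sketch that computes it exactly. It is not the approximation technique the abstract proposes: it keeps one counter per distinct item, which is the memory cost a streaming approximation is meant to avoid. The `(item, delta)` update format and the function names `hamming_norm` and `hamming_distance` are assumptions made for illustration.

```python
from collections import Counter
from itertools import chain


def hamming_norm(stream):
    """Exact Hamming norm: the number of items whose net count is nonzero.

    `stream` is an iterable of (item, delta) updates (an assumed format).
    With only positive deltas this is the number of distinct items.
    """
    counts = Counter()
    for item, delta in stream:
        counts[item] += delta
    return sum(1 for c in counts.values() if c != 0)


def hamming_distance(stream_a, stream_b):
    """Number of items whose counts differ between the two streams.

    Computed as the Hamming norm of the difference: updates from
    stream_b are applied with their sign flipped.
    """
    negated_b = ((item, -delta) for item, delta in stream_b)
    return hamming_norm(chain(stream_a, negated_b))


a = [("x", 1), ("y", 2), ("z", 1)]
b = [("x", 1), ("y", 1), ("w", 3)]

print(hamming_norm(a))         # 3 distinct items: x, y, z
print(hamming_distance(a, b))  # 3 unequal counts: y, z, w
```

Treating the pairwise comparison as the norm of a difference stream means one routine serves both cases, provided updates are allowed to carry negative deltas.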