Duplicate-Insensitive Order Statistics Computation over Data Streams
April 2010 (vol. 22 no. 4)
pp. 493-507
Ying Zhang, University of New South Wales and NICTA, Sydney
Xuemin Lin, University of New South Wales and NICTA, Sydney
Yidong Yuan, University of New South Wales and NICTA, Sydney
Masaru Kitsuregawa, University of Tokyo, Tokyo
Xiaofang Zhou, University of Queensland, Brisbane
Jeffrey Xu Yu, Chinese University of Hong Kong, Hong Kong
Duplicates in data streams often arise from projection onto a subspace or from multiple recordings of the same object. Without the uniqueness assumption on observed data elements, many conventional aggregate computation problems need to be further investigated because of their duplicate-sensitive nature. In this paper, we present novel, space-efficient, one-scan algorithms that continuously maintain duplicate-insensitive order sketches, so that rank-based queries can be approximately processed with a relative rank error guarantee ε in the presence of data duplicates. Besides being space-efficient, the proposed algorithms are time-efficient and highly accurate. Moreover, our techniques can be applied directly to the heavy hitter problem over distinct elements and to existing fault-tolerant distributed communication techniques. A comprehensive performance study demonstrates that our algorithms can support real-time computation over high-speed data streams.
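
To make the terms in the abstract concrete, the following is a minimal Python sketch of what "duplicate-insensitive" ranks mean. It is not the paper's algorithm: it is an exact baseline that stores every distinct value (linear space), whereas the paper's sketches are space-efficient approximations. The class name `ExactDistinctRanks`, the helper `within_relative_error`, and the reading of the guarantee as |r' - r| <= ε·r for a queried rank r and an answered rank r' are assumptions made here for illustration.

```python
from bisect import bisect_left, insort


class ExactDistinctRanks:
    """Exact, linear-space baseline: ranks are taken over distinct values only.

    Hypothetical illustration, not the sketch structure proposed in the paper.
    """

    def __init__(self):
        self._seen = set()   # distinct values observed so far
        self._sorted = []    # the same values, kept in ascending order

    def insert(self, value):
        # A repeated value changes nothing, so the result is duplicate-insensitive.
        if value not in self._seen:
            self._seen.add(value)
            insort(self._sorted, value)

    def distinct_count(self):
        return len(self._sorted)

    def rank(self, value):
        # 1-based rank of a previously seen value among the distinct values.
        return bisect_left(self._sorted, value) + 1

    def select(self, r):
        # Element whose 1-based rank among the distinct values is r.
        return self._sorted[r - 1]


def within_relative_error(true_rank, answered_rank, eps):
    # Assumed form of the relative rank error guarantee: |r' - r| <= eps * r.
    return abs(answered_rank - true_rank) <= eps * true_rank


stream = [5, 3, 5, 9, 3, 7, 5, 1]   # 8 arrivals, 5 distinct values
summary = ExactDistinctRanks()
for x in stream:
    summary.insert(x)

median = summary.select((summary.distinct_count() + 1) // 2)
```

Here the median is taken over the distinct values 1, 3, 5, 7, 9, so the three copies of 5 and two copies of 3 carry no extra weight. An approximate sketch would answer the same query with an element whose rank passes a check like `within_relative_error`, while using far less memory than this baseline.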


Index Terms:
Order statistic, data stream, duplicate insensitive, relative error.
Citation:
Ying Zhang, Xuemin Lin, Yidong Yuan, Masaru Kitsuregawa, Xiaofang Zhou, Jeffrey Xu Yu, "Duplicate-Insensitive Order Statistics Computation over Data Streams," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 4, pp. 493-507, April 2010, doi:10.1109/TKDE.2009.68