Issue No.04 - April (2010 vol.22)
pp: 493-507
Xuemin Lin , University of New South Wales and NICTA, Sydney
Yidong Yuan , University of New South Wales and NICTA, Sydney
Masaru Kitsuregawa , University of Tokyo, Tokyo
Xiaofang Zhou , University of Queensland, Brisbane
Jeffrey Xu Yu , Chinese University of Hong Kong, Hong Kong
ABSTRACT
Duplicates often arise in data streams when elements are projected onto a subspace or when the same object is recorded multiple times. Once observed data elements can no longer be assumed unique, many conventional aggregate computation problems need to be reinvestigated because they are duplication-sensitive. In this paper, we present novel, space-efficient, one-scan algorithms that continuously maintain duplicate-insensitive order sketches, so that rank-based queries can be approximately processed with a relative rank error guarantee ε in the presence of duplicates. Besides being space-efficient, the proposed algorithms are time-efficient and highly accurate. Moreover, our techniques can be immediately applied to the heavy hitter problem over distinct elements and to existing fault-tolerant distributed communication techniques. A comprehensive performance study demonstrates that our algorithms can support real-time computation over high-speed data streams.
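To make the terms in the abstract concrete, the following is a minimal sketch of what a duplicate-insensitive rank and a relative rank error check might look like. It is not the paper's sketch algorithm: it stores every distinct element exactly, so it is only a brute-force baseline for the definitions. The function names are hypothetical, and the error criterion |r' - r| <= ε·r is assumed here as the usual reading of a relative rank error guarantee.

```python
from bisect import bisect_left


def distinct_rank(stream, value):
    """1-based rank of `value` among the DISTINCT elements of `stream`.

    Exact baseline: keeps all distinct elements, unlike a space-efficient sketch.
    """
    distinct = sorted(set(stream))
    # Number of distinct elements strictly smaller than `value`, plus one.
    return bisect_left(distinct, value) + 1


def duplicate_sensitive_rank(stream, value):
    """1-based rank of `value` when every repeated occurrence is counted."""
    return bisect_left(sorted(stream), value) + 1


def within_relative_error(true_rank, reported_rank, epsilon):
    """Assumed criterion: |reported - true| <= epsilon * true."""
    return abs(reported_rank - true_rank) <= epsilon * true_rank


# Hypothetical stream with repeated recordings of the same values.
stream = [5, 3, 5, 8, 3, 3, 9, 8]
print(distinct_rank(stream, 8))             # rank among {3, 5, 8, 9}
print(duplicate_sensitive_rank(stream, 8))  # rank inflated by repeated 3s and 5s
```

Under this criterion the tolerated absolute error grows with the rank, so low ranks must be answered more precisely than high ones.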
INDEX TERMS
Order statistic, data stream, duplicate insensitive, relative error.
CITATION
Xuemin Lin, Yidong Yuan, Masaru Kitsuregawa, Xiaofang Zhou, Jeffrey Xu Yu, "Duplicate-Insensitive Order Statistics Computation over Data Streams," IEEE Transactions on Knowledge & Data Engineering, vol. 22, no. 4, pp. 493-507, April 2010, doi:10.1109/TKDE.2009.68