Issue No. 04 - April (2010, vol. 22)

pp: 493-507

Ying Zhang, University of New South Wales and NICTA, Sydney

Xuemin Lin, University of New South Wales and NICTA, Sydney

Yidong Yuan, University of New South Wales and NICTA, Sydney

Masaru Kitsuregawa, University of Tokyo, Tokyo

Xiaofang Zhou, University of Queensland, Brisbane

Jeffrey Xu Yu, Chinese University of Hong Kong, Hong Kong

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2009.68

ABSTRACT

Duplicates in data streams often arise from projection onto a subspace and/or from multiple recordings of the same object. When observed data elements cannot be assumed unique, many conventional aggregate computation problems need further investigation because they are duplicate-sensitive. In this paper, we present novel, space-efficient, one-scan algorithms that continuously maintain duplicate-insensitive order sketches, so that rank-based queries can be processed approximately with a relative rank error guarantee $\epsilon$ in the presence of duplicates. In addition to being space-efficient, the proposed algorithms are time-efficient and highly accurate. Moreover, our techniques can be applied directly to the heavy hitter problem over distinct elements and to existing fault-tolerant distributed communication techniques. A comprehensive performance study demonstrates that our algorithms can support real-time computation over high-speed data streams.
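
To make the terms in the abstract concrete, the following minimal Python sketch contrasts a duplicate-sensitive rank with a duplicate-insensitive one, and shows one common reading of a relative rank error guarantee (a reported rank r' for true rank r is acceptable when |r' - r| <= epsilon * r). It is an exact, memory-unbounded baseline written for illustration only, not the paper's sketch algorithm; the function names and the sample stream are hypothetical.

```python
from bisect import bisect_right


def multiset_rank(stream, x):
    """Duplicate-sensitive rank: number of stream elements <= x."""
    return sum(1 for v in stream if v <= x)


def distinct_rank(stream, x):
    """Duplicate-insensitive rank: number of DISTINCT values <= x.

    Exact baseline that stores every distinct value; a real order
    sketch would approximate this in far less space.
    """
    distinct_sorted = sorted(set(stream))
    return bisect_right(distinct_sorted, x)


def within_relative_error(true_rank, reported_rank, eps):
    """Assumed reading of the guarantee: |r' - r| <= eps * r."""
    return abs(reported_rank - true_rank) <= eps * true_rank


# Hypothetical stream in which several objects are recorded repeatedly.
stream = [5, 3, 5, 8, 3, 3, 9, 1, 8]

print(multiset_rank(stream, 5))   # counts repeats of 3 and 5
print(distinct_rank(stream, 5))   # counts each of 1, 3, 5 once
print(within_relative_error(100, 104, 0.05))
print(within_relative_error(10, 11, 0.05))
```

Under this reading the tolerated deviation shrinks with the rank itself, so an error of one position is acceptable at rank 100 but not at rank 10 when epsilon is 0.05.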

INDEX TERMS

Order statistic, data stream, duplicate insensitive, relative error.

CITATION

Ying Zhang, Xuemin Lin, Yidong Yuan, Masaru Kitsuregawa, Xiaofang Zhou, Jeffrey Xu Yu, "Duplicate-Insensitive Order Statistics Computation over Data Streams",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 22, no. 4, pp. 493-507, April 2010, doi:10.1109/TKDE.2009.68