Approximate Processing of Massive Continuous Quantile Queries over High-Speed Data Streams
May 2006 (vol. 18 no. 5)
pp. 683-698
Quantile computation has many applications including data mining and financial data analysis. It has been shown that an $\epsilon$-approximate summary can be maintained so that, given a quantile query $(\phi, \epsilon)$, the data item at rank $\lceil \phi N \rceil$ may be approximately obtained within the rank error precision $\epsilon N$ over all $N$ data items in a data stream or in a sliding window. However, scalable online processing of massive continuous quantile queries with different $\phi$ and $\epsilon$ poses a new challenge because the summary is continuously updated with new arrivals of data items. In this paper, first we aim to dramatically reduce the number of distinct query results by grouping a set of different queries into a cluster so that they can be processed virtually as a single query while the precision requirements from users can be retained. Second, we aim to minimize the total query processing costs. Efficient algorithms are developed to minimize the total number of times for reprocessing clusters and to produce the minimum number of clusters, respectively. The techniques are extended to maintain near-optimal clustering when queries are registered and removed in an arbitrary fashion against whole data streams or sliding windows. In addition to theoretical analysis, our performance study indicates that the proposed techniques are indeed scalable with respect to the number of input queries as well as the number of items and the item arrival rate in a data stream.
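
The idea of answering several queries with one shared result can be illustrated with a small sketch. It is not the paper's algorithm: it assumes ranks can be looked up exactly (an approximate summary would shrink each query's tolerance by the summary's own error), and it uses the classical greedy interval-stabbing rule. Each query $(\phi, \epsilon)$ is treated as the interval $[\phi - \epsilon, \phi + \epsilon]$ of acceptable rank fractions, and queries whose intervals share a common point are placed in one cluster. The function name `cluster_queries` and the tuple layout are hypothetical, chosen only for this illustration.

```python
from typing import List, Tuple

Query = Tuple[float, float]  # (phi, epsilon), both as fractions of N


def cluster_queries(queries: List[Query]) -> List[Tuple[float, List[Query]]]:
    """Group queries so that each group shares one acceptable rank fraction.

    Illustrative sketch only (hypothetical helper, not from the paper).
    Assumes exact rank lookup; every query accepts any rank fraction in
    [phi - eps, phi + eps], clamped to [0, 1].
    Returns a list of (shared_point, member_queries) pairs.
    """
    # Process intervals in order of their right endpoint.
    ordered = sorted(queries, key=lambda q: min(1.0, q[0] + q[1]))
    clusters: List[Tuple[float, List[Query]]] = []
    for phi, eps in ordered:
        lo = max(0.0, phi - eps)
        hi = min(1.0, phi + eps)
        if clusters and lo <= clusters[-1][0]:
            # The current shared point already lies inside this interval.
            clusters[-1][1].append((phi, eps))
        else:
            # Open a new cluster stabbed at this interval's right endpoint.
            clusters.append((hi, [(phi, eps)]))
    return clusters


if __name__ == "__main__":
    demo = [(0.5, 0.1), (0.55, 0.1), (0.9, 0.05)]
    for point, members in cluster_queries(demo):
        print(point, members)
```

Under these assumptions, one lookup at rank $\lceil pN \rceil$ for a cluster's shared point $p$ satisfies every member query, so the number of distinct results to maintain drops from the number of queries to the number of clusters. For intervals on a line, this greedy rule yields the fewest possible stabbing points.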

Index Terms:
Query processing, online computation, data mining.
Citation:
Xuemin Lin, Jian Xu, Qing Zhang, Hongjun Lu, Jeffrey Xu Yu, Xiaofang Zhou, Yidong Yuan, "Approximate Processing of Massive Continuous Quantile Queries over High-Speed Data Streams," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 5, pp. 683-698, May 2006, doi:10.1109/TKDE.2006.73