Domain-Driven Data Synopses for Dynamic Quantiles
July 2005 (vol. 17 no. 7)
pp. 927-938
In this paper, we present new algorithms for dynamically computing quantiles of a relation subject to insert as well as delete operations. At the core of our algorithms lies a small-space multiresolution representation of the underlying data distribution based on random subset sums (RSSs). These RSSs are updated with every insert and delete operation. When quantiles are requested, we use the RSSs to quickly estimate all the quantiles without accessing the data, each guaranteed to be accurate to within a user-specified precision. While quantiles have found many uses in databases, our focus in this paper is primarily on network management applications that monitor the distribution of active sessions in the network. Our examples are drawn from both telephony and IP networks, where the goal is to monitor the distribution of the lengths of active calls and IP flows, respectively, over time. For such applications, we propose a new type of histogram that uses RSSs to summarize the dynamic parts of the distribution, while parts with a small volume of sessions are approximated using simple counters.
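
To make the multiresolution idea concrete, the following is a minimal sketch of quantile tracking under inserts and deletes over dyadic intervals of an integer domain. It is not the paper's method: it keeps an exact counter per dyadic interval where the paper would keep a small-space RSS estimate, so it shows only the update and query pattern, not the space or accuracy guarantees. The class name `DyadicCounts`, its methods, and the assumption that values lie in `[0, 2**bits)` are all hypothetical choices made for this illustration.

```python
import math


class DyadicCounts:
    """Exact counts for every dyadic interval of the domain [0, 2**bits).

    Assumption: exact counters stand in for the RSS-based estimates
    described in the abstract, so this is not a small-space summary.
    """

    def __init__(self, bits):
        self.bits = bits
        # levels[l][k] counts stored values v with v >> l == k,
        # i.e. values in the dyadic interval [k * 2**l, (k + 1) * 2**l).
        self.levels = [dict() for _ in range(bits + 1)]
        self.total = 0

    def update(self, value, delta):
        """Apply an insert (delta=+1) or a delete (delta=-1) of value."""
        if not 0 <= value < (1 << self.bits):
            raise ValueError("value outside the domain")
        for level, counts in enumerate(self.levels):
            key = value >> level
            counts[key] = counts.get(key, 0) + delta
        self.total += delta

    def rank(self, value):
        """Number of stored items strictly less than value."""
        # [0, value) splits into one dyadic interval per set bit of value.
        count = 0
        for level in range(self.bits + 1):
            if (value >> level) & 1:
                count += self.levels[level].get((value >> level) - 1, 0)
        return count

    def quantile(self, phi):
        """Smallest value whose rank (counting itself) reaches phi * total."""
        if self.total <= 0:
            raise ValueError("no items stored")
        target = max(1, math.ceil(phi * self.total))
        key = 0  # start at the root interval covering the whole domain
        for level in range(self.bits - 1, -1, -1):
            left = self.levels[level].get(2 * key, 0)
            if target <= left:
                key = 2 * key          # answer lies in the left half
            else:
                target -= left         # skip the left half
                key = 2 * key + 1      # answer lies in the right half
        return key


# Usage: session lengths arrive and later expire.
summary = DyadicCounts(bits=4)
for length in [3, 7, 7, 10, 12]:
    summary.update(length, +1)
median_before = summary.quantile(0.5)
summary.update(7, -1)
summary.update(7, -1)
median_after = summary.quantile(0.5)
```

Each update touches one counter per resolution level, and a quantile query walks from the coarsest interval down to a single value without rescanning the data. A small-space variant would replace each per-level dictionary lookup with an approximate estimate, at the cost of a bounded error in the returned rank.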

Index Terms:
Quantiles, database statistics, data streams.
Citation:
Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss, "Domain-Driven Data Synopses for Dynamic Quantiles," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 7, pp. 927-938, July 2005, doi:10.1109/TKDE.2005.108