Issue No.08 - August (2010 vol.22)
pp: 1142-1157
Graham Cormode, AT&T Labs-Research, Florham Park, NJ
Minos Garofalakis, Technical University of Crete, Kounoupidiana
ABSTRACT
There is a growing realization that uncertain information is a first-class citizen in modern database management. As such, we need techniques to correctly and efficiently process uncertain data in database systems. In particular, data reduction techniques that can produce concise, accurate synopses of large probabilistic relations are crucial. Similar to their deterministic relation counterparts, such compact probabilistic data synopses can form the foundation for human understanding and interactive data exploration, probabilistic query planning and optimization, and fast approximate query processing in probabilistic database systems. In this paper, we introduce definitions and algorithms for building histogram- and Haar wavelet-based synopses on probabilistic data. The core problem is to choose a set of histogram bucket boundaries or wavelet coefficients to optimize the accuracy of the approximate representation of a collection of probabilistic tuples under a given error metric. For a variety of different error metrics, we devise efficient algorithms that construct optimal or near-optimal size-B histogram and wavelet synopses. This requires careful analysis of the structure of the probability distributions, and novel extensions of known dynamic-programming-based techniques for the deterministic domain. Our experiments show that this approach clearly outperforms simple ideas, such as building summaries for samples drawn from the data distribution, while taking equal or less time.
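As background for the "known dynamic-programming-based techniques for the deterministic domain" mentioned above, the following is a minimal sketch of the classical dynamic program that picks bucket boundaries to minimize the sum of squared errors, with each bucket represented by its mean. It is not the paper's algorithm for probabilistic data. The function name `v_optimal_histogram` and the choice to feed it a plain list of numbers (for instance, a vector of expected item frequencies) are assumptions made for illustration only.

```python
from typing import List, Tuple


def v_optimal_histogram(values: List[float], num_buckets: int) -> Tuple[float, List[int]]:
    """Deterministic sum-squared-error histogram via dynamic programming.

    Hypothetical helper for illustration; not taken from the paper.
    Returns (minimum total squared error, list of bucket start indices).
    """
    n = len(values)
    if n == 0 or num_buckets < 1:
        raise ValueError("need at least one value and one bucket")
    b = min(num_buckets, n)  # more buckets than values cannot help

    # Prefix sums let us evaluate the error of any bucket in O(1).
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] + v
        prefix_sq[i + 1] = prefix_sq[i] + v * v

    def sse(i: int, j: int) -> float:
        """Squared error of values[i:j] when replaced by their mean."""
        s = prefix[j] - prefix[i]
        sq = prefix_sq[j] - prefix_sq[i]
        return sq - s * s / (j - i)

    inf = float("inf")
    # cost[k][j]: best error covering values[0:j] with exactly k buckets.
    cost = [[inf] * (n + 1) for _ in range(b + 1)]
    back = [[0] * (n + 1) for _ in range(b + 1)]
    cost[0][0] = 0.0
    for k in range(1, b + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):  # last bucket is values[i:j]
                c = cost[k - 1][i] + sse(i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    back[k][j] = i

    # Walk the back-pointers to recover where each bucket starts.
    starts = []
    j = n
    for k in range(b, 0, -1):
        j = back[k][j]
        starts.append(j)
    starts.reverse()
    return cost[b][n], starts


if __name__ == "__main__":
    error, starts = v_optimal_histogram([1, 1, 1, 5, 5, 5], 2)
    print(error, starts)
```

This sketch runs in O(B n^2) time for n values and B buckets. Adapting it to probabilistic tuples would require replacing the per-bucket error with one defined over the tuples' probability distributions, which is the kind of analysis the abstract describes.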
INDEX TERMS
Histograms, wavelets, probabilistic data.
CITATION
Graham Cormode, Minos Garofalakis, "Histograms and Wavelets on Probabilistic Data", IEEE Transactions on Knowledge & Data Engineering, vol.22, no. 8, pp. 1142-1157, August 2010, doi:10.1109/TKDE.2010.66