Making Aggregation Work in Uncertain and Probabilistic Databases
August 2011 (vol. 23 no. 8)
pp. 1261-1273
Raghotham Murthy, Stanford University, Stanford
Robert Ikeda, Stanford University, Stanford
Jennifer Widom, Stanford University, Stanford
We describe how aggregation is handled in the Trio system for uncertain and probabilistic data. Because “exact” aggregation in uncertain databases can produce exponentially sized results, we provide three alternatives: a low bound on the aggregate value, a high bound on the value, and the expected value. These variants return a single result instead of a set of possible results, and they are generally efficient to compute for both full-table and grouped aggregation queries. We provide formal definitions and semantics and a description of our open-source implementation for single-table aggregation queries. We study the performance and scalability of our algorithms through experiments over a large synthetic data set. We also provide some preliminary results on aggregations over joins.
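To make the three variants concrete, the following minimal Python sketch computes the low bound, high bound, and expected value of SUM over a toy uncertain table, and compares them with a brute-force "exact" result. It is an illustration under simplifying assumptions, not Trio's implementation or the TriQL interface: each uncertain tuple is modeled as a list of mutually exclusive (value, probability) alternatives, tuples are assumed independent, a tuple whose probabilities sum to less than 1 may be absent, and SUM over no tuples is taken to be 0. All function and variable names are hypothetical.

```python
from itertools import product

def options(xtuple):
    """Alternatives of one uncertain tuple, plus an 'absent' option
    (value None) when the probabilities sum to less than 1."""
    missing = 1.0 - sum(p for _, p in xtuple)
    return list(xtuple) + ([(None, missing)] if missing > 1e-12 else [])

def contribution(value):
    """An absent tuple contributes 0 to SUM (assumption of this sketch)."""
    return 0 if value is None else value

def sum_low(table):
    """Low bound: choose the smallest contribution from every tuple."""
    return sum(min(contribution(v) for v, _ in options(xt)) for xt in table)

def sum_high(table):
    """High bound: choose the largest contribution from every tuple."""
    return sum(max(contribution(v) for v, _ in options(xt)) for xt in table)

def sum_expected(table):
    """Expected value: by linearity of expectation, add up value * probability."""
    return sum(v * p for xt in table for v, p in xt)

def sum_exact(table):
    """Brute force: distribution of SUM over all possible worlds.
    The number of worlds grows exponentially with the number of tuples."""
    dist = {}
    for world in product(*(options(xt) for xt in table)):
        total = sum(contribution(v) for v, _ in world)
        prob = 1.0
        for _, p in world:
            prob *= p
        dist[total] = dist.get(total, 0.0) + prob
    return dist

# Hypothetical data: three uncertain counts.
table = [
    [(2, 0.5), (3, 0.5)],  # either 2 or 3
    [(1, 0.6)],            # 1 with probability 0.6, otherwise absent
    [(4, 1.0)],            # certainly 4
]

low, high, expected = sum_low(table), sum_high(table), sum_expected(table)
exact = sum_exact(table)
```

The three single-valued variants each make one pass over the table, while `sum_exact` enumerates every combination of alternatives. On this toy table the bounds coincide with the smallest and largest sums that appear in the exact distribution.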


Index Terms:
Database management, query processing.
Citation:
Raghotham Murthy, Robert Ikeda, Jennifer Widom, "Making Aggregation Work in Uncertain and Probabilistic Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 8, pp. 1261-1273, Aug. 2011, doi:10.1109/TKDE.2010.166