Using Datacube Aggregates for Approximate Querying and Deviation Detection
November 2005 (vol. 17 no. 11)
pp. 1465-1477
Much research has been devoted to the efficient computation of relational aggregations and, specifically, the efficient execution of the datacube operation. In this paper, we consider the inverse problem, that of deriving (approximately) the original data from the aggregates. We motivate this problem in the context of two specific application areas, approximate query answering and data analysis. We propose a framework based on the notion of information entropy that enables us to estimate the original values in a data set, given only aggregated information about it. We then show how approximate queries on the data from which the aggregates were derived can be performed using our framework. We also describe an alternate use of the proposed framework that enables us to identify values that deviate from the underlying data distribution, suitable for data mining purposes. We present a detailed performance study of the algorithms using both real and synthetic data, highlighting the benefits of our approach as well as the efficiency of the proposed solutions. Finally, we evaluate our techniques with a case study on a real data set, which illustrates the applicability of our approach.
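As a rough illustration of the idea of recovering data from aggregates, the sketch below reconstructs a two-dimensional table from only its row and column totals using iterative proportional fitting, a standard technique that yields the maximum-entropy table consistent with the given marginals. It then flags cells whose actual value lies far from the estimate. This is not necessarily the algorithm used in the paper: the function names `ipf_2d` and `find_deviations`, the sample table, and the absolute-difference threshold are all assumptions made for illustration.

```python
def ipf_2d(row_sums, col_sums, iterations=50):
    """Estimate a 2-D table from its row and column totals.

    Hypothetical helper: starts from a uniform table and alternately
    rescales rows and columns so they match the given marginals.
    """
    n_rows, n_cols = len(row_sums), len(col_sums)
    # A uniform start carries no information beyond the constraints.
    est = [[1.0] * n_cols for _ in range(n_rows)]
    for _ in range(iterations):
        # Rescale each row to match its known total.
        for i in range(n_rows):
            s = sum(est[i])
            if s > 0:
                factor = row_sums[i] / s
                est[i] = [v * factor for v in est[i]]
        # Rescale each column to match its known total.
        for j in range(n_cols):
            s = sum(est[i][j] for i in range(n_rows))
            if s > 0:
                factor = col_sums[j] / s
                for i in range(n_rows):
                    est[i][j] *= factor
    return est


def find_deviations(actual, estimate, threshold):
    """Return (row, col) positions where |actual - estimate| > threshold.

    The absolute-difference test is an assumed, simplistic criterion.
    """
    return [
        (i, j)
        for i, row in enumerate(actual)
        for j, value in enumerate(row)
        if abs(value - estimate[i][j]) > threshold
    ]


# Illustrative data (made up): the last cell is unusually large.
actual = [
    [10, 20, 30],
    [20, 40, 60],
    [30, 60, 200],
]
row_sums = [sum(row) for row in actual]
col_sums = [sum(col) for col in zip(*actual)]

estimate = ipf_2d(row_sums, col_sums)
outliers = find_deviations(actual, estimate, threshold=20)
```

With only one-dimensional marginals, the estimate reduces to the independence model (row total times column total divided by the grand total). Supplying higher-order aggregates as additional constraints would tighten the estimate, which is the setting the abstract describes.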


Index Terms: Data warehouse, datacube, approximate query answering, deviation detection.
Themis Palpanas, Nick Koudas, Alberto Mendelzon, "Using Datacube Aggregates for Approximate Querying and Deviation Detection," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1465-1477, Nov. 2005, doi:10.1109/TKDE.2005.187