Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data
November/December 2003 (vol. 15 no. 6)
pp. 1409-1421

Abstract: We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model. In the maximum entropy model, we treat itemsets as constraints on the distribution of the query variables and use the maximum entropy principle to build a joint probability model for the query attributes online. In the inclusion-exclusion model, itemsets and their frequencies are stored in a data structure, called an ADtree, that supports an efficient implementation of the inclusion-exclusion principle for answering the query. We empirically compare these two itemset-based models to direct querying of the original data, querying of samples of the original data, and other probabilistic models such as the independence model, the Chow-Liu tree model, and the Bernoulli mixture model. These models are able to handle high dimensionality (hundreds or thousands of attributes), whereas most other work on this topic has focused on relatively low-dimensional OLAP problems. Experimental results on both simulated and real-world transaction data sets illustrate various fundamental trade-offs between approximation error, model complexity, and the online time required to compute a query answer.
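
As a rough illustration of the difference between the baseline independence model and the inclusion-exclusion idea mentioned in the abstract, the sketch below answers one conjunctive query on a toy data set in three ways. It is not the authors' implementation: the data, the function names (`frequency`, `exact_answer`, `independence_answer`, `inclusion_exclusion_answer`), and the query are all hypothetical, itemset frequencies are recomputed by scanning the data rather than read from an ADtree, and every needed itemset frequency is assumed to be available (with only frequent itemsets stored, some terms would be missing and the answer would be approximate).

```python
from itertools import combinations

# Hypothetical toy data: each transaction is the set of items it contains.
transactions = [
    {"a", "b"},
    {"a", "b", "c"},
    {"a"},
    {"c"},
    {"b", "c"},
    {"a", "b"},
    set(),
    {"a", "c"},
]


def frequency(itemset, data):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in data) / len(data)


def exact_answer(present, absent, data):
    """Direct querying: scan the data and count matching transactions."""
    present, absent = set(present), set(absent)
    matches = sum(present <= t and not (absent & t) for t in data)
    return matches / len(data)


def independence_answer(present, absent, data):
    """Baseline: multiply single-item marginals as if items were independent."""
    p = 1.0
    for item in present:
        p *= frequency({item}, data)
    for item in absent:
        p *= 1.0 - frequency({item}, data)
    return p


def inclusion_exclusion_answer(present, absent, data):
    """Sum (-1)^|S| * f(present | S) over all subsets S of the absent items."""
    present, absent = set(present), sorted(absent)
    total = 0.0
    for k in range(len(absent) + 1):
        for subset in combinations(absent, k):
            total += (-1) ** k * frequency(present | set(subset), data)
    return total


# Query: fraction of transactions with item "a" present and item "c" absent.
query = ({"a"}, {"c"})
print("exact:              ", exact_answer(*query, transactions))
print("independence:       ", independence_answer(*query, transactions))
print("inclusion-exclusion:", inclusion_exclusion_answer(*query, transactions))
```

On this toy data the inclusion-exclusion answer matches the direct count because all itemset frequencies are known, while the independence estimate differs because "a" and "c" co-occur less often than independence would predict. The number of terms grows exponentially in the number of absent items, which is one reason an efficient supporting data structure matters.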

Index Terms:
Binary transaction data, query approximation, probabilistic model, itemsets, ADTree, maximum entropy.
Citation:
Dmitry Pavlov, Heikki Mannila, Padhraic Smyth, "Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 6, pp. 1409-1421, Nov.-Dec. 2003, doi:10.1109/TKDE.2003.1245281