
Dmitry Pavlov, Heikki Mannila, Padhraic Smyth, "Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 6, pp. 1409-1421, November/December 2003.
DOI: 10.1109/TKDE.2003.1245281
Index Terms: Binary transaction data, query approximation, probabilistic model, itemsets, ADTree, maximum entropy.
Abstract: We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model. In the maximum entropy model, we treat itemsets as constraints on the distribution of the query variables and use the maximum entropy principle to build a joint probability model for the query attributes online. In the inclusion-exclusion model, itemsets and their frequencies are stored in a data structure, called an ADTree, that supports an efficient implementation of the inclusion-exclusion principle in order to answer the query. We empirically compare these two itemset-based models to direct querying of the original data, querying of samples of the original data, as well as other probabilistic models such as the independence model, the Chow-Liu tree model, and the Bernoulli mixture model. These models are able to handle high dimensionality (hundreds or thousands of attributes), whereas most other work on this topic has focused on relatively low-dimensional OLAP problems. Experimental results on both simulated and real-world transaction data sets illustrate various fundamental trade-offs between approximation error, model complexity, and the online time required to compute a query answer.
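To make the contrast between the independence baseline and the inclusion-exclusion idea concrete, here is a minimal Python sketch on a toy data set. It is an illustration under stated assumptions, not the implementation described in the article: the transactions, the function names, and the use of a plain dictionary holding the frequency of every itemset (rather than an ADTree holding only frequent itemsets) are all hypothetical choices made for brevity.

```python
from itertools import combinations

# Hypothetical toy data: each transaction is the set of items it contains.
TRANSACTIONS = [
    frozenset("ab"), frozenset("ab"), frozenset("a"), frozenset("bc"),
    frozenset("c"), frozenset("abc"), frozenset(), frozenset("ac"),
]


def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def independence_estimate(transactions, positive, negative=()):
    """Baseline: treat attributes as independent and multiply marginals."""
    p = 1.0
    for item in positive:
        p *= support(transactions, [item])
    for item in negative:
        p *= 1.0 - support(transactions, [item])
    return p


def inclusion_exclusion(freq, positive, negative=()):
    """P(all `positive` present, all `negative` absent) from itemset frequencies.

    `freq` maps frozenset -> frequency and must contain positive | S
    for every subset S of `negative` (an assumption of this sketch).
    """
    base = frozenset(positive)
    total = 0.0
    for k in range(len(negative) + 1):
        for subset in combinations(negative, k):
            total += (-1) ** k * freq[base | frozenset(subset)]
    return total


# Assumption: store the frequency of every itemset over {a, b, c} in a dict.
ITEMS = "abc"
FREQ = {
    frozenset(s): support(TRANSACTIONS, s)
    for k in range(len(ITEMS) + 1)
    for s in combinations(ITEMS, k)
}

if __name__ == "__main__":
    # Query: a = 1 AND b = 1 AND c = 0
    print("independence:", independence_estimate(TRANSACTIONS, "ab", "c"))
    print("inclusion-exclusion:", inclusion_exclusion(FREQ, "ab", "c"))
```

On this toy data the query "a and b present, c absent" matches 2 of 8 transactions. Inclusion-exclusion recovers that value as f(ab) - f(abc) = 0.375 - 0.125 = 0.25, while the independence estimate is 0.625 × 0.5 × 0.5 ≈ 0.156. When only itemsets above a frequency threshold are stored, some required terms may be missing, which is one source of the approximation error the abstract refers to.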