Issue No. 06 - November/December (2003 vol. 15)
Padhraic Smyth, IEEE
<p><b>Abstract</b>: We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus on probabilistic model-based approaches and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model. In the maximum entropy model, we treat itemsets as constraints on the distribution of the query variables and use the maximum entropy principle to build a joint probability model for the query attributes online. In the inclusion-exclusion model, itemsets and their frequencies are stored in a data structure, called an ADtree, that supports an efficient implementation of the inclusion-exclusion principle for answering the query. We empirically compare these two itemset-based models to direct querying of the original data, to querying of samples of the original data, and to other probabilistic models such as the independence model, the Chow-Liu tree model, and the Bernoulli mixture model. These models can handle high-dimensional data (hundreds or thousands of attributes), whereas most other work on this topic has focused on relatively low-dimensional OLAP problems. Experimental results on both simulated and real-world transaction data sets illustrate fundamental trade-offs between approximation error, model complexity, and the online time required to compute a query answer.</p>
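As a rough illustration of the query types named in the abstract, the sketch below answers one conjunctive query on a toy binary data set in three ways: by direct counting, with the baseline independence model, and with the inclusion-exclusion principle over itemset frequencies. It is a minimal sketch under stated assumptions, not the paper's implementation: the data, the function names, and the query are hypothetical, and itemset frequencies are recomputed from the raw transactions rather than looked up in an ADtree of stored frequent itemsets.

```python
from itertools import combinations

# Hypothetical toy data: each transaction is the set of items it contains.
transactions = [
    {"a", "b"},
    {"a", "b", "c"},
    {"a"},
    {"b", "c"},
    {"a", "b"},
    {"c"},
    {"a", "b", "c"},
    {"b"},
]


def frequency(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def exact_query(present, absent):
    """Direct count: items in `present` are all 1, items in `absent` all 0."""
    pos, neg = set(present), set(absent)
    hits = sum(pos <= t and not (neg & t) for t in transactions)
    return hits / len(transactions)


def independence_query(present, absent):
    """Baseline: multiply single-item marginals as if items were independent."""
    p = 1.0
    for item in present:
        p *= frequency([item])
    for item in absent:
        p *= 1.0 - frequency([item])
    return p


def inclusion_exclusion_query(present, absent):
    """Alternating sum of itemset frequencies over all subsets of `absent`.

    Assumption: every needed itemset frequency is available. Here it is
    recomputed from the data; a stored collection of frequent itemsets
    would only contain those above a support threshold.
    """
    total = 0.0
    for k in range(len(absent) + 1):
        for subset in combinations(absent, k):
            total += (-1) ** k * frequency(list(present) + list(subset))
    return total


query = (["a", "b"], ["c"])  # P(a = 1, b = 1, c = 0)
print(exact_query(*query))
print(independence_query(*query))
print(inclusion_exclusion_query(*query))
```

On this toy data the direct count and the inclusion-exclusion sum both give 0.25, since the latter reduces to the frequency of {a, b} minus the frequency of {a, b, c}, while the independence model gives 0.234375. The number of terms in the alternating sum doubles with each negated attribute, which is one reason an efficient lookup structure for itemset frequencies matters.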
Binary transaction data, query approximation, probabilistic model, itemsets, ADTree, maximum entropy.
D. Pavlov, P. Smyth and H. Mannila, "Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data," in IEEE Transactions on Knowledge & Data Engineering, vol. 15, no. 6, pp. 1409-1421, 2003.