A Contingency Approach to Estimating Record Selectivities
June 1991 (vol. 17 no. 6)
pp. 544-552

An approach to estimating record selectivity, rooted in the theory of fitting a hierarchy of models in discrete data analysis, is presented. In contrast to parametric methods, this approach does not presuppose a distribution to which the actual data conform; it searches for one that fits the actual data. The approach uses parsimonious models wherever appropriate to minimize storage requirements without sacrificing accuracy. Two-dimensional cases are used as examples to illustrate the proposed method. It is demonstrated that identifying a good-fitting, parsimonious model can drastically reduce storage space and that implementing this technique requires little extra processing effort. The case of perfect or near-perfect association and the idea of keeping information about salient cells of a table are discussed. A strategy is proposed for reducing storage requirements when no good-fitting, parsimonious model is available. Hierarchical models for three-dimensional cases are presented, along with a description of the W. E. Deming and F. F. Stephan (1940) iterative proportional fitting algorithm, which fits hierarchical models of any dimension.
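As an illustration of the two-dimensional case described in the abstract, the following is a minimal sketch of iterative proportional fitting, not the paper's implementation. It assumes that only the two marginal frequency vectors of a pair of attributes are stored and that cell counts are reconstructed on demand. The function name `ipf_2d`, its parameters, and the sample totals are hypothetical and chosen for illustration only.

```python
def ipf_2d(row_totals, col_totals, seed=None, tol=1e-9, max_iter=100):
    """Fit a two-way table to the given marginal totals by alternately
    rescaling rows and columns (iterative proportional fitting).

    With no seed, the start table is uniform and the result is the
    independence model: cell[i][j] = row_totals[i] * col_totals[j] / N.
    A seed table carries its association pattern into the fitted table.
    """
    n_rows, n_cols = len(row_totals), len(col_totals)
    if seed is None:
        table = [[1.0] * n_cols for _ in range(n_rows)]
    else:
        table = [[float(v) for v in row] for row in seed]

    for _ in range(max_iter):
        # Row step: scale each row so it sums to its target total.
        for i in range(n_rows):
            s = sum(table[i])
            if s > 0:
                factor = row_totals[i] / s
                table[i] = [v * factor for v in table[i]]
        # Column step: scale each column so it sums to its target total.
        for j in range(n_cols):
            s = sum(table[i][j] for i in range(n_rows))
            if s > 0:
                factor = col_totals[j] / s
                for i in range(n_rows):
                    table[i][j] *= factor
        # Columns now match; stop once the rows still match as well.
        err = max(abs(sum(table[i]) - row_totals[i]) for i in range(n_rows))
        if err < tol:
            break
    return table


# Hypothetical relation of 100 records with two binary attributes.
row_totals = [60, 40]   # counts for attribute A = a0, a1
col_totals = [30, 70]   # counts for attribute B = b0, b1

fitted = ipf_2d(row_totals, col_totals)
# Estimated selectivity of the predicate (A = a0 AND B = b0),
# approximately 0.18 under the independence model.
selectivity = fitted[0][0] / sum(row_totals)
```

Under these assumptions, only the four marginal counts need to be kept instead of the full table, which is the storage saving a parsimonious model offers when it fits the data well. Passing a `seed` table that encodes a known association pattern gives a fitted table that matches the same marginals while preserving that pattern.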

[1] A. E. Beaton, "The influence of education and ability on salary and attitudes," in Education, Income, and Human Behavior, F. T. Juster, Ed. New York: McGraw-Hill, 1975, pp. 365-396.
[2] Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland, Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press, 1975.
[3] A. Cardenas, "Analysis and performance of inverted data-base structures," Commun. ACM, vol. 18, no. 5, pp. 253-263, 1975.
[4] S. Christodoulakis, "Estimating selectivities in data bases," Ph.D. dissert., Computer Syst. Res. Group, Univ. Toronto, Toronto, ON, Can., 1981.
[5] S. Christodoulakis, "Estimating record selectivities," Inform. Syst., vol. 8, no. 2, pp. 105-115, 1983.
[6] W. E. Deming and F. F. Stephan, "On a least-square adjustment of a sampled frequency table when the expected marginal totals are known," Ann. Math. Stat., vol. 11, pp. 427-444, 1940.
[7] W. J. Dixon, BMDP Statistical Software. Berkeley: Univ. California Press, 1981.
[8] R. E. Fay and L. A. Goodman, ECTA Program: Description for Users. Chicago: Univ. Chicago Press, 1975.
[9] S. E. Fienberg, "The analysis of multidimensional tables," Ecology, vol. 51, pp. 419-433, 1970.
[10] D. Friedlander, "A technique for estimating a contingency table, given the marginal totals and some supplementary data," J. Roy. Stat. Soc. Ser. A, vol. 124, pp. 412-420, 1961.
[11] S. J. Haberman, "Log-linear fit for contingency tables (Algorithm AS 51)," Appl. Stat., vol. 21, pp. 218-225, 1972.
[12] S. J. Haberman, The Analysis of Frequency Data. Chicago: Univ. Chicago Press, 1974.
[13] C. T. Ireland and S. Kullback, "Contingency tables with given marginals," Biometrika, vol. 55, pp. 179-188, 1968.
[14] M. Jarke and J. Koch, "Query optimization in database systems," ACM Comput. Surveys, vol. 16, no. 2, June 1984.
[15] N. Kamel and R. King, "A model of data distribution based on texture analysis," in Proc. ACM SIGMOD Conf. (Austin, TX), May 1985, pp. 319-325.
[16] J. J. Kennedy, Analyzing Qualitative Data. New York: Praeger, 1983.
[17] D. Knoke and P. J. Burke, Log-Linear Models. Beverly Hills, CA: Sage, 1982.
[18] R. Kooi, "The optimization of queries in relational databases," Ph.D. dissert., Case Western Univ., Cleveland, OH, 1980.
[19] Law Enforcement Assistance Administration, "San Jose methods test of known crime victims," Washington, DC, U.S. Govt. Printing Office, Stat. Tech. Rep. No. 1, 1972.
[20] M. V. Mannino, P. C. Chu, and T. Sager, "Statistical profile estimation in database systems," ACM Comput. Surveys, vol. 20, no. 3, pp. 191-221, Sept. 1988.
[21] T. H. Merrett and E. Otoo, "Distribution models of relations," in Proc. 5th Int. Conf. on Very Large Data Bases (Rio de Janeiro, Brazil), Oct. 1979, pp. 418-425.
[22] M. Muralikrishna and D. J. DeWitt, "Equi-depth histograms for estimating selectivity factors for multi-dimensional queries," in Proc. ACM SIGMOD Conf. (Chicago, IL), June 1988, pp. 28-36.
[23] B. Muthuswamy and G. Kerschberg, "A DDSM for relational query optimization," in Proc. ACM Ann. Conf. (Denver, CO), Oct. 1985, pp. 439-448.
[24] G. Piatetsky-Shapiro and C. Connell, "Accurate estimation of the number of tuples satisfying a condition," in Proc. ACM SIGMOD Conf. (Boston, MA), June 1984, pp. 256-276.
[25] H. T. Reynolds, Analysis of Nominal Data, 2nd ed. Beverly Hills, CA: Sage, 1984.
[26] P. Selinger, et al., "Access path selection in a relational data base system," in Proc. 1979 ACM-SIGMOD Int. Conf. Management of Data, Boston, MA, June 1979.

Index Terms:
contingency approach; record selectivities; discrete data analysis; parsimonious models; storage requirement; storage space; near-perfect association; three-dimensional cases; iterative proportional fitting algorithm; hierarchical models; information retrieval systems; relational databases; storage management
P.-C. Chu, "A Contingency Approach to Estimating Record Selectivities," IEEE Transactions on Software Engineering, vol. 17, no. 6, pp. 544-552, June 1991, doi:10.1109/32.87280