Sample-Based Quality Estimation of Query Results in Relational Database Environments
May 2006 (vol. 18 no. 5)
pp. 639-650
The quality of data in relational databases is often uncertain, and the relationship between the quality of the underlying base tables and the quality of the query results that could be produced from them, a type of information product (IP), has not been fully investigated. This paper provides a basis for the systematic analysis of the quality of such IPs. This research uses the relational algebra framework to develop estimates for the quality of query results based on the quality estimates of samples taken from the base tables. Our procedure requires an initial sample from each of the base tables; these samples are then used for all possible IPs. Each specific query governs the quality assessment of the relevant samples. By using the same samples repeatedly, our approach is relatively cost-effective. We introduce the Reference-Table Procedure, which can be used for quality estimation in general. In addition, for each of the basic algebraic operators, we discuss simpler procedures that may be applicable. Special attention is devoted to the Join operation. We examine various relevant statistical issues, including how to deal with the impact on quality of missing rows in base tables. Finally, we address several implementation issues related to sampling.
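The general idea of reusing one base-table sample across queries can be illustrated with a minimal Python sketch. It is not the paper's Reference-Table Procedure: it covers only a selection query, and it assumes that each sampled row has already been assessed as acceptable or not. The function names, the `ok` flag, and the example table are all hypothetical.

```python
import random


def sample_table(rows, n, seed=0):
    """Draw a simple random sample of n rows without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


def estimate_selection_quality(sample, predicate, is_acceptable):
    """Estimate the fraction of acceptable rows in a selection result.

    Only sampled rows that satisfy the query predicate are counted, so
    the same sample can be reused for different selection queries.
    Returns None when no sampled row satisfies the predicate.
    """
    relevant = [row for row in sample if predicate(row)]
    if not relevant:
        return None
    acceptable = sum(1 for row in relevant if is_acceptable(row))
    return acceptable / len(relevant)


# Hypothetical base table: the "ok" flag stands in for the outcome of a
# manual quality assessment of each row (an assumption of this sketch).
orders = [
    {"id": i, "region": "east" if i % 2 == 0 else "west", "ok": i % 5 != 0}
    for i in range(1000)
]

order_sample = sample_table(orders, 100)

# The same sample serves two different selection queries.
east_quality = estimate_selection_quality(
    order_sample, lambda r: r["region"] == "east", lambda r: r["ok"]
)
west_quality = estimate_selection_quality(
    order_sample, lambda r: r["region"] == "west", lambda r: r["ok"]
)
print("east:", east_quality, "west:", west_quality)
```

The sample is drawn once and then filtered by each query's predicate, which mirrors the cost argument in the abstract: the expensive step, assessing the quality of sampled rows, is not repeated per query.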

Index Terms:
Data quality, database sampling, information product, relational algebra, quality control.
Donald P. Ballou, InduShobha N. Chengalur-Smith, Richard Y. Wang, "Sample-Based Quality Estimation of Query Results in Relational Database Environments," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 5, pp. 639-650, May 2006, doi:10.1109/TKDE.2006.83