Issue No.08 - August (2011 vol.23)
pp: 1138-1153
Kuang Chen , University of California, Berkeley, Berkeley
Harr Chen , Massachusetts Institute of Technology, Cambridge
Neil Conway , University of California, Berkeley, Berkeley
Joseph M. Hellerstein , University of California, Berkeley, Berkeley
Tapan S. Parikh , University of California, Berkeley, Berkeley
ABSTRACT
Data quality is a critical problem in modern databases. Data-entry forms present the first and arguably best opportunity for detecting and mitigating errors, but there has been little research into automatic methods for improving data quality at entry time. In this paper, we propose Usher, an end-to-end system for form design, entry, and data quality assurance. Using previous form submissions, Usher learns a probabilistic model over the questions of the form. Usher then applies this model at every step of the data-entry process to improve data quality. Before entry, it induces a form layout that captures the most important data values of a form instance as quickly as possible and reduces the complexity of error-prone questions. During entry, it dynamically adapts the form to the values being entered by providing real-time interface feedback, reasking questions with dubious responses, and simplifying questions by reformulating them. After entry, it revisits question responses that it deems likely to have been entered incorrectly by reasking the question or a reformulation thereof. We evaluate these components of Usher using two real-world data sets. Our results demonstrate that Usher can improve data quality considerably at a reduced cost when compared to current practice.
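As a rough illustration of the "reasking questions with dubious responses" idea, the sketch below estimates how likely an entered value is given another field's value, using counts from previous submissions, and flags the response when that estimate falls below a cutoff. It is a minimal stand-in under stated assumptions, not Usher's actual model: the abstract does not specify the model's form, so a smoothed conditional frequency table over two fields is used here. The field names, the smoothing constant `alpha`, and the `threshold` are all hypothetical.

```python
from collections import Counter


def fit_conditional(submissions, given, target):
    """Count how often each target value co-occurs with each given value."""
    pair_counts = Counter((s[given], s[target]) for s in submissions)
    given_counts = Counter(s[given] for s in submissions)
    target_values = {s[target] for s in submissions}
    return pair_counts, given_counts, target_values


def response_probability(model, given_value, target_value, alpha=1.0):
    """Additively smoothed estimate of P(target = target_value | given = given_value)."""
    pair_counts, given_counts, target_values = model
    # One extra slot stands for target values never seen in the history.
    k = len(target_values) + 1
    numerator = pair_counts[(given_value, target_value)] + alpha
    denominator = given_counts[given_value] + alpha * k
    return numerator / denominator


def should_reask(model, given_value, target_value, threshold=0.2):
    """Flag a response as dubious when its estimated probability is low."""
    return response_probability(model, given_value, target_value) < threshold


# Hypothetical history of previous form submissions (illustrative data only).
history = (
    [{"age_group": "child", "marital_status": "single"}] * 8
    + [{"age_group": "adult", "marital_status": "married"}] * 6
    + [{"age_group": "adult", "marital_status": "single"}] * 4
)
model = fit_conditional(history, "age_group", "marital_status")

print(should_reask(model, "child", "married"))  # unusual combination in this history
print(should_reask(model, "adult", "married"))  # common combination in this history
```

A model over all of a form's questions would condition on more than one field; the two-field table is used only to keep the example short.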
INDEX TERMS
Data quality, data entry, form design, adaptive form.
CITATION
Kuang Chen, Harr Chen, Neil Conway, Joseph M. Hellerstein, Tapan S. Parikh, "Usher: Improving Data Quality with Dynamic Forms", IEEE Transactions on Knowledge & Data Engineering, vol.23, no. 8, pp. 1138-1153, August 2011, doi:10.1109/TKDE.2011.31