A Decision-Theoretic Framework for Numerical Attribute Value Reconciliation
July 2012 (vol. 24 no. 7)
pp. 1153-1169
Zhengrui Jiang, Iowa State University, Ames
One of the major challenges of data integration is to resolve conflicting numerical attribute values caused by data heterogeneity. Existing approaches often ignore such inconsistencies or resolve them in an ad hoc manner. In this study, we propose a decision-theoretic framework that resolves numerical value conflicts systematically. The framework takes into consideration the consequences of incorrect numerical values and selects the value that minimizes the expected cost of errors across all data application problems under consideration. Experimental results show that significant savings can be achieved by adopting the proposed framework instead of ad hoc approaches.
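The idea of selecting the value that minimizes the expected cost of errors can be illustrated with a minimal sketch. It is not the paper's actual model: the function names (`expected_cost`, `reconcile`, `threshold_cost`), the belief probabilities, the threshold application, and the cost figures are all hypothetical, chosen only to show how the least costly value can differ from the most probable one.

```python
from typing import Callable, Dict, Iterable

# Hypothetical cost function type: cost(stored_value, true_value) -> cost of the error.
CostFn = Callable[[float, float], float]


def expected_cost(stored: float, belief: Dict[float, float], cost: CostFn) -> float:
    """Expected cost of storing `stored` when the true value follows `belief`.

    `belief` maps each possible true value to its (assumed) probability.
    """
    return sum(p * cost(stored, true) for true, p in belief.items())


def reconcile(candidates: Iterable[float], belief: Dict[float, float], cost: CostFn) -> float:
    """Pick the candidate value with the smallest expected cost of errors."""
    return min(candidates, key=lambda v: expected_cost(v, belief, cost))


def threshold_cost(stored: float, true: float, threshold: float = 50_000.0) -> float:
    """Illustrative application: act on a record when its value is >= threshold.

    Assumed costs (made up for this sketch):
      1.0 if we act but should not have (stored >= threshold, true < threshold)
      5.0 if we fail to act but should have (stored < threshold, true >= threshold)
    """
    if stored >= threshold and true < threshold:
        return 1.0
    if stored < threshold and true >= threshold:
        return 5.0
    return 0.0


# Two sources disagree on a numerical attribute; the probabilities are assumptions.
belief = {48_000.0: 0.6, 55_000.0: 0.4}
chosen = reconcile(belief.keys(), belief, threshold_cost)
print(chosen)
```

Under these assumed costs the sketch selects 55,000 even though 48,000 is the more probable value, because wrongly storing the lower value is five times as costly in this hypothetical application. An ad hoc rule such as "take the most likely value" would ignore that asymmetry.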

Index Terms:
Database integration, heterogeneous databases, data heterogeneity, numerical value conflicts, probabilistic databases, Type I, Type II, and misrepresentation errors.
Citation:
Zhengrui Jiang, "A Decision-Theoretic Framework for Numerical Attribute Value Reconciliation," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 7, pp. 1153-1169, July 2012, doi:10.1109/TKDE.2011.75