Issue No.07 - July (2012 vol.24)
pp: 1153-1169
Zhengrui Jiang, Iowa State University, Ames
ABSTRACT
One of the major challenges of data integration is to resolve conflicting numerical attribute values caused by data heterogeneity. In addressing this problem, existing approaches proposed in prior literature often ignore such data inconsistencies or resolve them in an ad hoc manner. In this study, we propose a decision-theoretical framework that resolves numerical value conflicts in a systematic manner. The framework takes into consideration the consequences of incorrect numerical values and selects the value that minimizes the expected cost of errors for all data application problems under consideration. Experimental results show that significant savings can be achieved by adopting the proposed framework instead of ad hoc approaches.
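The selection rule described in the abstract, choosing the value that minimizes the expected cost of errors, can be illustrated with a minimal sketch. This is not the paper's actual model: the function names (`expected_cost`, `reconcile`), the candidate values, the probabilities, and both cost functions are hypothetical, and the sketch assumes the true value is one of the conflicting candidates.

```python
from typing import Callable, Sequence


def expected_cost(choice: float,
                  candidates: Sequence[float],
                  probs: Sequence[float],
                  cost: Callable[[float, float], float]) -> float:
    """Expected cost of storing `choice` when the true value is
    candidates[i] with probability probs[i] (illustrative assumption)."""
    return sum(p * cost(choice, true) for true, p in zip(candidates, probs))


def reconcile(candidates: Sequence[float],
              probs: Sequence[float],
              cost: Callable[[float, float], float]) -> float:
    """Return the candidate value with the lowest expected error cost."""
    return min(candidates,
               key=lambda c: expected_cost(c, candidates, probs, cost))


# Hypothetical conflicting values for one numerical attribute, with
# assumed probabilities of each being the true value.
values = [40000, 50000, 90000]
probs = [0.4, 0.4, 0.2]


def symmetric(chosen: float, true: float) -> float:
    # Cost grows with the absolute error, regardless of direction.
    return abs(chosen - true)


def asymmetric(chosen: float, true: float) -> float:
    # Assumed application where understating the value is five times
    # as costly as overstating it.
    if chosen < true:
        return 5 * (true - chosen)
    return chosen - true


pick_symmetric = reconcile(values, probs, symmetric)
pick_asymmetric = reconcile(values, probs, asymmetric)
```

Under these assumed numbers the symmetric cost selects 50000, while the asymmetric cost selects 90000, which shows how the consequences of an incorrect value, and not only its likelihood, can change which value is kept.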
INDEX TERMS
Database integration, heterogeneous databases, data heterogeneity, numerical value conflicts, probabilistic databases, Type I, Type II, and misrepresentation errors.
CITATION
Zhengrui Jiang, "A Decision-Theoretic Framework for Numerical Attribute Value Reconciliation", IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 7, pp. 1153-1169, July 2012, doi:10.1109/TKDE.2011.75