Uninterpreted Schema Matching with Embedded Value Mapping under Opaque Column Names and Data Values
February 2010 (vol. 22 no. 2)
pp. 291-304
Anuj Jaiswal, Pennsylvania State University, University Park
David J. Miller, Pennsylvania State University, University Park
Prasenjit Mitra, Pennsylvania State University, University Park
Schema matching and value mapping across two heterogeneous information sources are critical tasks in applications involving data integration, data warehousing, and federation of databases. Before data can be integrated from multiple tables, the columns and the values appearing in the tables must be matched. The complexity of the problem grows quickly with the number of attributes (columns) to be matched and with the multiple semantics that data values can carry. Traditional research has tackled schema matching and value mapping independently. We propose a novel method that optimizes embedded value mappings to enhance schema matching in the presence of opaque data values and column names. In this approach, the fitness objective for matching a pair of attributes from two schemas depends on the value mapping function for each of the two attributes. Suitable fitness objectives include the Euclidean distance measure, which we use in our experimental study, as well as relative (cross) entropy. We propose a heuristic local descent optimization strategy that uses sorting and two-opt switching to jointly optimize value mappings and attribute matches. Our experiments show that the proposed technique outperforms earlier uninterpreted schema matching methods and should therefore be a useful addition to a suite of (semi)automated tools for resolving structural heterogeneity.
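
To make the idea of an embedded value mapping more concrete, the following is a minimal Python sketch under simplifying assumptions; it is not the authors' algorithm. It assumes that each column is summarized only by its value-frequency vector, that both tables have the same number of columns, and that the total cost is a sum of per-pair costs. Under those assumptions, the best one-to-one value mapping between two columns (in the Euclidean sense) is obtained by sorting both frequency vectors, and the attribute matching is then improved by pairwise (two-opt style) swaps until no swap lowers the cost. The names `sorted_freqs`, `column_distance`, and `two_opt_match` are hypothetical and chosen for illustration only.

```python
from collections import Counter
from itertools import combinations
import math


def sorted_freqs(column):
    """Relative frequencies of the distinct values, largest first.

    Sorting discards the (opaque) value labels and keeps only the shape
    of the distribution.
    """
    counts = Counter(column)
    n = len(column)
    return sorted((c / n for c in counts.values()), reverse=True)


def column_distance(col_a, col_b):
    """Euclidean distance between two columns under the best value mapping.

    Assumption: pairing the sorted frequency vectors position by position
    is the one-to-one value mapping that minimizes the Euclidean distance
    (rearrangement inequality). The shorter vector is padded with zeros.
    """
    fa, fb = sorted_freqs(col_a), sorted_freqs(col_b)
    size = max(len(fa), len(fb))
    fa += [0.0] * (size - len(fa))
    fb += [0.0] * (size - len(fb))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))


def two_opt_match(table_a, table_b):
    """Local descent over attribute matchings using pairwise swaps.

    Assumption: both tables (dicts of column name -> list of values)
    have the same number of columns.
    """
    names_a = list(table_a)
    match = list(table_b)  # match[i] is the column paired with names_a[i]
    cost = {
        (a, b): column_distance(table_a[a], table_b[b])
        for a in names_a
        for b in match
    }
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(names_a)), 2):
            current = cost[names_a[i], match[i]] + cost[names_a[j], match[j]]
            swapped = cost[names_a[i], match[j]] + cost[names_a[j], match[i]]
            if swapped < current - 1e-12:  # accept only strict improvements
                match[i], match[j] = match[j], match[i]
                improved = True
    return dict(zip(names_a, match))


if __name__ == "__main__":
    table_a = {"a1": ["x", "x", "x", "y"], "a2": ["p", "q", "r", "s"]}
    table_b = {"b1": [1, 2, 3, 4], "b2": [7, 7, 8, 7]}
    print(two_opt_match(table_a, table_b))
```

In this toy example the column names and values share nothing across the two tables, yet the skewed column `a1` is paired with the skewed column `b2` because their sorted frequency vectors coincide. With a purely additive per-pair cost like this one, an exact assignment solver would also work; swap-based local descent is shown here only because it mirrors the strategy named in the abstract.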

Index Terms:
Schema matching, opaque conditions, embedded schema matching with value mapping.
Anuj Jaiswal, David J. Miller, Prasenjit Mitra, "Uninterpreted Schema Matching with Embedded Value Mapping under Opaque Column Names and Data Values," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 2, pp. 291-304, Feb. 2010, doi:10.1109/TKDE.2009.69