Issue No.10 - October (2008 vol.20)
pp: 1393-1407
Jaewoo Kang , Korea University, Seoul
Jeffrey F. Naughton , University of Wisconsin, Madison
ABSTRACT
Schema matching is one of the key challenges in information integration. It is a labor-intensive and time-consuming process. To alleviate the problem, many automated solutions have been proposed. Most existing solutions rely mainly on the textual similarity of the data to be matched. However, there are instances of the schema matching problem to which such solutions do not apply at all. These instances typically arise when the column names in the schemas and the data in the columns are opaque or very difficult to interpret. In our previous work [36], we proposed a two-step technique to address this problem. In the first step, we measure the dependencies between attributes within each table using an information-theoretic measure and construct a dependency graph for each table that captures the dependencies among its attributes. In the second step, we find matching node pairs across the dependency graphs by running a graph matching algorithm. In our previous work, we experimentally validated the accuracy of the approach. One remaining challenge is the computational complexity of the graph matching problem in the second step. In this paper, we extend the previous work by improving the second step of the algorithm, incorporating efficient approximation algorithms into the framework.
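The two-step technique described in the abstract can be illustrated with a minimal sketch. The abstract names only "an information-theoretic measure" and "a graph matching algorithm", so everything concrete here is an assumption made for illustration: mutual information as the dependency measure, a squared-difference cost between edge weights, and an exhaustive search over column permutations. The exhaustive search is exactly the expensive part that the paper proposes to replace with approximation algorithms, and it is only practical for a handful of columns. All function names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y); assumed here as the dependency measure."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def dependency_graph(columns):
    """Step 1: weighted adjacency matrix over the attributes of one table.

    Entry [i][j] is the mutual information between columns i and j;
    the diagonal therefore holds each column's entropy.
    """
    k = len(columns)
    return [[mutual_information(columns[i], columns[j]) for j in range(k)]
            for i in range(k)]


def match_graphs(g1, g2):
    """Step 2: exhaustive graph matching (illustrative, factorial time).

    Returns the permutation perm minimizing the squared difference between
    g1[i][j] and g2[perm[i]][perm[j]], together with that cost.
    """
    k = len(g1)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(k)):
        cost = sum((g1[i][j] - g2[perm[i]][perm[j]]) ** 2
                   for i in range(k) for j in range(k))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


def opaque(column, tag):
    """Re-encode values as uninterpretable codes, keeping equality only."""
    return [f"{tag}{(v * 7) % 11}" for v in column]


# Table A: three columns with different entropies (1, 2, and 3 bits).
table_a = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 2, 2, 3, 3],
    [0, 1, 2, 3, 4, 5, 6, 7],
]
# Table B: the same data with columns reordered and values made opaque.
table_b = [
    opaque(table_a[2], "q"),
    opaque(table_a[0], "r"),
    opaque(table_a[1], "s"),
]

perm, cost = match_graphs(dependency_graph(table_a), dependency_graph(table_b))
print(perm, cost)  # column i of table A is matched to column perm[i] of table B
```

In this toy example the column values of the second table share nothing textual with the first, yet the dependency structure alone recovers the correspondence, because mutual information depends only on which rows hold equal values. Real tables would not be exact re-encodings of each other, so the minimum cost would be nonzero and the match only approximate.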
INDEX TERMS
Database integration, Schema and subschema
CITATION
Jaewoo Kang, Jeffrey F. Naughton, "Schema Matching Using Interattribute Dependencies", IEEE Transactions on Knowledge & Data Engineering, vol.20, no. 10, pp. 1393-1407, October 2008, doi:10.1109/TKDE.2008.100
REFERENCES
[1] Operations Research: Deterministic Optimization Models. Prentice Hall, 1995.
[2] J. Albert, Y.E. Ioannidis, and R. Ramakrishnan, “Conjunctive Query Equivalence of Keyed Relational Schemas,” Proc. 16th ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems (PODS '97), pp. 44-50, 1997.
[3] H.A. Almohamad and S.O. Duffuaa, “A Linear Programming Approach for the Weighted Graph Matching Problem,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 5, pp. 522-525, May 1993.
[4] E.D. Andersen and K.D. Andersen, “The Mosek Interior Point Optimizer for Linear Programming: An Implementation of the Homogeneous Algorithm,” Proc. High Performance Optimization Techniques (HPOPT), 1997.
[5] K.M. Anstreicher and N.W. Brixius, “A New Bound for the Quadratic Assignment Problem Based on Convex Quadratic Programming,” Math. Programming, vol. 89, pp. 341-357, 2001.
[6] P. Atzeni, G. Ausiello, C. Batini, and M. Moscarini, “Inclusion and Equivalence between Relational Database Schemata,” Theoretical Computer Science, vol. 19, pp. 267-285, 1982.
[7] P. Atzeni, P. Cappellari, and P.A. Bernstein, “Modelgen: Model Independent Schema Translation,” Proc. 21st Int'l Conf. Data Eng. (ICDE '05), pp. 1111-1112, 2005.
[8] C. Beeri, A.O. Mendelzon, Y. Sagiv, and J.D. Ullman, “Equivalence of Relational Database Schemes,” SIAM J. Computing, vol. 10, no. 2, pp. 352-370, 1981.
[9] J. Berlin and A. Motro, “Database Schema Matching Using Machine Learning with Feature Selection,” Proc. 14th Int'l Conf. Advanced Information Systems Eng. (CAiSE '02), pp. 452-466, 2002.
[10] P.A. Bernstein, T.J. Green, S. Melnik, and A. Nash, “Implementing Mapping Composition,” Proc. 32nd Int'l Conf. Very Large Data Base (VLDB '06), pp. 55-66, 2006.
[11] P.A. Bernstein, A.Y. Halevy, and R. Pottinger, “A Vision of Management of Complex Models,” SIGMOD Record, vol. 29, no. 4, 2000.
[12] P.A. Bernstein and S. Melnik, “Model Management 2.0: Manipulating Richer Mappings,” Proc. ACM SIGMOD '07, pp. 1-12, 2007.
[13] P.A. Bernstein, S. Melnik, and J.E. Churchill, “Incremental Schema Matching,” Proc. 32nd Int'l Conf. Very Large Data Base (VLDB '06), pp. 1167-1170, 2006.
[14] R.E. Burkard, E. Cela, P.M. Pardalos, and L.S. Pitsoulis, The Quadratic Assignment Problem. In Handbook of Combinatorial Optimization, vol. 2. Kluwer Academic Publishers, 1998.
[15] S. Castano, V. Antonellis, and S. Vimercati, “Global Viewing of Heterogeneous Data Sources,” IEEE Trans. Knowledge and Data Eng., vol. 13, no. 2, pp. 277-297, Mar./Apr. 2001.
[16] T.M. Cover and J.A. Thomas, Elements of Information Theory. John Wiley & Sons, 1991.
[17] R. Dhamankar, Y. Lee, A. Doan, A.Y. Halevy, and P. Domingos, “iMAP: Discovering Complex Mappings between Database Schemas,” Proc. ACM SIGMOD, 2004.
[18] A. Doan, P. Domingos, and A.Y. Halevy, “Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach,” Proc. ACM SIGMOD, 2001.
[19] A. Doan, P. Domingos, and A.Y. Levy, “Learning Source Description for Data Integration,” Proc. Third Int'l Workshop Web and Databases (WebDB '00), pp. 81-86, 2000.
[20] C. Domshlak, A. Gal, and H. Roitman, “Rank Aggregation for Automatic Schema Matching,” IEEE Trans. Knowledge and Data Eng., vol. 19, no. 4, pp. 538-553, Apr. 2007.
[21] R. Fagin, P.G. Kolaitis, R.J. Miller, and L. Popa, “Data Exchange: Semantics and Query Answering,” Theoretical Computer Science, vol. 336, no. 1, pp. 89-124, 2005.
[22] R. Fagin, P.G. Kolaitis, L. Popa, and W.C. Tan, “Composing Schema Mappings: Second-Order Dependencies to the Rescue,” ACM Trans. Database Systems, vol. 30, no. 4, pp. 994-1055, 2005.
[23] R. Fagin, P.G. Kolaitis, L. Popa, and W.C. Tan, “Quasi-Inverses of Schema Mappings,” Proc. 26th ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems (PODS '07), pp. 123-132, 2007.
[24] N. Friedman, I. Nachman, and D. Peer, Learning Bayesian Network Structure from Massive Datasets: The “Sparse Candidate” Algorithm, pp. 206-215, 1999.
[25] M.R. Garey and D.S. Johnson, Computers and Intractability. A Guide to the Theory of NP-Completeness. Freeman, 1979.
[26] L. Getoor, B. Taskar, and D. Koller, “Selectivity Estimation Using Probabilistic Models,” Proc. ACM SIGMOD, 2001.
[27] S. Gold and A. Rangarajan, “A Graduated Assignment Algorithm for Graph Matching,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 4, pp. 377-388, Apr. 1996.
[28] L.M. Haas, M.A. Hernández, H. Ho, L. Popa, and M. Roth, “Clio Grows Up: From Research Prototype to Industrial Tool,” Proc. ACM SIGMOD '05, pp. 805-810, 2005.
[29] B. He and K.C.-C. Chang, “Making Holistic Schema Matching Robust: An Ensemble Approach,” Proc. 11th ACM SIGKDD Int'l Conf. Knowledge Discovery in Data Mining (KDD '05), pp. 429-438, 2005.
[30] B. He, K.C.-C. Chang, and J. Han, “Discovering Complex Matchings Across Web Query Interfaces: A Correlation Mining Approach,” Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '04), pp. 148-157, 2004.
[31] H. Mannila and K.-J. Räihä, “Dependency Inference,” Proc. 13th Int'l Conf. Very Large Data Base (VLDB '87), pp. 155-158, 1987.
[32] M.A. Hernandez, R.J. Miller, and L.M. Haas, “Clio: A Semi-Automatic Tool for Schema Mapping,” Proc. ACM SIGMOD, 2001.
[33] Y. Huhtala, J. Karkkainen, P. Porkka, and H. Toivonen, “Efficient Discovery of Functional and Approximate Dependencies Using Partitions,” Proc. 14th Int'l Conf. Data Eng. (ICDE), 1998.
[34] R. Hull, “Relative Information Capacity of Simple Relational Database Schemata,” SIAM J. Computing, vol. 15, no. 3, pp. 856-886, 1986.
[35] S. Ishii and M.-A. Sato, “Doubly Constrained Network for Combinatorial Optimization,” Neurocomputing, vol. 43, nos. 1-4, pp. 239-257, 2002.
[36] J. Kang and J.F. Naughton, “On Schema Matching with Opaque Column Names and Data Values,” Proc. ACM SIGMOD '03, June 2003.
[37] J. Kivinen and H. Mannila, “Approximate Inference of Functional Dependencies from Relations,” Theoretical Computer Science, vol. 149, no. 1, pp. 129-149, 1995.
[38] W.-S. Li and C. Clifton, “Semantic Integration in Heterogeneous Databases Using Neural Networks,” Proc. 20th Int'l Conf. Very Large Data Base (VLDB '94), pp. 1-12, 1994.
[39] W.-S. Li and C. Clifton, “SEMINT: A Tool for Identifying Attribute Correspondences in Heterogeneous Databases Using Neural Networks,” J. Data and Knowledge Eng., vol. 33, no. 1, Dec. 2000.
[40] J. Madhavan, P.A. Bernstein, and E. Rahm, “Generic Schema Matching with Cupid,” Proc. 27th Int'l Conf. Very Large Data Base (VLDB), 2001.
[41] S. Melnik, H. Garcia-Molina, and E. Rahm, “Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching,” Proc. 18th Int'l Conf. Data Eng. (ICDE), 2002.
[42] S. Melnik, A. Adya, and P.A. Bernstein, “Compiling Mappings to Bridge Applications and Databases,” Proc. ACM SIGMOD '07, pp. 461-472, 2007.
[43] R.J. Miller, L.M. Haas, and M.A. Hernandez, “Schema Mapping as Query Discovery,” Proc. 26th Int'l Conf. Very Large Data Base (VLDB '00), pp. 77-88, 2000.
[44] R.J. Miller, Y.E. Ioannidis, and R. Ramakrishnan, “The Use of Information Capacity in Schema Integration and Translation,” Proc. 19th Int'l Conf. Very Large Data Base (VLDB '93), pp. 120-133, 1993.
[45] R.J. Miller, Y.E. Ioannidis, and R. Ramakrishnan, “Schema Equivalence in Heterogeneous Systems: Bridging Theory and Practice,” Proc. Fourth Int'l Conf. Extending Database Technology (EDBT), 1994.
[46] T. Milo and S. Zohar, “Using Schema Matching to Simplify Heterogeneous Data Translation,” Proc. 24th Int'l Conf. Very Large Data Base (VLDB), 1998.
[47] E. Rahm and P.A. Bernstein, “On Matching Schemas Automatically,” The VLDB J., vol. 10, no. 4, Dec. 2001.
[48] E. Rahm and P.A. Bernstein, “A Survey of Approaches to Automatic Schema Matching,” The VLDB J., vol. 10, no. 4, 2001.
[49] J. Rissanen, “On Equivalences of Database Schemes,” Proc. First ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems (PODS '82), pp. 23-26, 1982.
[50] K. Rose, “Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems,” Proc. IEEE, vol. 86, pp. 2210-2239, 1998.
[51] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.
[52] C. Schellewald, S. Roth, and C. Schnorr, “Evaluation of Convex Optimization Techniques for the Weighted Graph-Matching Problem in Computer Vision,” Proc. 23rd DAGM Symp. Pattern Recognition (DAGM '01), pp. 361-368, 2001.
[53] P. Shvaiko and J. Euzenat, A Survey of Schema-Based Matching Approaches, pp. 146-171, 2005.
[54] S. Umeyama, “An Eigendecomposition Approach to Weighted Graph Matching Problems,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 10, no. 5, pp. 695-703, Sept. 1988.
[55] S.J. Wright, Primal-Dual Interior-Point Methods, SIAM, 1997.
[56] L.-L. Yan, R.J. Miller, L.M. Haas, and R. Fagin, “Data-Driven Understanding and Refinement of Schema Mappings,” Proc. ACM SIGMOD, 2000.