An Exploratory Study of Database Integration Processes
January 2008 (vol. 20 no. 1)
pp. 99-115
One of the central problems of database integration is schema matching, the identification of similar data elements in two or more databases or other data sources. Existing definitions of "similarity" in this context vary greatly. As a result, schema matching has given rise to a large number of heuristics and software tools. However, the empirical understanding of this process in humans is very limited, so little guidance can be offered for the further development of heuristics and tools. This paper presents an exploratory study of the similarity judgement process in humans, employing a process tracing methodology. The similarity judgements of twelve data integration professionals on a range of integration problems are recorded and analyzed. Implications for future empirical and applied research in this area are discussed.

Index Terms:
Database Management, Database integration
Joerg Evermann, "An Exploratory Study of Database Integration Processes," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 1, pp. 99-115, Jan. 2008, doi:10.1109/TKDE.2007.190675