A Distance-Based Approach to Entity Reconciliation in Heterogeneous Databases
May/June 2002 (vol. 14 no. 3)
pp. 567-582

Abstract—In modern organizations, decision makers must often be able to quickly access information from diverse sources in order to make timely decisions. A critical problem facing many such organizations is the inability to easily reconcile the information contained in heterogeneous data sources. To overcome this limitation, an organization must resolve several types of heterogeneity problems that may exist across different sources. In this paper, we examine one such problem called the entity heterogeneity problem, which arises when the same real-world entity type is represented using different identifiers in different applications. A decision-theoretic model to resolve the problem is proposed. Our model uses a distance measure to express the similarity between two entity instances. We have implemented the model and tested it on real-world data. The results indicate that the model performs quite well in terms of its ability to predict whether two entity instances should be matched or not. The model is shown to be computationally efficient. It also scales well to large relations from the perspective of the accuracy of prediction. Overall, the test results imply that this is certainly a viable approach in practical situations.
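
As a rough illustration of what a distance-based match decision can look like, the sketch below scores two records by a weighted average of per-attribute distances and declares a match when the score falls under a threshold. This is not the paper's decision-theoretic model. The attribute names, the weights, the threshold value, and the use of Python's `difflib.SequenceMatcher` as the string distance are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def attribute_distance(a: str, b: str) -> float:
    """Distance in [0, 1] between two attribute values; 0 means identical.

    Assumption: a case-insensitive similarity ratio stands in for whatever
    attribute-level distance a real system would use.
    """
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_distance(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted average of attribute distances over the attributes in `weights`."""
    total_weight = sum(weights.values())
    weighted_sum = sum(
        w * attribute_distance(r1[attr], r2[attr]) for attr, w in weights.items()
    )
    return weighted_sum / total_weight


def should_match(r1: dict, r2: dict, weights: dict, threshold: float = 0.25) -> bool:
    """Declare a match when the overall distance is at or below the threshold.

    Assumption: the threshold is hand-picked here; it is not derived from
    any cost or probability model.
    """
    return record_distance(r1, r2, weights) <= threshold


if __name__ == "__main__":
    # Hypothetical records from two sources describing possibly the same customer.
    source_a = {"name": "Jon A. Smith", "city": "Seattle"}
    source_b = {"name": "John Smith", "city": "Seattle"}
    weights = {"name": 0.7, "city": 0.3}
    print(record_distance(source_a, source_b, weights))
    print(should_match(source_a, source_b, weights))
```

In this sketch the weights express how much each attribute contributes to the overall distance, and the threshold controls the trade-off between false matches and missed matches.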

Index Terms:
Heterogeneous databases, entity heterogeneity, entity matching, distance measure
D. Dey, S. Sarkar, P. De, "A Distance-Based Approach to Entity Reconciliation in Heterogeneous Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 3, pp. 567-582, May-June 2002, doi:10.1109/TKDE.2002.1000343