Issue No. 03, March 2011 (vol. 23)
pp. 373-387
Debabrata Dey, University of Washington, Seattle
Vijay S. Mookerjee, University of Texas at Dallas, Richardson
Dengpan Liu, University of Alabama in Huntsville, Huntsville
ABSTRACT
The need to consolidate the information contained in heterogeneous data sources has been widely documented in recent years. In order to accomplish this goal, an organization must resolve several types of heterogeneity problems, especially the entity heterogeneity problem that arises when the same real-world entity type is represented using different identifiers in different data sources. Statistical record linkage techniques could be used for resolving this problem. However, the use of such techniques for online record linkage could pose a tremendous communication bottleneck in a distributed environment (where entity heterogeneity problems are often encountered). In order to resolve this issue, we develop a matching tree, similar to a decision tree, and use it to propose techniques that reduce the communication overhead significantly, while providing matching decisions that are guaranteed to be the same as those obtained using the conventional linkage technique. These techniques have been implemented, and experiments with real-world and synthetic databases show significant reduction in communication overhead.
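The idea of reaching the same decision while transferring fewer attributes can be illustrated with a minimal sketch. It is not the matching tree developed in the paper. It assumes a simple weighted-sum linkage rule with a single threshold, and every name in it (`full_decision`, `sequential_decision`, `fetch_remote`, the weights, and the threshold) is hypothetical. Remote attribute values are requested one at a time, and fetching stops as soon as the attributes not yet seen can no longer change the outcome.

```python
from typing import Callable, Sequence, Tuple


def full_decision(local: Sequence[str], remote: Sequence[str],
                  agree_w: Sequence[int], disagree_w: Sequence[int],
                  threshold: int) -> bool:
    """Conventional rule (assumed): compare every attribute, sum the weights."""
    score = sum(a if l == r else d
                for l, r, a, d in zip(local, remote, agree_w, disagree_w))
    return score >= threshold


def sequential_decision(local: Sequence[str],
                        fetch_remote: Callable[[int], str],
                        agree_w: Sequence[int], disagree_w: Sequence[int],
                        threshold: int) -> Tuple[bool, int]:
    """Fetch remote attributes one by one and stop once the decision is fixed.

    Returns (is_match, number_of_attributes_fetched). Integer weights are
    an assumption that keeps the sums exact.
    """
    n = len(local)
    # best_rest[k] / worst_rest[k]: largest / smallest total weight that
    # attributes k..n-1 could still contribute.
    best_rest = [0] * (n + 1)
    worst_rest = [0] * (n + 1)
    for k in range(n - 1, -1, -1):
        best_rest[k] = best_rest[k + 1] + max(agree_w[k], disagree_w[k])
        worst_rest[k] = worst_rest[k + 1] + min(agree_w[k], disagree_w[k])

    score = 0
    for k in range(n):
        if score + worst_rest[k] >= threshold:
            return True, k    # a match even if everything else disagrees
        if score + best_rest[k] < threshold:
            return False, k   # a non-match even if everything else agrees
        value = fetch_remote(k)  # the only step that costs communication
        score += agree_w[k] if local[k] == value else disagree_w[k]
    return score >= threshold, n
```

Because the early exits fire only when the bounds already settle the comparison against the threshold, the result agrees with `full_decision` for any input. With illustrative weights `[6, 4, 2, 1]` for agreement, `[-5, -3, -2, -1]` for disagreement, and a threshold of 3, a fully agreeing record pair is declared a match after fetching only two of the four attributes.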
INDEX TERMS
Record linkage, entity matching, sequential decision making, decision tree, data heterogeneity.
CITATION
Debabrata Dey, Vijay S. Mookerjee, Dengpan Liu, "Efficient Techniques for Online Record Linkage", IEEE Transactions on Knowledge & Data Engineering, vol. 23, no. 3, pp. 373-387, March 2011, doi:10.1109/TKDE.2010.134