Issue No.06 - June (2012 vol.24)
pp: 975-987
Jie Tang, Tsinghua University, Beijing
A.C.M. Fong, Auckland University of Technology, Auckland
Bo Wang, Nanjing University of Aeronautics and Astronautics, Nanjing
Jing Zhang, Tsinghua University, Beijing
ABSTRACT
Despite years of research, the name ambiguity problem remains largely unresolved. Outstanding issues include how to capture all information for name disambiguation in a unified approach, and how to determine the number of people K in the disambiguation process. In this paper, we formalize the problem in a unified probabilistic framework, which incorporates both attributes and relationships. Specifically, we define a disambiguation objective function for the problem and propose a two-step parameter estimation algorithm. We also investigate a dynamic approach for estimating the number of people K. Experiments show that our proposed framework significantly outperforms four baseline methods based on clustering algorithms and two other previous methods. Experiments also indicate that the number K automatically found by our method is close to the actual number.
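To make the idea of estimating the number of people K concrete, here is a minimal sketch in Python. It is not the framework proposed in the paper: it swaps in plain k-means with a BIC score under a spherical-Gaussian assumption, and it ignores relationships between papers entirely. The function names (`kmeans`, `bic`, `estimate_k`) and the toy 2-D "paper feature" vectors are hypothetical and exist only for illustration.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's old center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def bic(points, labels, centers):
    """BIC (lower is better) for hard assignments, one shared spherical variance."""
    n, d, k = len(points), len(points[0]), len(centers)
    sse = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    var = max(sse / (n * d), 1e-12)  # floor avoids log(0) on degenerate fits
    sizes = [labels.count(j) for j in range(k)]
    log_lik = sum(s * math.log(s / n) for s in sizes if s > 0)
    log_lik -= 0.5 * n * d * (math.log(2 * math.pi * var) + 1)
    n_params = k * d + (k - 1) + 1  # centers, mixing weights, shared variance
    return -2 * log_lik + n_params * math.log(n)

def estimate_k(points, k_max):
    """Score K = 1..k_max and return the K with the lowest BIC, plus all scores."""
    scores = {k: bic(points, *kmeans(points, k)) for k in range(1, k_max + 1)}
    return min(scores, key=scores.get), scores

# Toy data (an assumption): 18 "papers" from three well-separated "authors".
offsets = [(0, 0), (0.2, 0.1), (-0.1, 0.3), (0.3, -0.2), (-0.2, -0.1), (0.1, 0.2)]
points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (5, 5), (10, 0)] for dx, dy in offsets]

best_k, scores = estimate_k(points, k_max=6)
print(best_k)
```

On this toy data the score is lowest at K = 3, because splitting a tight group further improves the fit less than the extra parameters cost. Real name disambiguation data is far noisier, so this should be read only as an illustration of penalized model selection over K.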
INDEX TERMS
Digital libraries, information search and retrieval, database applications, heterogeneous databases.
CITATION
Jie Tang, A.C.M. Fong, Bo Wang, Jing Zhang, "A Unified Probabilistic Framework for Name Disambiguation in Digital Library", IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 6, pp. 975-987, June 2012, doi:10.1109/TKDE.2011.13