Digital Libraries, Joint Conference on (2004)
Tucson, AZ, USA
June 7, 2004 to June 11, 2004
ISBN: 1-58113-832-6
pp: 296-305
Cheng Li, Harvard School of Public Health, Boston, MA
Hongyuan Zha, The Pennsylvania State University, University Park, PA
Lee Giles, The Pennsylvania State University, University Park, PA
Hui Han, The Pennsylvania State University, University Park, PA
Kostas Tsioutsiouliklis, NEC Laboratories America, Inc., Princeton, NJ
ABSTRACT
Due to name abbreviations, identical names, name misspellings, and pseudonyms in publications or bibliographies (citations), an author may have multiple names and multiple authors may share the same name. Such name ambiguity degrades the performance of document retrieval, web search, and database integration, and may cause improper attribution to authors. This paper investigates two supervised learning approaches to disambiguating authors in citations. One approach uses the naive Bayes probability model, a generative model; the other uses Support Vector Machines (SVMs) [The Nature of Statistical Learning Theory] with a vector space representation of citations, a discriminative model. Both approaches use three types of citation attributes: co-author names, the title of the paper, and the title of the journal or proceedings. We illustrate the two approaches on two types of data: one collected from the web, mainly publication lists from homepages, and the other collected from the DBLP citation databases.
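To make the generative approach concrete, the following is a minimal sketch of a naive Bayes disambiguator over the three attribute types named in the abstract. It is an illustration under stated assumptions, not the authors' implementation: it pools all attribute-tagged tokens into a single multinomial model with Laplace smoothing, which may differ from the paper's actual probability model. The class name `NaiveBayesDisambiguator`, the helper `tokens`, the smoothing parameter `alpha`, and all citation data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokens(citation):
    """Turn one citation into attribute-tagged tokens (co-authors, title, venue)."""
    feats = [("coauthor", name.lower()) for name in citation["coauthors"]]
    feats += [("title", w) for w in citation["title"].lower().split()]
    feats += [("venue", w) for w in citation["venue"].lower().split()]
    return feats


class NaiveBayesDisambiguator:
    """Hypothetical pooled multinomial naive Bayes; one class per real author."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing (assumption)
        self.class_counts = Counter()             # citations seen per author
        self.feat_counts = defaultdict(Counter)   # token counts per author
        self.vocab = set()

    def fit(self, citations, author_ids):
        for citation, author in zip(citations, author_ids):
            self.class_counts[author] += 1
            for feat in tokens(citation):
                self.feat_counts[author][feat] += 1
                self.vocab.add(feat)
        return self

    def log_posterior(self, citation, author):
        """Unnormalized log P(author | citation)."""
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[author] / total)
        denom = sum(self.feat_counts[author].values()) + self.alpha * len(self.vocab)
        for feat in tokens(citation):
            if feat not in self.vocab:
                continue  # tokens never seen in training are ignored
            score += math.log((self.feat_counts[author][feat] + self.alpha) / denom)
        return score

    def predict(self, citation):
        return max(self.class_counts, key=lambda a: self.log_posterior(citation, a))


# Made-up training citations for two different people who both publish as "J. Smith".
train = [
    {"coauthors": ["A. Kumar"], "title": "kernel methods for text classification",
     "venue": "machine learning conference"},
    {"coauthors": ["A. Kumar", "B. Ortiz"], "title": "support vector learning for documents",
     "venue": "machine learning conference"},
    {"coauthors": ["C. Wei"], "title": "protein folding simulation",
     "venue": "journal of computational biology"},
    {"coauthors": ["C. Wei"], "title": "molecular dynamics of protein structure",
     "venue": "journal of computational biology"},
]
labels = ["smith_1", "smith_1", "smith_2", "smith_2"]

model = NaiveBayesDisambiguator().fit(train, labels)

query_ml = {"coauthors": ["B. Ortiz"], "title": "text classification with kernels",
            "venue": "machine learning workshop"}
query_bio = {"coauthors": ["C. Wei"], "title": "protein structure prediction",
             "venue": "journal of computational biology"}
print(model.predict(query_ml), model.predict(query_bio))
```

Tagging each token with its attribute keeps a word in a paper title distinct from the same word in a venue name, so each of the three attribute types contributes its own evidence. The discriminative alternative would instead map the same tokens to a vector and train an SVM per author.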
INDEX TERMS
Naive Bayes, Name Disambiguation, Support Vector Machine
CITATION
Cheng Li, Hongyuan Zha, Lee Giles, Hui Han, Kostas Tsioutsiouliklis, "Two Supervised Learning Approaches for Name Disambiguation in Author Citations", Digital Libraries, Joint Conference on, pp. 296-305, 2004, doi:10.1109/JCDL.2004.1336139