Third IEEE International Conference on Data Mining (ICDM'03)
Mining the Web to Discover the Meanings of an Ambiguous Word
Melbourne, Florida
November 19-22, 2003
ISBN: 0-7695-1978-4
In information retrieval and text mining, information on word senses is usually taken from dictionaries or lexical databases prepared by lexicographers. In this paper we propose an automatic method for word sense induction, i.e., for discovering a set of sense descriptors for a given ambiguous word. The approach is based on word co-occurrence statistics derived from web pages. The underlying assumption is that the senses of an ambiguous word are best described by terms that are strongly associated with this word but are mutually exclusive, i.e., whose association strength with one another within the retrieved web pages is as weak as possible. Association strength is measured with a novel Confidence Gain approach that relates the observed co-occurrence frequency of two sense descriptor candidates to the average co-occurrence frequency of pairs of arbitrary words. The proposed approach is fully unsupervised and takes into account the contemporary meanings of words, as reflected in texts from the internet. Our results are evaluated using a list of ambiguous words commonly referred to in the literature.
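
The selection idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the Confidence Gain formula, so the hypothetical function relative_association below simply divides the observed co-occurrence count of a pair by the mean count over all observed pairs, as an assumed stand-in. The toy documents, the candidate list, and all function names are likewise illustrative assumptions.

```python
from itertools import combinations


def cooccurrence_counts(docs):
    """For each unordered word pair, count the documents containing both words."""
    counts = {}
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        for pair in combinations(words, 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def pair_count(counts, a, b):
    """Look up the document co-occurrence count of an unordered pair."""
    return counts.get(tuple(sorted((a, b))), 0)


def relative_association(counts, a, b):
    """Assumed stand-in for an association measure (NOT the paper's formula):
    observed pair count divided by the mean count over all observed pairs."""
    if not counts:
        return 0.0
    baseline = sum(counts.values()) / len(counts)
    return pair_count(counts, a, b) / baseline


def pick_sense_descriptors(counts, target, candidates):
    """Pick the candidate pair that is weakest in mutual association,
    breaking ties in favour of stronger association with the target."""
    linked = sorted(c for c in candidates if pair_count(counts, target, c) > 0)

    def score(pair):
        a, b = pair
        mutual = relative_association(counts, a, b)
        to_target = (relative_association(counts, target, a)
                     + relative_association(counts, target, b))
        return (mutual, -to_target)

    return min(combinations(linked, 2), key=score, default=None)


# Toy stand-in for retrieved web pages about the ambiguous word "palm".
docs = [
    "palm tree beach coconut",
    "palm tree coconut leaves",
    "palm hand finger",
    "palm hand finger wrist",
    "palm tree leaves",
    "palm hand wrist",
]
counts = cooccurrence_counts(docs)
descriptors = pick_sense_descriptors(
    counts, "palm", ["tree", "coconut", "hand", "finger"]
)
print(descriptors)
```

On this toy corpus, "tree" and "hand" each co-occur with "palm" in three documents but never with each other, so they are the pair the sketch selects. A real system would gather counts from retrieved web pages and would use the paper's own Confidence Gain measure in place of the stand-in ratio.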