2007 Seventh IEEE International Conference on Data Mining
Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval
Omaha, Nebraska, USA
October 28-31, 2007
ISBN: 0-7695-3018-4
Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. However, word order and phrases are often critical to capturing the meaning of text in text mining tasks. This paper presents topical n-grams, a topic model that discovers topics as well as topical phrases. The probabilistic model generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, and then sampling the word from a topic-specific unigram or bigram distribution. Thus our model can treat "white house" as a phrase with a special meaning in the 'politics' topic, but not in the 'real estate' topic. Successive bigrams form longer phrases. We present experiments showing meaningful phrases and more interpretable topics from the NIPS data and improved information retrieval performance on a TREC collection.
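The three sampling steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary, the probability tables, the names `generate_document` and `merge_phrases`, and the choice to condition the bigram status on the current topic and the previous word are all assumptions made for illustration, and the Dirichlet and Beta priors of a full topic model are omitted.

```python
import random

# Hypothetical toy parameters (not taken from the paper).
TOPIC_WEIGHTS = {"politics": 0.5, "real estate": 0.5}
UNIGRAM = {
    "politics": {"white": 0.3, "house": 0.3, "senate": 0.4},
    "real estate": {"white": 0.2, "house": 0.5, "price": 0.3},
}
# Topic-specific bigram distributions, keyed by (topic, previous word).
# Only the 'politics' topic has a continuation for "white".
BIGRAM = {("politics", "white"): {"house": 1.0}}
# Probability that the next word is a bigram continuation (assumed conditioning).
BIGRAM_PROB = {("politics", "white"): 0.9}


def sample(dist, rng):
    """Draw one key from a {item: weight} dictionary."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]


def generate_document(n_words, rng):
    """Generate (word, topic, is_bigram) triples in textual order."""
    tokens = []
    prev = None
    for _ in range(n_words):
        # Step 1: sample a topic for this word.
        topic = sample(TOPIC_WEIGHTS, rng)
        # Step 2: sample the word's status as unigram or bigram.
        key = (topic, prev)
        is_bigram = key in BIGRAM and rng.random() < BIGRAM_PROB.get(key, 0.0)
        # Step 3: sample the word from the topic-specific
        # unigram or bigram distribution.
        dist = BIGRAM[key] if is_bigram else UNIGRAM[topic]
        word = sample(dist, rng)
        tokens.append((word, topic, is_bigram))
        prev = word
    return tokens


def merge_phrases(tokens):
    """Join each bigram-flagged word onto the preceding word or phrase."""
    phrases = []
    for word, _topic, is_bigram in tokens:
        if is_bigram and phrases:
            phrases[-1] += " " + word
        else:
            phrases.append(word)
    return phrases


if __name__ == "__main__":
    doc = generate_document(12, random.Random(0))
    print(merge_phrases(doc))
```

In this sketch "white house" can only be emitted as a phrase when the word following "white" is assigned to the 'politics' topic; under 'real estate' the two words are always drawn as independent unigrams. Because `merge_phrases` appends every bigram-flagged word to the running phrase, consecutive bigram flags would produce phrases longer than two words.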
Citation:
Xuerui Wang, Andrew McCallum, Xing Wei, "Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval," 2007 Seventh IEEE International Conference on Data Mining (ICDM), pp. 697-702, 2007