Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007)
Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models
Omaha, Nebraska, USA
October 28-31, 2007
ISBN: 0-7695-3033-8
Statistical topic models such as Latent Dirichlet Allocation (LDA) have emerged as an attractive framework to model, visualize and summarize large document collections in a completely unsupervised fashion. One limitation of this family of models is the assumption that words within a document are exchangeable, which results in a 'bag-of-words' representation for documents as well as topics. As a consequence, valuable information that exists in the form of correlations between words is lost in these models. In this work, we adapt recent advances in sparse modeling techniques to the problem of modeling word correlations within topics and present a new algorithm called Sparse Word Graphs. Our experiments on the AP corpus reveal both long-distance and short-distance word correlations within topics that are semantically meaningful. In addition, the new algorithm scales to large collections because it captures only the most important correlations in a sparse manner.
Citation:
Ramesh Nallapati, Amr Ahmed, William Cohen, Eric Xing, "Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models," Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), pp. 343-348, 2007.