2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI) (2016)
San Jose, CA, USA
Nov. 6, 2016 to Nov. 8, 2016
Recently, many topic models such as Latent Dirichlet Allocation (LDA) have made important progress toward generating high-level knowledge from a large corpus. They assume that a text consists of a mixture of topics, which is usually the case for regular articles but may not hold for a short text, which usually contains only one topic. In practice, a corpus may include both short and long texts; in this case, neither methods developed only for long texts nor methods developed only for short texts can generate satisfactory results. In this paper, we present an innovative method to discover latent topics from a heterogeneous corpus including both long and short texts. A new topic model based on the collapsed Gibbs sampling algorithm is developed for modeling such heterogeneous texts. Experiments on real-world datasets validate the effectiveness of the proposed model in comparison with other state-of-the-art models.
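The two assumptions contrasted above (a mixture of topics for a long text, a single topic for a short text) can be illustrated with a minimal generative sketch. This is not the model proposed in the paper, whose details are not given here. It only shows the difference between an LDA-style document and a one-topic-per-document text. All names (`generate_long_text`, `generate_short_text`, `topic_word`) and the toy distributions are assumptions made for illustration.

```python
import random


def sample_dirichlet(rng, alpha, k):
    """Draw from a symmetric k-dimensional Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(rng, probs):
    """Return one index drawn with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_long_text(rng, topic_word, alpha, length):
    """LDA-style assumption: every word draws its own topic from a per-document mixture."""
    theta = sample_dirichlet(rng, alpha, len(topic_word))
    topics, words = [], []
    for _ in range(length):
        z = sample_categorical(rng, theta)
        topics.append(z)
        words.append(sample_categorical(rng, topic_word[z]))
    return topics, words


def generate_short_text(rng, topic_word, topic_prior, length):
    """Short-text assumption: one topic is drawn once and shared by all words."""
    z = sample_categorical(rng, topic_prior)
    words = [sample_categorical(rng, topic_word[z]) for _ in range(length)]
    return [z] * length, words


# Hypothetical toy setup: 3 topics over a 6-word vocabulary (word ids 0..5).
TOPIC_WORD = [
    [0.45, 0.45, 0.025, 0.025, 0.025, 0.025],
    [0.025, 0.025, 0.45, 0.45, 0.025, 0.025],
    [0.025, 0.025, 0.025, 0.025, 0.45, 0.45],
]

rng = random.Random(0)
long_topics, long_words = generate_long_text(rng, TOPIC_WORD, alpha=1.0, length=50)
short_topics, short_words = generate_short_text(rng, TOPIC_WORD, [1 / 3] * 3, length=8)
```

In this sketch the short text always carries exactly one topic label, while the long text may switch topics from word to word. A corpus containing both kinds of text mixes the two generative assumptions.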
Rockets, Companies, Solid modeling, Twitter, Headphones, Mobile communication, Computer science, Collapsed Gibbs Sampling, Topic Model, LDA, Heterogeneous texts
Jipeng Qiang, Ping Chen, Wei Ding, Tong Wang, Fei Xie, Xindong Wu, "Topic Discovery from Heterogeneous Texts", 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 196-203, 2016, doi:10.1109/ICTAI.2016.0039