Keep It Simple with Time: A Reexamination of Probabilistic Topic Detection Models
October 2010 (vol. 32 no. 10)
pp. 1795-1808
Qi He, Pennsylvania State University, State College
Kuiyu Chang, Nanyang Technological University, Singapore
Ee-Peng Lim, Singapore Management University, Singapore
Arindam Banerjee, University of Minnesota, Twin Cities, Minneapolis
Topic detection (TD) is a fundamental research issue in the Topic Detection and Tracking (TDT) community with practical implications; TD helps analysts separate the wheat from the chaff among thousands of incoming news streams. In this paper, we propose a simple and effective topic detection model called the temporal Discriminative Probabilistic Model (DPM), which is shown to be theoretically equivalent to the classic vector space model with feature selection and temporally discriminative weights. We compare DPM to its various probabilistic cousins, ranging from mixture models like von Mises-Fisher (vMF) to mixed membership models like Latent Dirichlet Allocation (LDA). Benchmark results on the TDT3 data set show that sophisticated models, such as vMF and LDA, do not necessarily lead to better results; in the case of LDA, notably worse performance was obtained under variational inference, likely because of the large number of LDA model parameters involved in document-level topic detection. In contrast, a relatively simple time-aware probabilistic model such as DPM suffices for both offline and online topic detection tasks, making DPM a theoretically elegant and effective model for practical topic detection.

Index Terms:
Topic detection, probabilistic model, time-aware, bursty feature, online, DPM, TFIDF.
Citation:
Qi He, Kuiyu Chang, Ee-Peng Lim, Arindam Banerjee, "Keep It Simple with Time: A Reexamination of Probabilistic Topic Detection Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 10, pp. 1795-1808, Oct. 2010, doi:10.1109/TPAMI.2009.203