Issue No.03 - March (2008 vol.20)
pp: 289-299
The goal of online event analysis is to detect events and their associated documents in real time from a continuous stream of documents generated by multiple information sources. Existing approaches (e.g., window-based, decay-function, and adaptive-threshold methods) incorporate the temporal relations of documents into traditional text categorization methods for event analysis. However, these methods suffer from the threshold dependence problem: their performance is acceptable only for a narrow range of thresholds, so it is difficult to designate an appropriate threshold in advance. In this paper, we propose a threshold-resilient algorithm, called Incremental Probabilistic Latent Semantic Indexing (IPLSI), which can capture the storyline development of an event without the threshold dependence problem. The IPLSI algorithm is theoretically sound and more efficient than naïve PLSI approaches. The results of the performance evaluation based on the TDT 4 corpus show that the proposed algorithm reduces the error tradeoff cost of event detection by as much as 14.51% and increases the threshold range for acceptable performance by 300% to 800%.
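The threshold dependence problem described above can be illustrated with a minimal sketch. The code below is not IPLSI and is not taken from the paper; it is a generic single-pass, similarity-threshold detector of the kind the abstract criticizes, written under our own assumptions. The function names (`cosine`, `detect_events`), the raw term-count representation, and the toy document stream are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_events(stream, threshold):
    """Single-pass detection: return one event label per document.

    A document joins the most similar existing event if the similarity
    reaches `threshold`; otherwise it starts a new event.
    """
    centroids = []  # one summed term-count vector per detected event
    labels = []
    for text in stream:
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if best is None or sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < threshold:
            centroids.append(Counter(vec))   # new event
            labels.append(len(centroids) - 1)
        else:
            centroids[best].update(vec)      # fold document into the event
            labels.append(best)
    return labels


# Hypothetical toy stream: two underlying stories, interleaved in time.
stream = [
    "earthquake strikes coastal city",
    "rescue teams reach earthquake city",
    "election results announced tonight",
    "earthquake rescue continues",
    "candidates dispute election results",
]

for threshold in (0.3, 0.45, 0.6):
    labels = detect_events(stream, threshold)
    print(threshold, labels, "events:", len(set(labels)))
```

On this toy stream the number of detected events grows from 2 to 3 to 5 as the threshold moves from 0.3 to 0.6, even though the documents cover only two stories. That sensitivity to a value that must be chosen in advance is the behaviour a threshold-resilient method is meant to reduce.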
Web mining, Clustering, Knowledge life cycles, Probabilistic algorithms
Tzu-Chuan Chou, Meng Chang Chen, "Using Incremental PLSI for Threshold-Resilient Online Event Analysis", IEEE Transactions on Knowledge & Data Engineering, vol.20, no. 3, pp. 289-299, March 2008, doi:10.1109/TKDE.2007.190702