Issue No.01 - January (2012 vol.24)
pp: 15-29
Jeremy H. Wright, AT&T Labs - Research, Florham Park
John Grothendieck, Raytheon BBN Technologies
ABSTRACT
Text streams are ubiquitous and contain a wealth of information, but are typically orders of magnitude too large for comprehensive human inspection. There is a need for tools that can detect and group changes occurring within text streams and substreams, in order to find, structure, and summarize these changes for presentation to human analysts. This paper describes a procedure for efficiently finding step changes, trends, bursts, and cyclic changes affecting the frequencies of words, or more general lexical items, within streams of documents that may optionally be labeled with metadata. The common phenomenon of over-dispersion is accommodated using mixture distributions. A streaming implementation is described that can process data from a continuous feed. Anomalies can be detected, grouped, and rendered visually for human comprehension.
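To make the idea of a step change in a word's frequency concrete, here is a minimal sketch in Python. It is not the procedure described in the paper: it assumes a plain Poisson model with a single change-point and no mixture distribution for over-dispersion, and the names `poisson_loglik` and `best_step_change` are hypothetical. `counts` holds the occurrences of one word per time bin, and `exposures` holds the total number of words seen in each bin.

```python
import math


def poisson_loglik(count, exposure):
    """Maximised Poisson log-likelihood for one segment.

    Terms that do not depend on the rate are dropped, since they cancel
    when segments are compared. Assumes exposure > 0.
    """
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    if count == 0:
        return 0.0  # the rate estimate is 0, so the log-likelihood is 0
    rate = count / exposure
    return count * math.log(rate) - rate * exposure


def best_step_change(counts, exposures):
    """Return (k, statistic) for the best single step change.

    The change is placed between bins k-1 and k. The statistic is twice
    the log-likelihood gain of two constant rates over one constant rate
    (a likelihood-ratio statistic). This is an illustrative assumption,
    not the paper's method.
    """
    total_c, total_e = sum(counts), sum(exposures)
    null = poisson_loglik(total_c, total_e)
    best_k, best_stat = None, 0.0
    left_c, left_e = 0, 0
    for k in range(1, len(counts)):
        # Running totals for the segment to the left of the candidate split.
        left_c += counts[k - 1]
        left_e += exposures[k - 1]
        alt = (poisson_loglik(left_c, left_e)
               + poisson_loglik(total_c - left_c, total_e - left_e))
        stat = 2.0 * (alt - null)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat


# Hypothetical data: a word whose rate jumps after the fifth bin.
counts = [2, 3, 2, 3, 2, 12, 11, 13, 12, 11]
exposures = [1000] * 10
k, stat = best_step_change(counts, exposures)
print(k, round(stat, 1))
```

With real text streams the counts are usually over-dispersed relative to a Poisson model, so a statistic like this one would tend to overstate significance; that is the problem the abstract says is handled with mixture distributions.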
INDEX TERMS
Statistical software, modeling structured, textual and multimedia data, text mining.
CITATION
Jeremy H. Wright, John Grothendieck, "CoCITe—Coordinating Changes in Text", IEEE Transactions on Knowledge & Data Engineering, vol.24, no. 1, pp. 15-29, January 2012, doi:10.1109/TKDE.2010.250