Issue No.08 - August (2008 vol.20)
pp: 1067-1076
ABSTRACT
Searching for words on a watchlist is one way in which large-scale surveillance of communication can be done, for example in intelligence and counterterrorism settings. One obvious defense is to replace words that might attract attention to a message with other, more innocuous, words. For example, the sentence "the attack will be tomorrow" might be altered to "the complex will be tomorrow", since 'complex' is a word whose frequency is close to that of 'attack'. Such substitutions are readily detectable by humans since they do not make sense. We address the problem of detecting such substitutions automatically, by looking for discrepancies between words and their contexts, and using only syntactic information. We define a set of measures, each of which is quite weak, but which together produce per-sentence detection rates around 90% with false positive rates around 10%. Rules for combining per-sentence detection into per-message detection can reduce the false positive and false negative rates for messages to practical levels. We test the approach using sentences from the Enron email and Brown corpora, representing informal and formal text respectively.
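To illustrate why combining per-sentence decisions can improve per-message rates, the following is a minimal sketch and not the combination rule used in the paper. It assumes a hypothetical "flag the message if at least k of its n sentences are flagged" rule, that sentence-level decisions are independent, and that every sentence of a suspicious message contains a substitution. The function names `prob_at_least` and `message_rates` are illustrative only; the 90% and 10% defaults are the approximate per-sentence rates quoted in the abstract.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def message_rates(n: int, k: int, tpr: float = 0.9, fpr: float = 0.1):
    """Per-message rates for a hypothetical k-of-n voting rule.

    Assumptions (not from the paper): sentence decisions are independent,
    and every sentence in a suspicious message contains a substitution.
    Returns (message detection rate, message false positive rate).
    """
    detection = prob_at_least(k, n, tpr)       # suspicious message is flagged
    false_positive = prob_at_least(k, n, fpr)  # innocent message is flagged
    return detection, false_positive


if __name__ == "__main__":
    # Sweep the threshold k for a five-sentence message.
    for k in range(1, 6):
        det, fp = message_rates(n=5, k=k)
        print(f"flag if >= {k} of 5 sentences: detection={det:.4f}, false positive={fp:.4f}")
```

Under these simplifying assumptions, a 3-of-5 rule gives a per-message detection rate of about 0.991 and a false positive rate of about 0.009. Real messages are unlikely to satisfy the independence assumption, so these figures indicate only the direction of the effect.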
INDEX TERMS
textual analysis, counterterrorism, word frequencies, data mining, pointwise mutual information, co-occurrence
CITATION
SW. Fong, D. Roussinov, D.B. Skillicorn, "Detecting Word Substitutions in Text", IEEE Transactions on Knowledge & Data Engineering, vol.20, no. 8, pp. 1067-1076, August 2008, doi:10.1109/TKDE.2008.94