Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference on (2011)
Aug. 22, 2011 to Aug. 27, 2011
Assuming a binomial distribution for word occurrence, we propose computing a standardized Z score to define the specific vocabulary of a subset compared to that of the entire corpus. This approach is applied to weight terms characterizing a document (or a sample of texts). We then show how these Z score values can be used to derive an efficient categorization scheme. To evaluate this proposition we categorize speeches given by B. Obama as either electoral or presidential. The results tend to show that the suggested classification scheme performs better than a Support Vector Machine scheme, and a Naive Bayes classifier (10-fold cross validation).
Text Categorization, Machine Learning, Lexical Analysis, Political Discourse, Natural Language Processing
O. Zubaryeva and J. Savoy, "Classification Based on Specific Vocabulary," 2011 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies(WI-IAT), Lyon, 2011, pp. 120-123.