Issue No.04 - April (2009 vol.31)
pp: 721-735
Man Lan , East China Normal University, Shanghai
Chew Lim Tan , National University of Singapore, Singapore
Jian Su , Institute for Infocomm Research, Singapore
Yue Lu , East China Normal University, Shanghai
ABSTRACT
In the vector space model (VSM), text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Different terms (i.e., words, phrases, or any other indexing units used to identify the contents of a text) have different importance in a text. Term weighting methods assign appropriate weights to the terms to improve the performance of text categorization. In this study, we investigate several widely used unsupervised (traditional) and supervised term weighting methods on benchmark data collections in combination with SVM and kNN algorithms. Taking into account the distribution of relevant documents in the collection, we propose a new, simple supervised term weighting method, tf.rf, to improve the terms' discriminating power for the text categorization task. In the controlled experiments, the supervised term weighting methods show mixed performance. Specifically, our proposed method, tf.rf, performs consistently better than the other term weighting methods, while the other supervised methods, which are based on information theory or statistical metrics, perform the worst in all experiments. On the other hand, the popular tf.idf method does not perform uniformly well across the different data sets.
INDEX TERMS
Knowledge and data engineering tools and techniques, Clustering, classification, and association rules, Text mining, Database Applications, Database Management, Information Technology and Systems, Indexing methods, Content Analysis and Indexing, Information Storage and Retrieval, Information Technology, Text analysis, Natural Language Processing, Artificial Intelligence, Computing Methodologies
CITATION
Man Lan, Chew Lim Tan, Jian Su, Yue Lu, "Supervised and Traditional Term Weighting Methods for Automatic Text Categorization", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 31, no. 4, pp. 721-735, April 2009, doi:10.1109/TPAMI.2008.110