Issue No.03 - March (2009 vol.21)
pp: 428-442
Xiao-Bing Xue, Nanjing University, Nanjing
Zhi-Hua Zhou, Nanjing University, Nanjing
Text categorization is the task of assigning predefined categories to natural language text. With the widely used 'bag of words' representation, previous research usually assigns each word a value indicating either whether the word appears in the document concerned or how frequently it appears. Although these values are useful for text categorization, they do not fully express the information contained in the document. This paper explores the effect of another type of value, which expresses the distribution of a word in the document. These novel values are called distributional features; they include the compactness of the appearances of the word and the position of its first appearance. The proposed distributional features are exploited by a tfidf-style equation, and different features are combined using ensemble learning techniques. Experiments show that the distributional features are useful for text categorization. Compared with using the traditional term frequency values alone, including the distributional features requires only a little additional cost, while the categorization performance can be significantly improved. Further analysis shows that the distributional features are especially useful when documents are long and the writing style is casual.
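To make the two kinds of distributional features concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's definitions: the abstract does not say how compactness or first appearance is measured, so the function name `distributional_features`, the choice of sentence-like parts as the unit of position, and the three compactness measures (number of parts containing the word, distance between first and last appearance, variance of the positions) are all hypothetical. The tfidf-style weighting and the ensemble step are not attempted here.

```python
from statistics import pvariance


def distributional_features(parts, word):
    """Describe how `word` is spread over a document.

    `parts` is a list of token lists, one per document part
    (assumed here to be sentences; the unit is a free choice).
    Returns None if the word never appears.
    """
    # Indices of the parts in which the word occurs at least once.
    positions = [i for i, part in enumerate(parts) if word in part]
    if not positions:
        return None
    return {
        # Position of the first appearance (0 = first part).
        "first_appearance": positions[0],
        # Three hypothetical views of compactness: a word that occurs
        # in few parts, close together, is "compact".
        "part_count": len(positions),
        "first_last_distance": positions[-1] - positions[0],
        "position_variance": pvariance(positions),
    }


# Toy document with " . " as an assumed sentence separator.
doc = "the cat sat . a dog ran . the cat slept . birds sang"
parts = [sentence.split() for sentence in doc.split(" . ")]
feats = distributional_features(parts, "cat")
print(feats)
```

In this toy document, "cat" first appears in part 0 and occurs in two parts that are two positions apart, whereas "dog" occurs in a single part and so has zero spread. Values like these could sit alongside term frequency as additional per-word inputs to a classifier.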
Data mining, Text mining, Modeling structured, textual and multimedia data
Xiao-Bing Xue, Zhi-Hua Zhou, "Distributional Features for Text Categorization", IEEE Transactions on Knowledge & Data Engineering, vol.21, no. 3, pp. 428-442, March 2009, doi:10.1109/TKDE.2008.166