Issue No.05 - May (2011 vol.33)
pp: 1009-1021
Xiaojun Quan, City University of Hong Kong, Hong Kong
Wenyin Liu, City University of Hong Kong, Hong Kong
Bite Qiu, City University of Hong Kong, Hong Kong
ABSTRACT
Term weighting has proven to be an effective way to improve the performance of text categorization. Very recently, with the development of user-interactive question answering, or community question answering, a need has emerged to accurately categorize questions into predefined categories. However, since a question is usually a short piece of text, can the existing term-weighting methods perform as consistently in question categorization as they do in text categorization? The answer is not clear: to the best of our knowledge, no prior work has addressed this problem despite its significance. In this study, we investigate the popular unsupervised and supervised term-weighting methods for question categorization. We also propose three new supervised term-weighting methods, namely $qf \ast icf$, $iqf \ast qf \ast icf$, and $vrf$. We compare them with existing unsupervised and supervised term-weighting methods through a series of experiments on question collections from Yahoo! Answers. The experimental results show that $iqf \ast qf \ast icf$ achieves the best performance among all term-weighting methods, while $qf \ast icf$ and $vrf$ are also competitive for question categorization. Meanwhile, $tf \ast OR$ is shown to be the most significant among the existing methods. In addition, $iqf \ast qf \ast icf$ and $vrf$ are also effective for long document categorization.
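For readers unfamiliar with term weighting, the following is a minimal sketch of the classic unsupervised tf*idf scheme applied to a few short questions. It is not one of the methods proposed in this paper, whose formulas are not given in the abstract. The function name `tf_idf`, the whitespace tokenizer, and the sample questions are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(questions):
    """Return one {term: weight} dict per question, using tf * log(N / df).

    Illustrative baseline only: naive whitespace tokenization, no stemming,
    no stop-word removal, no length normalization.
    """
    tokenized = [q.lower().split() for q in questions]
    n = len(tokenized)
    # Document frequency: number of questions that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this question
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Hypothetical sample questions, not taken from the paper's data set.
questions = [
    "how do i reset my router",
    "how do i bake bread",
    "best router for home",
]
weights = tf_idf(questions)
```

In this toy collection, a term that occurs in only one question (such as "reset") receives a higher weight than a term shared by two questions (such as "how"). Because each term appears at most once per question here, the tf factor contributes nothing and the weight is driven entirely by idf, which hints at why schemes designed for long documents may behave differently on short questions.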
INDEX TERMS
Question answering systems, term-weighting, question categorization, text categorization.
CITATION
Xiaojun Quan, Wenyin Liu, Bite Qiu, "Term Weighting Schemes for Question Categorization", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.33, no. 5, pp. 1009-1021, May 2011, doi:10.1109/TPAMI.2010.154