Issue No.07 - July (2011 vol.23)
pp: 961-976
Cam-Tu Nguyen, Tohoku University, Sendai
Dieu-Thu Le, University of Trento, Italy
Le-Minh Nguyen, Japan Advanced Institute of Science and Technology, Nomi
Susumu Horiguchi, Tohoku University, Sendai
Quang-Thuy Ha, Vietnam National University, Hanoi
This paper introduces a hidden topic-based framework for processing short and sparse documents (e.g., search result snippets, product descriptions, book/movie summaries, and advertising messages) on the Web. The framework addresses two main challenges posed by such documents: 1) data sparseness and 2) synonyms/homonyms. The former leaves documents with few shared words and contexts, while the latter are major linguistic obstacles in natural language processing (NLP) and information retrieval (IR). The underlying idea is that common hidden topics discovered from large external data sets (universal data sets), when added to short documents, make them less sparse and more topic-oriented. Hidden topics from universal data sets also help handle unseen data better. The framework can be applied to different natural languages and data domains. We evaluated the framework in two experiments on two important online applications (Web search result classification and matching/ranking for contextual advertising) with large-scale universal data sets, and achieved significant results.
Web mining, hidden topic analysis, sparse data, classification, matching, ranking, contextual advertising.
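The idea in the abstract of adding hidden topics to short documents can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the tiny `TOPIC_WORD` table stands in for a topic model estimated from a universal data set, `topic_distribution` is a crude normalized-score substitute for real topic inference, and the `threshold` and `scale` parameters are hypothetical. It only shows how two snippets with no words in common can become comparable once topic pseudo-words are added.

```python
import math
from collections import Counter

# Hypothetical p(word | topic) table. In practice this would come from a
# topic model estimated offline on a large universal data set.
TOPIC_WORD = {
    "topic_sports": {"football": 0.4, "league": 0.3, "goal": 0.2, "match": 0.1},
    "topic_finance": {"stock": 0.4, "market": 0.3, "bank": 0.2, "goal": 0.1},
}


def topic_distribution(tokens):
    """Crude stand-in for topic inference: sum p(word | topic), then normalize."""
    scores = {
        topic: sum(words.get(tok, 0.0) for tok in tokens)
        for topic, words in TOPIC_WORD.items()
    }
    total = sum(scores.values())
    if total == 0:
        return {topic: 0.0 for topic in scores}
    return {topic: score / total for topic, score in scores.items()}


def enrich(tokens, threshold=0.3, scale=3):
    """Bag of words plus topic pseudo-words for sufficiently probable topics."""
    bag = Counter(tokens)
    for topic, prob in topic_distribution(tokens).items():
        if prob >= threshold:
            # Repeat the pseudo-word in proportion to the topic probability.
            bag[topic] += round(prob * scale)
    return bag


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


snippet_a = ["football", "league"]
snippet_b = ["goal", "match"]

plain = cosine(Counter(snippet_a), Counter(snippet_b))
enriched = cosine(enrich(snippet_a), enrich(snippet_b))
print(f"plain={plain:.3f} enriched={enriched:.3f}")
```

The two snippets share no words, so their plain cosine similarity is zero; after enrichment both carry the same topic pseudo-word and the similarity becomes positive. A real system would replace the lookup table and the scoring shortcut with a trained topic model and proper inference.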
Cam-Tu Nguyen, Dieu-Thu Le, Le-Minh Nguyen, Susumu Horiguchi, Quang-Thuy Ha, "A Hidden Topic-Based Framework toward Building Applications with Short Web Documents," IEEE Transactions on Knowledge & Data Engineering, vol. 23, no. 7, pp. 961-976, July 2011, doi:10.1109/TKDE.2010.27