Unsupervised Semantic Similarity Computation between Terms Using Web Documents
November 2010 (vol. 22 no. 11)
pp. 1637-1647
Elias Iosif, Technical University of Crete, Chania
Alexandros Potamianos, Technical University of Crete, Chania
In this work, Web-based metrics that compute the semantic similarity between words or terms are presented and compared with the state of the art. Starting from the fundamental assumption that similarity of context implies similarity of meaning, relevant Web documents are downloaded via a Web search engine and the contextual information of words of interest is compared (context-based similarity metrics). The proposed algorithms work automatically, do not require any human-annotated knowledge resources, e.g., ontologies, and can be generalized and applied to different languages. Context-based metrics are evaluated both on the Charles-Miller data set and on a medical term data set. It is shown that context-based similarity metrics significantly outperform co-occurrence-based metrics, in terms of correlation with human judgment, for both tasks. In addition, the proposed unsupervised context-based similarity computation algorithms are shown to be competitive with the state-of-the-art supervised semantic similarity algorithms that employ language-specific knowledge resources. Specifically, context-based metrics achieve correlation scores of up to 0.88 and 0.74 for the Charles-Miller and medical data sets, respectively. The effect of stop word filtering is also investigated for word and term similarity computation. Finally, the performance of context-based term similarity metrics is evaluated as a function of the number of Web documents used and for various feature weighting schemes.
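The idea of a context-based similarity metric can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes the documents have already been retrieved, uses a hypothetical fixed-size word window as the context, raw counts as feature weights, and cosine similarity to compare the two context vectors. All function names and parameters below are illustrative, and only single-word terms are handled.

```python
import math
import re
from collections import Counter


def context_vector(term, documents, window=2, stop_words=frozenset()):
    """Count the words that occur within `window` tokens of `term`.

    Assumption: stop words are removed before the window is applied,
    so the window spans the remaining (filtered) tokens.
    """
    counts = Counter()
    for doc in documents:
        tokens = [
            t for t in re.findall(r"[a-z0-9]+", doc.lower())
            if t not in stop_words
        ]
        for i, tok in enumerate(tokens):
            if tok != term:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a term with no observed context has no similarity
    return dot / (norm_u * norm_v)


def context_similarity(a, b, documents, window=2, stop_words=frozenset()):
    """Similarity of two terms, measured by the similarity of their contexts."""
    return cosine(
        context_vector(a, documents, window, stop_words),
        context_vector(b, documents, window, stop_words),
    )
```

Calling `context_similarity("car", "automobile", docs)` on a list of document strings returns a score between 0 and 1. Passing a `stop_words` set gives a simple way to explore the effect of stop word filtering, and replacing raw counts with another weighting would correspond to trying a different feature weighting scheme.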

Index Terms:
Natural language processing, semantic similarity, Web search, ontologies, knowledge acquisition.
Elias Iosif, Alexandros Potamianos, "Unsupervised Semantic Similarity Computation between Terms Using Web Documents," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 11, pp. 1637-1647, Nov. 2010, doi:10.1109/TKDE.2009.193