Issue No.07 - July (2011 vol.23)
pp: 977-990
Danushka Bollegala, The University of Tokyo, Tokyo
Yutaka Matsuo, The University of Tokyo, Tokyo
Mitsuru Ishizuka, The University of Tokyo, Tokyo
ABSTRACT
Measuring the semantic similarity between words is an important component in various tasks on the web such as relation extraction, community mining, document clustering, and automatic metadata extraction. Despite the usefulness of semantic similarity measures in these applications, accurately measuring semantic similarity between two words (or entities) remains a challenging task. We propose an empirical method to estimate semantic similarity using page counts and text snippets retrieved from a web search engine for two words. Specifically, we define various word co-occurrence measures using page counts and integrate those with lexical patterns extracted from text snippets. To identify the numerous semantic relations that exist between two given words, we propose a novel pattern extraction algorithm and a pattern clustering algorithm. The optimal combination of page-count-based co-occurrence measures and lexical pattern clusters is learned using support vector machines. The proposed method outperforms various baselines and previously proposed web-based semantic similarity measures on three benchmark data sets, showing a high correlation with human ratings. Moreover, the proposed method significantly improves accuracy in a community mining task.
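As an illustration of what a page-count-based co-occurrence measure can look like, the following is a minimal sketch rather than the authors' implementation. The abstract does not name the specific measures, so the standard Jaccard, Dice, overlap, and pointwise mutual information coefficients are used here. The function names, the `threshold` cutoff for discarding very rare co-occurrences, the `n_pages` index-size estimate, and the page counts in the usage lines are all assumptions made for the example.

```python
import math


def web_jaccard(h_p, h_q, h_pq, threshold=5):
    """Jaccard coefficient from page counts.

    h_p, h_q: page counts for the queries "P" and "Q" (hypothetical inputs).
    h_pq: page count for the conjunctive query "P AND Q".
    threshold: assumed cutoff; counts at or below it are treated as noise.
    """
    if h_pq <= threshold:
        return 0.0
    return h_pq / (h_p + h_q - h_pq)


def web_dice(h_p, h_q, h_pq, threshold=5):
    """Dice coefficient from page counts."""
    if h_pq <= threshold:
        return 0.0
    return 2 * h_pq / (h_p + h_q)


def web_overlap(h_p, h_q, h_pq, threshold=5):
    """Overlap (Simpson) coefficient from page counts."""
    if h_pq <= threshold:
        return 0.0
    return h_pq / min(h_p, h_q)


def web_pmi(h_p, h_q, h_pq, n_pages, threshold=5):
    """Pointwise mutual information (base 2) from page counts.

    n_pages: assumed estimate of the number of pages in the index.
    """
    if h_pq <= threshold:
        return 0.0
    joint = h_pq / n_pages
    return math.log2(joint / ((h_p / n_pages) * (h_q / n_pages)))


# Hypothetical page counts, chosen only to exercise the functions.
scores = [
    web_jaccard(100, 50, 25),
    web_dice(100, 50, 25),
    web_overlap(100, 50, 25),
    web_pmi(100, 50, 25, n_pages=1000),
]
```

In a setup like the one the abstract describes, scores of this kind would be placed in a feature vector alongside lexical-pattern features and passed to a support vector machine. That step is omitted here because the abstract gives no detail on it.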
INDEX TERMS
Web mining, information extraction, web text analysis.
CITATION
Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka, "A Web Search Engine-Based Approach to Measure Semantic Similarity between Words", IEEE Transactions on Knowledge & Data Engineering, vol.23, no. 7, pp. 977-990, July 2011, doi:10.1109/TKDE.2010.172