Sentence Similarity Based on Semantic Nets and Corpus Statistics
August 2006 (vol. 18 no. 8)
pp. 1138-1150
Sentence similarity measures play an increasingly important role in text-related research and applications in areas such as text mining, Web page retrieval, and dialogue systems. Existing methods for computing sentence similarity have been adopted from approaches used for long text documents. These methods process sentences in a very high-dimensional space and are consequently inefficient, require human input, and are not adaptable to some application domains. This paper focuses directly on computing the similarity between very short texts of sentence length. It presents an algorithm that takes account of semantic information and word order information implied in the sentences. The semantic similarity of two sentences is calculated using information from a structured lexical database and from corpus statistics. The use of a lexical database enables our method to model human common sense knowledge and the incorporation of corpus statistics allows our method to be adaptable to different domains. The proposed method can be used in a variety of applications that involve text knowledge representation and discovery. Experiments on two sets of selected sentence pairs demonstrate that the proposed method provides a similarity measure that shows a significant correlation to human intuition.

Index Terms:
Sentence similarity, semantic nets, corpus, natural language processing, word similarity.
Yuhua Li, David McLean, Zuhair A. Bandar, James D. O'Shea, Keeley Crockett, "Sentence Similarity Based on Semantic Nets and Corpus Statistics," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 8, pp. 1138-1150, Aug. 2006, doi:10.1109/TKDE.2006.130