Issue No.10 - Oct. (2012 vol.24)
pp: 1862-1875
Tao Cheng, Microsoft Research, Redmond
Hady W. Lauw, Institute for Infocomm Research, Singapore
Stelios Paparizos, Microsoft Research, Mountain View
ABSTRACT
Many queries issued to search engines today aim to find values in structured data (e.g., movie showtimes at a specific location). In such scenarios, there is often a mismatch between the values in the structured data (how content creators describe entities) and the web queries (how different users try to retrieve them). Recognizing the alternative ways people refer to an entity is therefore crucial for structured web search. In this paper, we study the problem of automatically generating entity synonyms over structured data, toward closing the gap between users and structured data. We propose an offline, data-driven approach that mines query logs for instances where content creators and web users apply a variety of strings to refer to the same webpages. This way, given a set of strings that reference entities, we generate an expanded set of equivalent strings (entity synonyms) for each entity. Our framework consists of three modules: candidate generation, candidate selection, and noise cleaning. We further study the cause of the problem by identifying different classes of entity synonyms. We verify the proposed method with experiments on real-life data sets, showing that it can significantly increase the coverage of structured web queries with good precision.
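The three modules named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how each module works, so everything below is an assumption made for illustration: the click log is taken to be a list of (query, clicked URL) pairs, the entity string is assumed to appear in the log as a query itself, selection uses a simple page-overlap ratio, and noise cleaning uses a hypothetical list of noise words. The function names, the `min_overlap` threshold, and the sample URLs are all hypothetical and are not the authors' method.

```python
from collections import defaultdict


def generate_candidates(click_log, entity):
    """Candidate generation (assumed): queries that clicked a page the entity string also reached."""
    entity_pages = {url for query, url in click_log if query == entity}
    return {query for query, url in click_log
            if url in entity_pages and query != entity}


def select_candidates(click_log, entity, candidates, min_overlap=0.5):
    """Candidate selection (assumed): keep candidates whose clicked pages mostly overlap the entity's."""
    pages_by_query = defaultdict(set)
    for query, url in click_log:
        pages_by_query[query].add(url)
    entity_pages = pages_by_query[entity]
    selected = set()
    for cand in candidates:
        cand_pages = pages_by_query[cand]
        # Fraction of the candidate's clicked pages that the entity string also reaches.
        overlap = len(cand_pages & entity_pages) / len(cand_pages)
        if overlap >= min_overlap:
            selected.add(cand)
    return selected


def clean_noise(candidates, noise_words):
    """Noise cleaning (assumed): drop candidates that contain a hypothetical noise word."""
    return {cand for cand in candidates
            if not (set(cand.split()) & set(noise_words))}


def mine_synonyms(click_log, entity, noise_words, min_overlap=0.5):
    """Run the three assumed stages in sequence for one entity string."""
    candidates = generate_candidates(click_log, entity)
    selected = select_candidates(click_log, entity, candidates, min_overlap)
    return clean_noise(selected, noise_words)


# Toy click log with placeholder URLs (not real data).
sample_log = [
    ("indiana jones and the kingdom of the crystal skull", "example.com/movie/1"),
    ("indiana jones and the kingdom of the crystal skull", "example.com/official"),
    ("indiana jones 4", "example.com/movie/1"),
    ("indiana jones 4", "example.com/official"),
    ("indiana jones 4 trailer", "example.com/movie/1"),
    ("harrison ford", "example.com/movie/1"),
    ("harrison ford", "example.com/actor/1"),
    ("harrison ford", "example.com/actor/bio"),
]

synonyms = mine_synonyms(
    sample_log,
    "indiana jones and the kingdom of the crystal skull",
    noise_words={"trailer"},
)
print(synonyms)
```

In this toy log, "harrison ford" shares one page with the entity but mostly leads elsewhere, so the overlap test rejects it, while "indiana jones 4 trailer" passes the overlap test and is removed only by the noise-word filter. A real system would presumably weight by click counts and use a more principled cleaning step.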
INDEX TERMS
Motion pictures, Web search, Noise, Search engines, Earth Observing System, Digital cameras, Databases, query log, Entity synonym, fuzzy matching, structured data, web query
CITATION
Tao Cheng, Hady W. Lauw, Stelios Paparizos, "Entity Synonyms for Structured Web Search", IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 10, pp. 1862-1875, Oct. 2012, doi:10.1109/TKDE.2011.168