Issue No.10 - October (2011 vol.23)
pp: 1555-1568
Vagelis Hristidis, Florida International University, Miami
Yuheng Hu, Arizona State University, Tempe
Panagiotis G. Ipeirotis, New York University, New York
ABSTRACT
Many online or local data sources provide powerful querying mechanisms but limited ranking capabilities. For instance, PubMed allows users to submit highly expressive Boolean keyword queries, but ranks the query results by date only. However, a user would typically prefer a ranking by relevance, measured by an information retrieval (IR) ranking function. A naive approach would be to submit a disjunctive query with all query keywords, retrieve all the matching documents, and then rerank them. Unfortunately, such an operation would be very expensive due to the large number of results returned by disjunctive queries. In this paper, we present algorithms that return the top results for a query, ranked according to an IR-style ranking function, while operating on top of a source with a Boolean query interface with no ranking capabilities (or a ranking capability of no interest to the end user). The algorithms generate a series of conjunctive queries that return only documents that are candidates for being highly ranked according to a relevance metric. Our approach can also be applied to other settings where the ranking is monotonic on a set of factors (query keywords in IR) and the source query interface is a Boolean expression of these factors. Our comprehensive experimental evaluation on the PubMed database and a TREC data set shows that we achieve an order-of-magnitude improvement over the current baseline approaches.
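The contrast between the naive disjunctive baseline and a conjunctive-query strategy can be illustrated with a minimal sketch. It is not the paper's algorithm, which the abstract does not spell out. It assumes a toy in-memory corpus, hypothetical keyword weights, and a simplified monotonic score (the sum of the weights of the query keywords present in a document) in place of a full IR ranking function. All names (`DOCS`, `WEIGHTS`, `boolean_and`, `boolean_or`, `naive_top_k`, `conjunctive_top_k`) are illustrative only.

```python
from itertools import combinations

# Toy corpus standing in for a hidden-web source: doc id -> set of terms.
DOCS = {
    1: {"cancer", "gene", "therapy"},
    2: {"cancer", "gene"},
    3: {"gene"},
    4: {"therapy", "mouse"},
    5: {"cancer", "therapy"},
    6: {"mouse"},
}

# Hypothetical positive per-keyword weights (IDF-like), chosen so that
# every keyword subset has a distinct total weight.
WEIGHTS = {"cancer": 4.0, "gene": 2.0, "therapy": 1.0}


def boolean_and(terms):
    """Stand-in for the source's conjunctive (AND) Boolean interface."""
    return sorted(d for d, words in DOCS.items() if set(terms) <= words)


def boolean_or(terms):
    """Stand-in for the source's disjunctive (OR) Boolean interface."""
    return sorted(d for d, words in DOCS.items() if set(terms) & words)


def score(doc_id, query):
    """Simplified monotonic score: total weight of matched query keywords."""
    return sum(WEIGHTS[t] for t in query if t in DOCS[doc_id])


def naive_top_k(query, k):
    """Baseline: fetch every disjunctive match, then rerank locally."""
    candidates = boolean_or(query)
    ranked = sorted(candidates, key=lambda d: (-score(d, query), d))
    return ranked[:k], len(candidates)


def conjunctive_top_k(query, k):
    """Issue AND queries over keyword subsets, heaviest subset first.

    With positive weights and this presence-based score, a document first
    seen for subset S matches exactly S, so documents arrive in descending
    score order and the loop can stop once k documents are collected.
    """
    subsets = [c for r in range(1, len(query) + 1)
               for c in combinations(query, r)]
    subsets.sort(key=lambda s: -sum(WEIGHTS[t] for t in s))
    seen, results, fetched = set(), [], 0
    for subset in subsets:
        hits = boolean_and(subset)
        fetched += len(hits)  # counts every returned document, repeats included
        for d in hits:
            if d not in seen:
                seen.add(d)
                results.append(d)
        if len(results) >= k:
            break
    return results[:k], fetched
```

For a small `k`, the conjunctive strategy in this toy setting stops after a few selective queries, whereas the baseline always fetches every disjunctive match. For a large `k` it can fetch more in total, because later queries return documents that were already seen. A realistic ranking function that also depends on term frequency would need score bounds rather than the exact-match argument used here.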
INDEX TERMS
Hidden-web databases, keyword search, top-k ranking.
CITATION
Vagelis Hristidis, Yuheng Hu, Panagiotis G. Ipeirotis, "Relevance-Based Retrieval on Hidden-Web Text Databases without Ranking Support", IEEE Transactions on Knowledge & Data Engineering, vol. 23, no. 10, pp. 1555-1568, October 2011, doi:10.1109/TKDE.2010.183