Issue No.06 - June (2012 vol.24)
pp: 1014-1024
J. Zobel, Dept. of Comput. Sci. & Software Eng., Univ. of Melbourne, Melbourne, VIC, Australia
ABSTRACT
Extended Boolean retrieval (EBR) models were proposed nearly three decades ago, but have had little practical impact, despite their significant advantages compared to either ranked keyword or pure Boolean retrieval. In particular, EBR models produce meaningful rankings; their query model allows the representation of complex concepts in an and-or format; and they are scrutable, in that the score assigned to a document depends solely on the content of that document, unaffected by any collection statistics or other external factors. These characteristics make EBR models attractive in domains typified by medical and legal searching, where the emphasis is on iterative development of reproducible complex queries of dozens or even hundreds of terms. However, EBR is much more computationally expensive than the alternatives. We consider the implementation of the p-norm approach to EBR, and demonstrate that ideas used in the max-score and wand exact optimization techniques for ranked keyword retrieval can be adapted to allow selective bypass of documents via a low-cost screening process for this and similar retrieval models. We also propose term-independent bounds that are able to further reduce the number of score calculations for short, simple queries under the extended Boolean retrieval model. Together, these methods yield an overall saving from 50 to 80 percent of the evaluation cost on test queries drawn from biomedical search.
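As a rough illustration of why p-norm scores are scrutable, the sketch below evaluates an and-or query tree against a single document using the standard unweighted p-norm formulas with binary term weights. The class and function names (`Term`, `Op`, `pnorm_score`) are hypothetical and chosen for this example only; the sketch shows plain score calculation, not the screening or bounding optimizations the abstract describes.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass
class Term:
    """Leaf node: a single query term (hypothetical structure)."""
    word: str


@dataclass
class Op:
    """Internal node: an "and" or "or" operator with its own p value."""
    kind: str                 # "and" or "or"
    p: float                  # p-norm parameter, assumed to be >= 1
    children: List["Node"]


Node = Union[Term, Op]


def pnorm_score(node: Node, doc_terms: Set[str]) -> float:
    """Score one document in [0, 1] using unweighted p-norm operators.

    Assumes binary term weights: 1.0 if the term occurs in the document,
    0.0 otherwise. No collection statistics are consulted, so the score
    depends only on the document's own content.
    """
    if isinstance(node, Term):
        return 1.0 if node.word in doc_terms else 0.0

    scores = [pnorm_score(child, doc_terms) for child in node.children]
    n = len(scores)
    if node.kind == "or":
        # OR: normalised p-norm distance from the all-zero point.
        return (sum(s ** node.p for s in scores) / n) ** (1.0 / node.p)
    if node.kind == "and":
        # AND: one minus the normalised p-norm distance from the all-one point.
        return 1.0 - (sum((1.0 - s) ** node.p for s in scores) / n) ** (1.0 / node.p)
    raise ValueError(f"unknown operator: {node.kind}")


# Example query: a AND (b OR c), with p = 2 at both operators.
query = Op("and", 2.0, [Term("a"), Op("or", 2.0, [Term("b"), Term("c")])])
partial = pnorm_score(query, {"a", "b"})
full = pnorm_score(query, {"a", "b", "c"})
none = pnorm_score(query, {"x"})
```

With p = 1 both operators reduce to a simple average of the child scores, and as p grows they approach strict Boolean behaviour, which is why a document matching only part of the query still receives a graded score.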
INDEX TERMS
statistical analysis, Boolean functions, document handling, information retrieval, keyword retrieval, efficient extended Boolean retrieval, EBR, complex concepts, document content, statistics collection, legal searching, medical searching, optimization techniques, Optimization, Query processing, Computational modeling, Biological system modeling, Law, Systematics, Document-at-a-time, efficiency, extended Boolean retrieval, p-norm
CITATION
J. Zobel, "Efficient Extended Boolean Retrieval," IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 6, pp. 1014-1024, June 2012, doi:10.1109/TKDE.2011.63