Issue No. 4, July-Aug. 2013 (vol. 17), pp. 32-38
Paul Clough , University of Sheffield
Mark Sanderson , RMIT University
Jiayu Tang , Alibaba.com
Tim Gollins , The National Archives, UK
Amy Warner , Royal Holloway, University of London
ABSTRACT
Evaluation is instrumental to developing and managing effective information retrieval systems. For this process, enlisting crowdsourcing has proven viable. However, less understood are crowdsourcing's limits for evaluation, particularly for domain-specific search. The authors compare relevance assessments gathered using crowdsourcing with those from a domain expert to evaluate different search engines in a large government archive. Although crowdsourced judgments rank the tested search engines in the same order as expert judgments, crowdsourced workers appear unable to distinguish different levels of highly accurate search results the way expert assessors can.
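The abstract's central finding is that crowdsourced and expert judgments rank the tested engines in the same order even though the judgment sets differ. A minimal sketch of what such a comparison looks like, with entirely invented document IDs, runs, and relevance labels (this is not the authors' code or data; the expert judging fewer documents relevant is an illustrative assumption):

```python
# Illustrative sketch: compare the system ordering induced by expert
# vs. crowdsourced relevance judgments. All runs and labels are invented.

def precision_at_k(ranked_docs, relevant, k=5):
    """Fraction of the top-k retrieved documents judged relevant."""
    return sum(1 for d in ranked_docs[:k] if d in relevant) / k

# Hypothetical top-5 results from three search engines.
runs = {
    "engine_A": ["d1", "d2", "d3", "d4", "d5"],
    "engine_B": ["d2", "d6", "d1", "d7", "d8"],
    "engine_C": ["d9", "d6", "d7", "d8", "d2"],
}

# Hypothetical judgment sets: here the expert is stricter than the crowd.
expert_relevant = {"d1", "d2", "d3"}
crowd_relevant = {"d1", "d2", "d3", "d4"}

def system_ordering(judgments):
    """Score each engine and return (engines best-to-worst, score dict)."""
    scores = {s: precision_at_k(docs, judgments) for s, docs in runs.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

expert_order, expert_scores = system_ordering(expert_relevant)
crowd_order, crowd_scores = system_ordering(crowd_relevant)

# The two judgment sets can agree on the ordering of engines even though
# the absolute scores (and the gaps between top systems) differ, which is
# the distinction the paper probes for highly accurate results.
print(expert_order)  # same ordering under both judgment sets here
print(crowd_order)
```

In this toy data both judgment sets rank the engines A > B > C, but the winning engine's score differs (0.6 under the expert labels, 0.8 under the crowd labels), mirroring the paper's point that agreement on ordering does not imply agreement on how sharply top systems are separated.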
INDEX TERMS
Performance evaluation, Navigation, Search engines, System analysis and design, Internet, Crowdsourcing, Search methods, Information retrieval, information search and retrieval, performance of systems
CITATION
Paul Clough, Mark Sanderson, Jiayu Tang, Tim Gollins, Amy Warner, "Examining the Limits of Crowdsourcing for Relevance Assessment," IEEE Internet Computing, vol. 17, no. 4, pp. 32-38, July-Aug. 2013, doi:10.1109/MIC.2012.95
REFERENCES
1. C.W. Cleverdon, Report on the Testing and Analysis of an Investigation Into the Comparative Efficiency of Indexing Systems, tech. report, ASLIB Cranfield Research Project, 1962.
2. V.R. Carvalho, M. Lease, and E. Yilmaz, “Crowdsourcing for Search Evaluation,” ACM SIGIR Forum, vol. 44, no. 2, 2010, pp. 17–22.
3. K.A. Kinney, S.B. Huffman, and J. Zhai, “How Evaluator Domain Expertise Affects Search Result Relevance Judgments,” Proc. 17th ACM Conf. Information and Knowledge Management, ACM, 2008, pp. 591–598.
4. M. Sanderson and A. Warner, “Training Students to Evaluate Search Engines,” The Information Retrieval Series, vol. 31, Springer, 2011, pp. 169–182.
5. P. Bailey et al., “Relevance Assessment: Are Judges Exchangeable and Does It Matter?” Proc. 31st Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, ACM, 2008, pp. 667–674.
6. E.M. Voorhees, “Topic Set Size Redux,” Proc. 32nd Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, ACM, 2009, pp. 806–807.
7. A. Broder, “A Taxonomy of Web Search,” SIGIR Forum, vol. 36, no. 2, 2002, pp. 3–10.
8. H. Zhu et al., “Navigating the Intranet with High Precision,” Proc. 16th Int'l Conf. World Wide Web, ACM, 2007, pp. 491–500.
9. A. Al-Maskari et al., “The Good and the Bad System: Does the Test Collection Predict Users' Effectiveness?” Proc. 31st Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, ACM, 2008, pp. 59–66.
10. C.W. Cleverdon, The Effect of Variations in Relevance Assessments in Comparative Experimental Tests of Index Languages, Cranfield Library report no. 3, Cranfield Inst. of Tech., 1970.
11. E.M. Voorhees, “Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness,” Proc. 21st Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, ACM, 1998, pp. 315–323.