Issue No. 03 - March (2014 vol. 40)
ISSN: 0098-5589
pp: 307-323
Alex Groce, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
Todd Kulesza, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
Chaoqiang Zhang, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
Shalini Shamasunder, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
Margaret Burnett, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
Weng-Keen Wong, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
Simone Stumpf, Centre for HCI Design, School of Informatics, City University London, London, United Kingdom
Shubhomoy Das, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
Amber Shinsel, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
Forrest Bice, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
Kevin McIntosh, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
ABSTRACT
How do you test a program when only a single user, with no expertise in software testing, is able to determine if the program is performing correctly? Such programs are common today in the form of machine-learned classifiers. We consider the problem of testing this common kind of machine-generated program when the only oracle is an end user: e.g., only you can determine if your email is properly filed. We present test selection methods that provide very good failure rates even for small test suites, and show that these methods work in both large-scale random experiments using a “gold standard” and in studies with real users. Our methods are inexpensive and largely algorithm-independent. Key to our methods is an exploitation of properties of classifiers that is not possible in traditional software testing. Our results suggest that it is plausible for time-pressured end users to interactively detect failures—even very hard-to-find failures—without wading through a large number of successful (and thus less useful) tests. We additionally show that some methods are able to find the arguably most difficult-to-detect faults of classifiers: cases where machine learning algorithms have high confidence in an incorrect result.
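The abstract does not spell out the selection methods themselves, but one plausible instance of confidence-based test selection is uncertainty sampling: ask the end user to judge the test cases about which the classifier is least confident, since those are the ones most likely to expose failures. The sketch below is illustrative only, assuming a scikit-learn classifier; it is not the paper's algorithm, and it does not address the harder case the abstract mentions of high-confidence failures.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in classifier and data (in the paper's setting this would be,
# e.g., a learned email filer whose only oracle is its user).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Confidence of each prediction = highest predicted class probability.
proba = clf.predict_proba(X_test)
confidence = proba.max(axis=1)
predicted = proba.argmax(axis=1)

# Least-confident-first: the k test cases the end user is asked to judge,
# keeping the user's test suite small.
k = 10
to_review = np.argsort(confidence)[:k]
for i in to_review:
    print(f"test case {i}: predicted class {predicted[i]} "
          f"(confidence {confidence[i]:.2f})")

A complementary strategy, also hypothetical here, would rank high-confidence predictions for occasional review, targeting the confidently wrong cases the abstract identifies as the hardest faults to detect.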
INDEX TERMS
Testing, Software, Training, Training data, Electronic mail, Software algorithms, Machine learning algorithms, Machine learning, test suite size, end-user testing
CITATION
Alex Groce, Todd Kulesza, Chaoqiang Zhang, Shalini Shamasunder, Margaret Burnett, Weng-Keen Wong, Simone Stumpf, Shubhomoy Das, Amber Shinsel, Forrest Bice, Kevin McIntosh, "You Are the Only Possible Oracle: Effective Test Selection for End Users of Interactive Machine Learning Systems", IEEE Transactions on Software Engineering, vol. 40, no. 3, pp. 307-323, March 2014, doi:10.1109/TSE.2013.59