What Size Test Set Gives Good Error Rate Estimates?
January 1998 (vol. 20 no. 1)
pp. 52-64

Abstract—We address the problem of determining what size test set guarantees statistically significant results in a character recognition task, as a function of the expected error rate. We provide a statistical analysis showing that if, for example, the expected character error rate is around 1 percent, then, with a test set of at least 10,000 statistically independent handwritten characters (which could be obtained by taking 100 characters from each of 100 different writers), we guarantee, with 95 percent confidence, that: (1) The expected value of the character error rate is not worse than 1.25 E, where E is the empirical character error rate of the best recognizer, calculated on the test set; and (2) a difference of 0.3 E between the error rates of two recognizers is significant. We developed this framework with character recognition applications in mind, but it applies as well to speech recognition and to other pattern recognition problems.

Index Terms:
Pattern recognition, test set, test set size, benchmark, hypothesis testing, designed experiment, statistical significance, estimation, guaranteed estimators, recognition error.
Isabelle Guyon, John Makhoul, Richard Schwartz, Vladimir Vapnik, "What Size Test Set Gives Good Error Rate Estimates?," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 52-64, Jan. 1998, doi:10.1109/34.655649