A Statistical, Nonparametric Methodology for Document Degradation Model Validation
November 2000 (vol. 22 no. 11)
pp. 1209-1223

Abstract—Printing, photocopying, and scanning processes degrade the image quality of a document. Statistical models of these degradation processes are crucial for document image understanding research. Models allow us to predict system performance, conduct controlled experiments to study the breakdown points of the systems, create large multilingual data sets with groundtruth for training classifiers, design optimal noise removal algorithms, choose values for the free parameters of the algorithms, and so on. Although research in document understanding started many decades ago, only two document degradation models have been proposed thus far. Furthermore, no attempts have been made to statistically validate these models. In this paper, we present a statistical methodology that can be used to validate local degradation models. This method is based on a nonparametric, two-sample permutation test. Another standard statistical device—the power function—is then used to choose between algorithm variables such as distance functions. Since the validation and the power function procedures are independent of the model, they can be used to validate any other degradation model. A method for comparing any two models is also described. It uses p-values associated with the estimated models to select the model that is closer to the real world.
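
The two-sample permutation test and the power function named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each document image has already been reduced to a single scalar feature, and it uses an absolute difference of means as a stand-in for the distance functions the paper compares. The names `permutation_p_value`, `abs_mean_difference`, and `rejection_rate` are hypothetical and chosen for this example only.

```python
import random


def abs_mean_difference(a, b):
    """Stand-in test statistic (an assumption): |mean(a) - mean(b)|."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_p_value(real, synthetic, statistic,
                        n_permutations=999, seed=0):
    """Monte Carlo p-value for a two-sample permutation test.

    Null hypothesis: `real` and `synthetic` come from the same
    distribution, so relabelling the pooled samples should not
    systematically shrink the statistic.
    """
    rng = random.Random(seed)
    observed = statistic(real, synthetic)
    pooled = list(real) + list(synthetic)
    n_real = len(real)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        permuted = statistic(pooled[:n_real], pooled[n_real:])
        if permuted >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed labelling as one permutation.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


def rejection_rate(draw_real, draw_synthetic, statistic,
                   alpha=0.05, n_trials=50, n_permutations=199, seed=0):
    """Estimate one point of the power function.

    Repeats the test on fresh sample pairs and returns the fraction of
    trials in which the null hypothesis is rejected at level `alpha`.
    `draw_real` and `draw_synthetic` are callables taking a
    random.Random instance and returning a list of feature values.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        p = permutation_p_value(draw_real(rng), draw_synthetic(rng),
                                statistic, n_permutations,
                                seed=rng.randrange(2 ** 32))
        if p <= alpha:
            rejections += 1
    return rejections / n_trials


if __name__ == "__main__":
    # Toy data: Gaussian features whose means differ by `shift`.
    def draw(shift):
        return lambda rng: [rng.gauss(shift, 1.0) for _ in range(30)]

    for shift in (0.0, 0.5, 1.0):
        rate = rejection_rate(draw(0.0), draw(shift), abs_mean_difference)
        print(f"shift={shift}: estimated rejection rate {rate:.2f}")
```

Evaluating `rejection_rate` over a range of shifts traces out an empirical power curve. Under this reading, comparing such curves for different statistics is one way to choose among candidate distance functions, and comparing the p-values obtained for two estimated models is one way to judge which is closer to the real data.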

Index Terms:
Model validation, nonparametric statistical tests, permutation tests, document degradation models, simulation models, OCR.
Citation:
Tapas Kanungo, Robert M. Haralick, Henry S. Baird, Werner Stuetzle, David Madigan, "A Statistical, Nonparametric Methodology for Document Degradation Model Validation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp. 1209-1223, Nov. 2000, doi:10.1109/34.888707