Issue No.04 - April (2009 vol.31)
pp: 762
Published by the IEEE Computer Society
George Nagy, Rensselaer Polytechnic Institute, Troy
Sharad C. Seth, University of Nebraska-Lincoln, Lincoln
Mahesh Viswanathan, IBM, Yorktown Heights
ABSTRACT
A methodological error in the evaluation of page segmentation algorithms is examined.
A detailed comparative evaluation of six page segmentation methods, reported recently by Shafait et al. [1], follows the experimental protocol of an earlier study by Mao and Kanungo [2] that purported to show that recursive X-Y cut (RXYC) segmentation is much more error-prone than competitive methods, even on isothetic layouts. It does not, however, require much experimentation or reflection to discover that all pixel-projection methods (for either segmentation or skew determination) require removing any black background introduced by optical scanning.
We reported good results for RXYC segmentation in 1992 [3], from which Shafait et al. extracted the algorithm, and added to the evidence in [4]. The documents used in our experiments were either scanned or copied against a white background or obtained from an IEEE CD-ROM (this was pre-Web!) with all-white backgrounds. The page images tested in [1] and [2] were drawn from the University of Washington data set [5], which was evidently scanned against a black (or nonreflective) background. As indicated in Fig. 6 of [1], black borders probably account for most of the RXYC segmentation errors.
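The effect of a black border on projection-based cutting can be illustrated with a minimal sketch. It is not the RXYC implementation from [3], only a toy row-projection profile on a hypothetical 12x12 binary page (1 = black); the function names and the page layout are assumptions made for illustration. A single black strip along one edge fills every white gap in the profile, so no cut position remains.

```python
def projection_profile(image, axis):
    """Count black (1) pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def white_gaps(profile):
    """Return (start, end) index ranges where the profile is zero.

    These all-white runs are the candidate cut positions for a
    projection-based segmenter.
    """
    gaps, start = [], None
    for i, count in enumerate(profile):
        if count == 0 and start is None:
            start = i
        elif count != 0 and start is not None:
            gaps.append((start, i))
            start = None
    if start is not None:
        gaps.append((start, len(profile)))
    return gaps


def make_page(with_border):
    """Hypothetical page: two text blocks separated by white rows."""
    page = [[0] * 12 for _ in range(12)]
    for r in (2, 3, 4, 7, 8, 9):
        for c in range(3, 9):
            page[r][c] = 1
    if with_border:
        for r in range(12):
            page[r][0] = 1  # black strip along the left edge of the scan
    return page


clean_gaps = white_gaps(projection_profile(make_page(False), axis=0))
bordered_gaps = white_gaps(projection_profile(make_page(True), axis=0))
print(clean_gaps)     # gaps above, between, and below the two blocks
print(bordered_gaps)  # no gaps: the border strip touches every row
```

On the clean page the profile exposes the white band between the two blocks; with the strip added, every row contains at least one black pixel and the list of gaps is empty.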
A reasonable motivation for a nonreflective background is that detecting the edges of the paper greatly simplifies eliminating black pixels that do not belong to the page, as well as estimating and removing any skew introduced in scanning. Neither of these steps was performed in the preparation of the database (presumably to allow testing different methods).
Like Mao and Kanungo, Shafait et al. suggest that the poor performance of the X-Y tree method is due to its vulnerability to noise. (Mao and Kanungo also called the black borders "noise" and suggested removing them, but did not do so.) Black borders are not noise. In image or signal processing, noise denotes random, unpredictable phenomena, not artifacts that are entirely predictable. Even low-end fax scanners avoid adding black borders.
Performing the simple but essential step of border removal would not have violated the spirit of Mao and Kanungo's general approach, or of the valuable pixel-based ground truth and "vectorial" performance measure proposed by Shafait et al. It would have much improved these otherwise careful and thorough evaluations of page segmentation algorithms.

    G. Nagy is with the Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, JEC 6020, 110 8th Street, Troy, NY 12180-0115. E-mail: nagy@ecse.rpi.edu.

    S. Seth is with the Department of Computer Science and Engineering, University of Nebraska-Lincoln, Avery Hall, Room 359, Lincoln, NE 68588-0115. E-mail: seth@cse.unl.edu.

    M. Viswanathan is with the IBM T.J. Watson Research Center, 294 Route 100, Mail Stop: SOM4/4L04, Somers, NY 10589. E-mail: maheshv@us.ibm.com.

Manuscript received 30 May 2008; accepted 14 July 2008; published online 22 July 2008.

Recommended for acceptance by L. O'Gorman.

For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMI-2008-05-0314.

Digital Object Identifier no. 10.1109/TPAMI.2008.192.

REFERENCES
