Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2
Content-level Annotation of Large Collection of Printed Document Images
Curitiba, Parana, Brazil
September 23-September 26
ISBN: 0-7695-2822-8
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDAR.2007.89
A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creating annotated corpora is tedious and laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotating large collections of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate large collections of documents, scanned at three different resolutions, at the character level. We employ an XML representation for storing the annotation information. APIs are provided for content-level access, for easy use in training and evaluating OCRs and in other document understanding tasks.
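As an illustration of what XML-backed, content-level access could look like, the following is a minimal sketch only. The abstract does not specify the schema or the API, so the element names (page, line, word, char), the attributes (bbox, text, dpi) and the helper functions parse_bbox, text_of and iter_units are all hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for one scanned page. The hierarchy
# page > line > word > char and all attribute names are assumptions
# made for this sketch, not the schema used by the authors.
SAMPLE = """<page image="page_0001.png" dpi="300">
  <line bbox="10 10 200 40">
    <word bbox="10 10 90 40" text="ab">
      <char bbox="10 10 50 40" text="a"/>
      <char bbox="50 10 90 40" text="b"/>
    </word>
    <word bbox="110 10 150 40" text="c">
      <char bbox="110 10 150 40" text="c"/>
    </word>
  </line>
</page>"""


def parse_bbox(value):
    """Turn 'x0 y0 x1 y1' into a tuple of four ints."""
    x0, y0, x1, y1 = (int(v) for v in value.split())
    return (x0, y0, x1, y1)


def text_of(element):
    """Return the keyed-in text attached to an element.

    Words and characters carry their text directly; a line's text is
    rebuilt by joining the text of its words with single spaces.
    """
    if element.tag == "line":
        return " ".join(w.get("text", "") for w in element.iter("word"))
    return element.get("text", "")


def iter_units(root, level):
    """Yield (text, bbox) pairs for every element at the given level.

    `level` is one of 'line', 'word' or 'char' (hypothetical names).
    """
    for element in root.iter(level):
        yield text_of(element), parse_bbox(element.get("bbox"))


root = ET.fromstring(SAMPLE)
chars = list(iter_units(root, "char"))
words = list(iter_units(root, "word"))
lines = list(iter_units(root, "line"))
```

Under these assumptions, the same query works at any level of the hierarchy, so an OCR training script could request character boxes while an evaluation script requests word or line text from the same file.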
Citation:
A. Kumar, C.V. Jawahar, "Content-level Annotation of Large Collection of Printed Document Images," icdar, vol. 2, pp.799-803, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2, 2007