Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2
Content-level Annotation of Large Collection of Printed Document Images
Curitiba, Parana, Brazil
September 23-26, 2007
ISBN: 0-7695-2822-8
A. Kumar, International Institute of Information Technology, Hyderabad - 500032, INDIA
C.V. Jawahar, International Institute of Information Technology, Hyderabad - 500032, INDIA
A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creating annotated corpora is tedious and laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotating large collections of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate large collections of documents, scanned at three different resolutions, at the character level. We employ an XML representation to store the annotation information. APIs are provided for content-level access, for easy use in training and evaluating OCRs and in other document understanding tasks.
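As a rough illustration of what hierarchical XML annotation with content-level access could look like, here is a minimal sketch. The schema is an assumption made for this example only: the element names (`page`, `line`, `word`, `char`), the `bbox`, `text`, `image`, and `dpi` attributes, and the helper functions `parse_bbox` and `iter_units` are all hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for one scanned page. Element and attribute names
# are illustrative assumptions, not the schema used by the authors.
SAMPLE = """
<page image="page_001.png" dpi="300">
  <line bbox="10 10 200 40">
    <word bbox="10 10 60 40" text="ab">
      <char bbox="10 10 30 40" text="a"/>
      <char bbox="32 10 60 40" text="b"/>
    </word>
    <word bbox="70 10 100 40" text="c">
      <char bbox="70 10 100 40" text="c"/>
    </word>
  </line>
</page>
"""


def parse_bbox(value):
    """Turn a 'x1 y1 x2 y2' attribute string into a tuple of ints."""
    return tuple(int(v) for v in value.split())


def iter_units(root, level):
    """Yield (text, bbox) for every element at the given level, in document order."""
    for node in root.iter(level):
        yield node.get("text"), parse_bbox(node.get("bbox"))


root = ET.fromstring(SAMPLE.strip())
chars = list(iter_units(root, "char"))  # character-level ground truth
words = list(iter_units(root, "word"))  # word-level ground truth
```

In this sketch, each pair in `chars` couples a keyed-in character with the image region it is aligned to, which is the kind of record an OCR training or evaluation routine would consume. The same accessor serves the word level by changing the `level` argument.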
Citation:
A. Kumar and C.V. Jawahar, "Content-level Annotation of Large Collection of Printed Document Images," Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, pp. 799-803, 2007.