Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)
Curitiba, Parana, Brazil
Sept. 23, 2007 to Sept. 26, 2007
ISSN: 1520-5363
ISBN: 0-7695-2822-8
pp: 1063-1067
T. Breuel, U. Kaiserslautern and DFKI, Germany
ABSTRACT
Large scale scanning and document conversion efforts have led to a renewed interest in OCR systems and workflows. This paper describes a new format for representing both intermediate and final OCR results, developed in response to the needs of a newly developed OCR system and ground truth data release. The format embeds OCR information invisibly inside the HTML and CSS standards and therefore can represent a wide range of linguistic and typographic phenomena with already well-defined, widely understood markup and can be processed using widely available and known tools. The format is based on a new, multi-level abstraction of OCR results based on logical markup, common typesetting models, and OCR engine-specific markup, making it suitable both for the support of existing workflows and the development of future model-based OCR engines.
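As an illustration of the claim that OCR results embedded in HTML can be processed with widely available tools, the following is a minimal sketch using only the Python standard library. The class names `ocr_page` and `ocr_line` and the `bbox` property in the `title` attribute follow the published hOCR specification rather than anything stated in this abstract, and the sample document, the `parse_bbox` helper, the `LineExtractor` class, and the `extract_lines` function are hypothetical names invented for this example.

```python
from html.parser import HTMLParser

# Invented sample: two text lines with pixel bounding boxes stored in
# the title attribute, as the hOCR specification describes.
SAMPLE = """<html><body>
<div class="ocr_page" title="bbox 0 0 2480 3508">
<span class="ocr_line" title="bbox 210 300 1900 360">Large scale scanning</span>
<span class="ocr_line" title="bbox 210 370 1850 430">and document conversion</span>
</div>
</body></html>"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title such as 'bbox 1 2 3 4; x 5', or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class LineExtractor(HTMLParser):
    """Collect (bbox, text) pairs for every span marked class="ocr_line"."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0      # span nesting depth inside the current line
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth > 0:
            # Nested span (for example a word-level element) inside a line.
            self._depth += 1
            return
        attr = dict(attrs)
        if "ocr_line" in (attr.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attr.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._depth > 0:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            self.lines.append((self._bbox, "".join(self._text).strip()))


def extract_lines(html):
    parser = LineExtractor()
    parser.feed(html)
    return parser.lines


if __name__ == "__main__":
    for bbox, text in extract_lines(SAMPLE):
        print(bbox, text)
```

Because the layout data lives in ordinary attributes, the same file still renders as plain text in a browser, while a generic HTML parser is enough to recover the geometry.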
CITATION

T. Breuel, "The hOCR Microformat for OCR Workflow and Results," Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, Parana, Brazil, 2007, pp. 1063-1067.
doi:10.1109/ICDAR.2007.249