Curitiba, Parana, Brazil
Sept. 23, 2007 to Sept. 26, 2007
T. Breuel, U. Kaiserslautern and DFKI, Germany
Large scale scanning and document conversion efforts have led to a renewed interest in OCR systems and workflows. This paper describes a new format for representing both intermediate and final OCR results, developed in response to the needs of a newly developed OCR system and ground truth data release. The format embeds OCR information invisibly inside the HTML and CSS standards and therefore can represent a wide range of linguistic and typographic phenomena with already well-defined, widely understood markup and can be processed using widely available and known tools. The format is based on a new, multi-level abstraction of OCR results based on logical markup, common typesetting models, and OCR engine-specific markup, making it suitable both for the support of existing workflows and the development of future model-based OCR engines.
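As an illustration of the claim that such output can be processed with widely available tools, the following is a minimal sketch using only Python's standard-library HTML parser. It assumes the common hOCR convention of marking text lines with `class="ocr_line"` and storing a bounding box as `bbox x0 y0 x1 y1` in the `title` attribute; the sample markup and the names `SAMPLE`, `parse_bbox`, `HocrLines`, and `extract_lines` are hypothetical and do not come from the paper.

```python
from html.parser import HTMLParser

# Hypothetical sample document, written for this sketch only.
SAMPLE = """<div class="ocr_page" title="bbox 0 0 800 600">
<span class="ocr_line" title="bbox 10 20 300 50">Hello world</span>
<span class="ocr_line" title="bbox 10 60 280 90">second line</span>
</div>"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title such as 'bbox 10 20 300 50; ...'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:])
    return None  # no bbox property found


class HocrLines(HTMLParser):
    """Collect (bbox, text) pairs for every span carrying the ocr_line class."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0    # span nesting depth inside the current line
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attributes = dict(attrs)
        if self._depth > 0:
            # A nested span (for example a word-level element) inside a line.
            self._depth += 1
        elif "ocr_line" in (attributes.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attributes.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._depth > 0:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.lines.append((self._bbox, "".join(self._text).strip()))


def extract_lines(html):
    parser = HocrLines()
    parser.feed(html)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    for bbox, text in extract_lines(SAMPLE):
        print(bbox, text)
```

Because the OCR metadata sits in ordinary `class` and `title` attributes, the same file still renders as plain text in a browser, while a generic HTML parser is enough to recover the layout information.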
T. Breuel, "The hOCR Microformat for OCR Workflow and Results", International Conference on Document Analysis and Recognition (ICDAR), 2007, pp. 1063-1067, doi:10.1109/ICDAR.2007.249