2006 10th IEEE International Enterprise Distributed Object Computing Conference Workshops (EDOCW'06) (2006)
Hong Kong, China
Oct. 16, 2006 to Oct. 20, 2006
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/EDOCW.2006.29
Gaurav Gupta, Indian Institute of Technology Kanpur
Shobhit Niranjan, Indian Institute of Technology Kanpur
Ankit Shrivastava, Indian Institute of Technology Kanpur
RMK Sinha, Indian Institute of Technology Kanpur
Digitization of paper-bound documents is one of the foremost commercial interests worldwide. The first step in all such applications is to transform a paper-bound document into an electronic document by scanning it and then applying OCR to the image to generate textual information. In this paper we describe our work, which acts as a pre-processing stage for an OCR application. Automatic document layout extraction and segmentation is done using the spatial configuration of the various text and image segments, represented as bounding boxes; this segmented layout is then analyzed with certain heuristic tests, and each segment is assigned a label (title, authors, abstract, body, header, footer, etc.). This information is then passed to the OCR module through an XML interface, improving its performance by allowing it to label recognized text segments and to process only those parts of the document that contain text, which saves computation. Although the work was motivated by an automated machine translation system that preserves the overall document layout, it has a number of other applications, such as information retrieval and search. This information is also being used to classify technical documents into three categories, which can be extended to any number of classes based on spatial configuration heuristics.
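The pipeline described in the abstract (bounding boxes, heuristic labeling, XML hand-off to OCR) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Segment` type, the normalized page coordinates, the threshold values, the label set, and the XML element and attribute names are all hypothetical.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Segment:
    """A bounding box in page-relative units (0.0 to 1.0, origin at top-left)."""
    x: float   # left edge
    y: float   # top edge
    w: float   # width
    h: float   # height
    kind: str  # "text" or "image"


def label_segment(seg: Segment) -> str:
    """Assign a label from position and size alone.

    The thresholds are illustrative guesses, not values from the paper.
    """
    if seg.kind == "image":
        return "figure"
    if seg.y + seg.h <= 0.06:            # entirely inside the top margin
        return "header"
    if seg.y >= 0.94:                    # entirely inside the bottom margin
        return "footer"
    if seg.y < 0.20 and seg.w >= 0.5:    # wide block near the top of the page
        return "title"
    return "body"


def layout_to_xml(segments) -> str:
    """Serialize labeled segments so an OCR stage can skip non-text regions."""
    page = ET.Element("page")
    for seg in segments:
        ET.SubElement(page, "segment", {
            "label": label_segment(seg),
            "ocr": "yes" if seg.kind == "text" else "no",
            "x": f"{seg.x:.2f}", "y": f"{seg.y:.2f}",
            "w": f"{seg.w:.2f}", "h": f"{seg.h:.2f}",
        })
    return ET.tostring(page, encoding="unicode")


# Hypothetical page: running header, title, one text column, one figure, footer.
segments = [
    Segment(0.10, 0.01, 0.80, 0.03, "text"),
    Segment(0.15, 0.10, 0.70, 0.05, "text"),
    Segment(0.10, 0.30, 0.38, 0.50, "text"),
    Segment(0.52, 0.30, 0.38, 0.25, "image"),
    Segment(0.10, 0.96, 0.80, 0.02, "text"),
]
labels = [label_segment(s) for s in segments]
xml_out = layout_to_xml(segments)
print(xml_out)
```

In this sketch the `ocr="no"` attribute marks image regions so that a downstream OCR stage could skip them, which is one possible reading of the computational saving the abstract mentions.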
R. Sinha, S. Niranjan, G. Gupta and A. Shrivastava, "Document Layout Analysis and Classification and Its Application in OCR," 2006 10th IEEE International Enterprise Distributed Object Computing Conference Workshops (EDOCW'06), Hong Kong, China, 2006, pp. 58.