10th IEEE International Enterprise Distributed Object Computing Conference Workshops (EDOCW'06)
Document Layout Analysis and Classification and Its Application in OCR
Hong Kong, China
October 16-October 20
ISBN: 0-7695-2743-4
Gaurav Gupta, Indian Institute of Technology Kanpur
Shobhit Niranjan, Indian Institute of Technology Kanpur
Ankit Shrivastava, Indian Institute of Technology Kanpur
RMK Sinha, Indian Institute of Technology Kanpur
Digitization of paper-bound documents is one of the foremost commercial interests worldwide. The first step in all such applications is transforming a paper-bound document into an electronic document by scanning it and then applying OCR to the image to generate textual information. In this paper we describe our work, which acts as a pre-processing stage for OCR applications. Automatic document layout extraction and segmentation is done using the spatial configuration of various text/image segments represented as bounding boxes; this segmented layout is then analyzed with certain heuristic tests and each segment is assigned a label (title, authors, abstract, body, header, footer, etc.). This information is then passed to the OCR module through an XML interface, accelerating its performance by allowing it to label recognized text segments and to process only those parts of the document that contain text, which saves computation. Although the work was motivated by application to an automated machine translation system that preserves the overall document layout, it has a number of other applications, such as information retrieval and search. This information is also being used to classify technical documents into three categories, which can be extended to any number of classes based on spatial configuration heuristics.
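The pipeline described in the abstract (bounding boxes, heuristic labelling, XML hand-off to OCR) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Segment` fields, the position thresholds in `label_segment`, the reduced label set, and the XML element and attribute names are all hypothetical.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Segment:
    """A bounding box around one text or image region (hypothetical structure)."""
    x: int          # left edge, in pixels
    y: int          # top edge, in pixels
    w: int          # width, in pixels
    h: int          # height, in pixels
    is_text: bool   # False for image regions, which OCR can skip


def label_segment(seg: Segment, page_h: int) -> str:
    """Label a segment from its vertical position on the page.

    The thresholds are illustrative assumptions, not the paper's heuristics.
    """
    if not seg.is_text:
        return "image"
    top = seg.y / page_h
    bottom = (seg.y + seg.h) / page_h
    if bottom <= 0.05:      # lies entirely in the top 5% of the page
        return "header"
    if top >= 0.95:         # starts in the bottom 5% of the page
        return "footer"
    if top < 0.20:          # near the top, but below the header band
        return "title"
    return "body"


def layout_to_xml(segments, page_w: int, page_h: int) -> str:
    """Serialize the labelled layout so an OCR stage could read it."""
    root = ET.Element("page", width=str(page_w), height=str(page_h))
    for seg in segments:
        ET.SubElement(
            root, "segment",
            label=label_segment(seg, page_h),
            x=str(seg.x), y=str(seg.y), w=str(seg.w), h=str(seg.h),
            ocr=str(seg.is_text).lower(),  # "false" marks regions to skip
        )
    return ET.tostring(root, encoding="unicode")
```

An OCR stage consuming this output would run recognition only on segments marked `ocr="true"` and would attach the `label` value to the recognized text, which is where the computational saving mentioned in the abstract would come from.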
Citation:
Gaurav Gupta, Shobhit Niranjan, Ankit Shrivastava, RMK Sinha, "Document Layout Analysis and Classification and Its Application in OCR," edocw, pp.58, 10th IEEE International Enterprise Distributed Object Computing Conference Workshops (EDOCW'06), 2006