2011 International Conference on Document Analysis and Recognition (2011)
Sept. 18, 2011 to Sept. 21, 2011
In general, document image analysis methods serve as pre-processing steps for Optical Character Recognition (OCR) systems. In contrast, the proposed method aims at clustering document snippets, enabling an automated clustering of documents. To this end, words are classified as printed text, manuscript, or noise, where the third class corrects falsely segmented background elements. Once the text elements are classified, a layout analysis groups words into text lines and paragraphs. A back propagation of the class weights, which are assigned to each word in the first step, enables the correction of wrong class labels. The proposed method shows promising results on a dataset consisting of document snippets with varying shapes, content, writing, and layout. In addition, the system is compared to page segmentation methods of the ICDAR 2009 Page Segmentation Competition.
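The label-correction step can be illustrated with a minimal sketch. The abstract does not specify how the class weights are propagated, so the rule below is an assumption: weights are summed per text line and every word takes the line's strongest class. The function name `relabel_by_line`, the dictionary layout, and the sample weights are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Class names follow the abstract; everything else here is an illustrative assumption.
CLASSES = ("printed", "manuscript", "noise")

def relabel_by_line(words):
    """Return one label per word, using weights accumulated over its text line.

    words: list of dicts with 'line' (text line id) and 'weights'
    (mapping from class name to a non-negative weight).
    """
    # Accumulate the class weights of all words that share a text line.
    line_totals = defaultdict(lambda: dict.fromkeys(CLASSES, 0.0))
    for w in words:
        for c in CLASSES:
            line_totals[w["line"]][c] += w["weights"].get(c, 0.0)
    # Each word takes the class with the largest accumulated weight on its line.
    return [max(CLASSES, key=lambda c: line_totals[w["line"]][c]) for w in words]

# Made-up weights: the second word alone would be labeled "manuscript".
words = [
    {"line": 0, "weights": {"printed": 0.9, "manuscript": 0.1, "noise": 0.0}},
    {"line": 0, "weights": {"printed": 0.4, "manuscript": 0.5, "noise": 0.1}},
    {"line": 0, "weights": {"printed": 0.8, "manuscript": 0.1, "noise": 0.1}},
    {"line": 1, "weights": {"printed": 0.1, "manuscript": 0.7, "noise": 0.2}},
]
labels = relabel_by_line(words)
print(labels)
```

In this toy input the second word of line 0 is relabeled as printed text because its neighbors on the same line carry more printed-text weight. The same aggregation could be applied at paragraph level instead of line level.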
local features, text classification, layout analysis
F. Kleber, R. Sablatnig and M. Diem, "Text Classification and Document Layout Analysis of Paper Fragments," 2011 International Conference on Document Analysis and Recognition (ICDAR), Beijing, China, 2011, pp. 854-858.