First International Workshop on Document Image Analysis for Libraries (DIAL'04)
Tools for Enabling Digital Access to Multi-Lingual Indic Documents
Palo Alto, California
January 23-January 24
ISBN: 0-7695-2088-X
Venu Govindaraju, State University of New York at Buffalo
Swapnil Khedekar, State University of New York at Buffalo
Suryaprakash Kompalli, State University of New York at Buffalo
Faisal Farooq, State University of New York at Buffalo
Srirangaraj Setlur, State University of New York at Buffalo
Ramanaprasad Vemulapati, State University of New York at Buffalo
In this paper we present methodologies for three important tasks that will eventually enable digital access to multilingual Indian document images. First, we describe several document image analysis techniques necessary to prepare Devanagari document images for OCR. The second task is OCR for machine-printed Devanagari words without the help of a lexicon. We describe the OCR methodology and show how it is being extended to other Indian languages. Finally, we describe a versatile platform that facilitates automatic segmentation of document images in multiple Indian languages, together with an interface to capture the ground truth corresponding to the text. We use transliterated English text and virtual keyboards in a range of Indian languages for this purpose. The multi-lingual data entry capabilities of the tool and its underlying Unicode data representation within a structured XML document also allow users to annotate passages of text in one language in other languages, using a markup scheme to switch between scripts. Text and annotations are rendered in the appropriate scripts as the text is being annotated, thus providing users prompt and natural feedback. The XML back-end allows metadata describing the annotated document to be recorded.
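As a rough illustration of the kind of Unicode-in-XML representation the abstract describes, the following minimal sketch stores a Devanagari passage, annotations in other languages, and a metadata field in one XML tree. The element and attribute names (`document`, `metadata`, `passage`, `annotation`, `lang`) and the helper `build_annotated_document` are hypothetical and chosen only for this example; the abstract does not specify the tool's actual schema or markup scheme.

```python
import xml.etree.ElementTree as ET


def build_annotated_document(title, source_lang, text, annotations):
    """Build an XML tree with metadata, one passage, and per-language annotations.

    All element and attribute names here are assumptions made for
    illustration, not the schema used by the tool in the paper.
    """
    doc = ET.Element("document")

    # Metadata describing the annotated document.
    meta = ET.SubElement(doc, "metadata")
    ET.SubElement(meta, "title").text = title

    # The ground-truth text, tagged with its language.
    passage = ET.SubElement(doc, "passage", {"lang": source_lang})
    passage.text = text

    # Annotations in other languages; the lang attribute marks the switch.
    for lang, note in annotations:
        ann = ET.SubElement(passage, "annotation", {"lang": lang})
        ann.text = note
    return doc


doc = build_annotated_document(
    title="Sample page",
    source_lang="hi",
    text="नमस्ते",
    annotations=[("en", "greeting"), ("ta", "வணக்கம்")],
)

# Serialize as UTF-8 and parse it back to confirm the scripts survive intact.
xml_bytes = ET.tostring(doc, encoding="utf-8")
roundtrip = ET.fromstring(xml_bytes)
```

Tagging each annotation with a language attribute lets a renderer pick the appropriate script per element, which is one plausible way to support the script switching mentioned above.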
Citation:
Venu Govindaraju, Swapnil Khedekar, Suryaprakash Kompalli, Faisal Farooq, Srirangaraj Setlur, Ramanaprasad Vemulapati, "Tools for Enabling Digital Access to Multi-Lingual Indic Documents," First International Workshop on Document Image Analysis for Libraries (DIAL'04), p. 122, 2004