Eighth International Conference on Document Analysis and Recognition (ICDAR'05)
The Same is Not The Same - Post Correction of Alphabet Confusion Errors in Mixed-Alphabet OCR Recognition
Seoul, Korea, August 31-September 01
ISBN: 0-7695-2420-6
Character sets for Eastern European languages typically contain symbols that are optically almost or fully identical to Latin letters. When scanning documents with mixed Cyrillic-Latin or Greek-Latin alphabets, even high-quality OCR software is often unable to correctly distinguish between Cyrillic (Greek) and Latin symbols. This effect leads to an error rate far beyond the usual error rates observed when recognizing single-alphabet documents. In this paper we first survey similarities between Latin and Cyrillic (Greek) letters and words for distinct languages and fonts. After briefly introducing a new, public corpus collected by our groups for evaluating OCR technology on mixed-alphabet documents, we describe how to adapt general algorithms and tools for post correction of OCR results to the new context of mixed-alphabet recognition. Experimental results on Bulgarian documents from the corpus and from other sources demonstrate that a drastic reduction of error rates can be achieved.
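To make the alphabet-confusion problem concrete, the following minimal Python sketch shows one way a lexicon lookup could repair a token in which look-alike Latin and Cyrillic letters were mixed. It is an illustration only, not the method of the paper: the homoglyph table, the function names (`to_cyrillic`, `to_latin`, `correct_token`) and the tiny lexicons are all assumptions made for this example.

```python
# Illustrative sketch only (not the authors' algorithm).
# Assumed homoglyph table: Latin letters mapped to look-alike Cyrillic letters.
LATIN_TO_CYRILLIC = {
    "A": "\u0410", "B": "\u0412", "C": "\u0421", "E": "\u0415",
    "H": "\u041d", "K": "\u041a", "M": "\u041c", "O": "\u041e",
    "P": "\u0420", "T": "\u0422", "X": "\u0425",
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
}
# Inverse table: look-alike Cyrillic letters mapped back to Latin.
CYRILLIC_TO_LATIN = {cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()}


def to_cyrillic(token: str) -> str:
    """Replace every Latin look-alike in the token by its Cyrillic twin."""
    return "".join(LATIN_TO_CYRILLIC.get(ch, ch) for ch in token)


def to_latin(token: str) -> str:
    """Replace every Cyrillic look-alike in the token by its Latin twin."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in token)


def correct_token(token: str, cyrillic_lexicon: set, latin_lexicon: set) -> str:
    """Return a lexicon word reachable by homoglyph substitution, else the token."""
    if token in cyrillic_lexicon or token in latin_lexicon:
        return token  # already a known word, leave it alone
    as_cyrillic = to_cyrillic(token)
    if as_cyrillic in cyrillic_lexicon:
        return as_cyrillic
    as_latin = to_latin(token)
    if as_latin in latin_lexicon:
        return as_latin
    return token  # no correction candidate found


# Hypothetical one-word lexicons for the demonstration.
cyrillic_lexicon = {"\u0441\u0435\u043b\u043e"}  # Bulgarian word for "village"
latin_lexicon = {"cop"}

# OCR output with Latin "c", "e", "o" around a genuine Cyrillic letter.
mixed = "ce\u043bo"
fixed = correct_token(mixed, cyrillic_lexicon, latin_lexicon)
```

In this toy setting the mixed token is mapped to the all-Cyrillic lexicon entry. A realistic post-correction system would also need to handle tokens that are valid in both alphabets, font-dependent similarities, and ordinary OCR errors, which this sketch ignores.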
Citation:
Christoph Ringlstetter, Klaus U. Schulz, Stoyan Mihov, Katerina Louka, "The Same is Not The Same - Post Correction of Alphabet Confusion Errors in Mixed-Alphabet OCR Recognition," ICDAR, pp. 406-410, Eighth International Conference on Document Analysis and Recognition (ICDAR'05), 2005