Eighth International Conference on Document Analysis and Recognition (ICDAR'05)
Fast Optical Character Recognition through Glyph Hashing for Document Conversion
Seoul, Korea
August 31-September 01
ISBN: 0-7695-2420-6
Kumar Chellapilla, Microsoft Research
Patrice Simard, Microsoft Research
Radoslav Nickolov, Microsoft Research

This paper proposes a glyph hashing approach to optical character recognition with applications in document conversion. The viability and efficiency of the approach are tested through its implementation in a print driver on 68,987 PDF documents containing 1.15 billion characters. Results indicate that (a) a hash table with 3.2 million hashes is sufficient to represent all characters from these documents, and (b) 480 fonts are sufficient to cover over 90% of these documents. Glyph recognition experiments indicate that 80% of unique character glyphs and over 96% of all characters from unseen documents can be found in a hash table built using all 68,987 documents.
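The two recognition figures above measure different things: the share of distinct glyphs found in the table, and the share of character occurrences found in it. Frequent glyphs dominate the second figure, which is why it can be much higher than the first. The following is a minimal sketch of that measurement, not the authors' implementation. The abstract does not name a hash function or a glyph representation, so `glyph_key`, the use of SHA-1, and the raw-bitmap input are all assumptions made for illustration.

```python
import hashlib
from collections import Counter
from typing import Iterable, Set, Tuple


def glyph_key(bitmap: bytes, width: int, height: int) -> str:
    """Hypothetical glyph key: SHA-1 over the dimensions and raw bitmap bytes.

    The paper's actual hash function is not given in the abstract.
    """
    h = hashlib.sha1()
    h.update(width.to_bytes(2, "big"))
    h.update(height.to_bytes(2, "big"))
    h.update(bitmap)
    return h.hexdigest()


def coverage(table_keys: Set[str], unseen_keys: Iterable[str]) -> Tuple[float, float]:
    """Return (unique-glyph coverage, per-character coverage) for unseen glyphs."""
    counts = Counter(unseen_keys)
    total_chars = sum(counts.values())
    if total_chars == 0:
        return 0.0, 0.0
    # Each distinct glyph counts once here, however often it occurs.
    unique_hits = sum(1 for key in counts if key in table_keys)
    # Each occurrence counts here, so common glyphs carry more weight.
    char_hits = sum(n for key, n in counts.items() if key in table_keys)
    return unique_hits / len(counts), char_hits / total_chars
```

With a table that holds the keys for two glyphs and an unseen stream made of three occurrences of the first, one of the second, and one unknown glyph, `coverage` returns two thirds for unique glyphs and four fifths for characters.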

The hashing approach recognizes not only the character code but also the size, style (bold, italic, etc.), and font name. We found that the hashing approach can scale to hundreds of fonts and thousands of characters per font. Further, it is extremely fast and can recognize over 100,000 characters per second. Owing to its speed, such a hashing approach can complement any existing OCR system by acting as a pre-filter, producing a four- to fivefold speedup during document conversion.
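The pre-filter idea can be sketched as a dictionary lookup with a fallback: glyphs found in the table are resolved immediately, and only the misses are handed to a slower recognizer. This is an illustrative sketch under stated assumptions rather than the paper's design. `GlyphInfo`, `GlyphHashTable`, `bitmap_digest`, and the `fallback_ocr` callback are hypothetical names, SHA-1 stands in for the unspecified hash, and exact-match lookup assumes glyphs arrive as cleanly rendered bitmaps, as they presumably do in a print driver.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, NamedTuple, Optional, Tuple

Glyph = Tuple[bytes, int, int]  # (raw bitmap bytes, width, height); assumed layout


class GlyphInfo(NamedTuple):
    char_code: int   # Unicode code point
    font_name: str
    size_pt: float
    style: str       # e.g. "regular", "bold", "italic"


def bitmap_digest(bitmap: bytes, width: int, height: int) -> bytes:
    """Hypothetical glyph hash: SHA-1 over dimensions plus bitmap bytes."""
    h = hashlib.sha1()
    h.update(width.to_bytes(2, "big"))
    h.update(height.to_bytes(2, "big"))
    h.update(bitmap)
    return h.digest()


class GlyphHashTable:
    """Maps a glyph hash to the character code, font, size, and style."""

    def __init__(self) -> None:
        self._table: Dict[bytes, GlyphInfo] = {}

    def add(self, glyph: Glyph, info: GlyphInfo) -> None:
        self._table[bitmap_digest(*glyph)] = info

    def lookup(self, glyph: Glyph) -> Optional[GlyphInfo]:
        return self._table.get(bitmap_digest(*glyph))


def recognize(
    glyphs: Iterable[Glyph],
    table: GlyphHashTable,
    fallback_ocr: Callable[[Glyph], GlyphInfo],
) -> Tuple[List[GlyphInfo], int]:
    """Resolve glyphs via the table first; call the slow recognizer on misses.

    Returns the recognized glyphs in order and the number of table hits.
    """
    results: List[GlyphInfo] = []
    hits = 0
    for glyph in glyphs:
        info = table.lookup(glyph)
        if info is None:
            info = fallback_ocr(glyph)  # slow path, taken only on a miss
        else:
            hits += 1
        results.append(info)
    return results, hits
```

The more characters the table resolves, the less often the slow path runs, which is where any overall speedup would come from in a design like this.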

Citation:
Kumar Chellapilla, Patrice Simard, Radoslav Nickolov, "Fast Optical Character Recognition through Glyph Hashing for Document Conversion," Eighth International Conference on Document Analysis and Recognition (ICDAR'05), pp. 829-834, 2005.