Font Adaptive Word Indexing of Modern Printed Documents
August 2006 (vol. 28 no. 8)
pp. 1187-1199
We propose an approach for the word-level indexing of modern printed documents that are difficult to recognize using current OCR engines. Word-level indexing makes it possible to retrieve the position of words in a document, enabling queries that involve the proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Digital libraries, by contrast, hold collections of digitized documents that can be retrieved only by browsing the document images or by relying on metadata assembled by domain experts. Word indexing tools would therefore improve access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of Self-Organizing Maps (SOM) to perform unsupervised character clustering, the definition of a suitable vector-based word representation whose size depends on the word's aspect ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The approach is best suited to modern printed documents (17th to 19th centuries), where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals.
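The idea of word-level indexing with proximity queries can be illustrated with a minimal positional index. The sketch below is not the system described in the abstract: it assumes plain text tokens as stand-ins for the word images the paper indexes, and the names `build_positional_index` and `proximity_query` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each word to {doc_id: [positions]} (assumes whitespace-separated text)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def proximity_query(index, first, second, max_distance):
    """Return sorted doc ids where both words occur within max_distance positions."""
    postings_first = index.get(first, {})
    postings_second = index.get(second, {})
    hits = set()
    # Only documents containing both words can match.
    for doc_id in postings_first.keys() & postings_second.keys():
        if any(abs(p - q) <= max_distance
               for p in postings_first[doc_id]
               for q in postings_second[doc_id]):
            hits.add(doc_id)
    return sorted(hits)


# Hypothetical sample data, used only to exercise the functions.
documents = {
    "page1": "word indexing of modern printed documents",
    "page2": "printed pages and many other old scanned documents",
}
index = build_positional_index(documents)
adjacent = proximity_query(index, "printed", "documents", 1)
loose = proximity_query(index, "printed", "documents", 7)
```

Because the index stores positions rather than only document identifiers, the same structure answers both a strict query (adjacent words) and a loose one (words anywhere within a window). In the setting of the abstract, the dictionary keys would presumably be word-image representations rather than strings.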


Index Terms:
Clustering, digital libraries, document image retrieval, heuristic oversegmentation, holistic word representation, modern documents, self organizing map.
Citation:
Simone Marinai, Emanuele Marino, Giovanni Soda, "Font Adaptive Word Indexing of Modern Printed Documents," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1187-1199, Aug. 2006, doi:10.1109/TPAMI.2006.162