Information Retrieval in Document Image Databases
November 2004 (vol. 16 no. 11)
pp. 1398-1410
With the rising popularity and importance of document images as an information source, information retrieval in document image databases has become a growing and challenging problem. In this paper, we propose an approach capable of matching partial word images to address two issues in document image retrieval: word spotting and similarity measurement between documents. First, each word image is represented by a primitive string. Then, an inexact string matching technique is used to measure the similarity between the two primitive strings generated from two word images. Based on this similarity, we can estimate how relevant one word image is to the other and thereby decide whether one is a portion of the other. To deal with various character fonts, we represent a word image with a primitive string that is tolerant to serif and font differences. Using this inexact string matching technique, our method can handle heavily touching characters. Experimental results on a variety of document image databases confirm the feasibility, validity, and efficiency of the proposed approach in document image retrieval.
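The idea of deciding whether one word image is a portion of another through inexact matching of primitive strings can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the primitive alphabet or the matching costs, so the sketch assumes unit edit costs, uses single letters as stand-ins for primitives, and the function names `partial_match_cost` and `partial_similarity` are hypothetical. It applies a standard approximate-substring dynamic program, in which the query may start and end anywhere in the target at no cost.

```python
from typing import Sequence


def partial_match_cost(query: Sequence[str], target: Sequence[str]) -> int:
    """Minimum unit-cost edit distance between `query` and any contiguous
    part of `target` (assumed costs; the paper's own costs are not given)."""
    # Row 0 is all zeros: the match may begin at any target position for free.
    prev = [0] * (len(target) + 1)
    for i, q in enumerate(query, start=1):
        # Column 0: the first i query primitives matched against nothing.
        curr = [i] + [0] * len(target)
        for j, t in enumerate(target, start=1):
            substitute = prev[j - 1] + (0 if q == t else 1)
            skip_query = prev[j] + 1       # query primitive left unmatched
            skip_target = curr[j - 1] + 1  # extra primitive inside the target
            curr[j] = min(substitute, skip_query, skip_target)
        prev = curr
    # Taking the minimum over the last row lets the match end anywhere.
    return min(prev)


def partial_similarity(query: Sequence[str], target: Sequence[str]) -> float:
    """Similarity in [0, 1]; 1.0 means `query` occurs exactly inside `target`."""
    if not query:
        return 1.0
    return 1.0 - partial_match_cost(query, target) / len(query)


if __name__ == "__main__":
    # Placeholder primitive strings: letters stand in for stroke primitives.
    word = list("unhealthy")
    print(partial_similarity(list("health"), word))
    print(partial_similarity(list("wealth"), word))
```

Under these assumptions, a query whose primitive string appears unchanged inside the target scores 1.0, and each mismatched, missing, or extra primitive lowers the score by one over the query length. A retrieval system would compare this score against a threshold, which would need to be tuned on real data.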

Index Terms:
Document image retrieval, partial word image matching, primitive string, word searching, document similarity measurement.
Yue Lu, Chew Lim Tan, "Information Retrieval in Document Image Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1398-1410, Nov. 2004, doi:10.1109/TKDE.2004.76