Imaged Document Text Retrieval Without OCR
June 2002 (vol. 24 no. 6)
pp. 838-844

We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects. Two image features, the Vertical Traverse Density (VTD) and the Horizontal Traverse Density (HTD), are extracted from each character object. An n-gram based document vector is constructed for each document from these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese, as well as images from the UW1 database, confirms the validity of the proposed method.
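The pipeline summarized above can be illustrated with a minimal Python sketch. The abstract does not define the features precisely, so the sketch rests on labeled assumptions: a traverse density is taken to be the number of background-to-ink transitions met along each scan line of a binarized character object; the character-class labels fed to the n-gram step are hypothetical stand-ins for whatever quantization of VTD and HTD the paper uses; and n = 3 is an arbitrary choice. All function names are illustrative and do not come from the paper.

```python
from collections import Counter
from math import sqrt


def traverse_density(bitmap, axis):
    """Count background-to-ink transitions along each scan line.

    bitmap: list of rows, each a list of 0 (background) or 1 (ink).
    axis="vertical" scans every column top to bottom (VTD-like);
    any other value scans every row left to right (HTD-like).
    Assumption: this transition count stands in for the paper's feature.
    """
    lines = zip(*bitmap) if axis == "vertical" else bitmap
    densities = []
    for line in lines:
        crossings = 0
        previous = 0  # area outside the bitmap is treated as background
        for pixel in line:
            if pixel == 1 and previous == 0:
                crossings += 1
            previous = pixel
        densities.append(crossings)
    return densities


def ngram_vector(labels, n=3):
    """Build a unit-length n-gram count vector from character-class labels.

    labels: sequence of hashable class labels, one per character object
    (hypothetical: assumed to come from quantizing the image features).
    """
    counts = Counter(tuple(labels[i:i + n]) for i in range(len(labels) - n + 1))
    norm = sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()} if norm else {}


def similarity(vec_a, vec_b):
    """Dot product of two sparse n-gram vectors stored as dicts."""
    return sum(weight * vec_b.get(gram, 0.0) for gram, weight in vec_a.items())


# Toy character object shaped roughly like the letter "H".
glyph = [
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
]
vtd = traverse_density(glyph, "vertical")
htd = traverse_density(glyph, "horizontal")

# Toy documents expressed as sequences of hypothetical class labels.
doc_a = ngram_vector(list("abcabc"))
doc_b = ngram_vector(list("xyzxyz"))
self_score = similarity(doc_a, doc_a)
cross_score = similarity(doc_a, doc_b)
```

Because the vectors are normalized to unit length, the dot product behaves like a cosine similarity: a document compared with itself scores about 1, and two documents with no n-grams in common score 0. Whether the paper normalizes its vectors in this way is an assumption of the sketch.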


Index Terms:
Document image analysis, document vector, text similarity, text retrieval.
Citation:
Chew Lim Tan, Weihua Huang, Zhaohui Yu, Yi Xu, "Imaged Document Text Retrieval Without OCR," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 838-844, June 2002, doi:10.1109/TPAMI.2002.1008389