Chew Lim Tan, Weihua Huang, Zhaohui Yu, Yi Xu, "Imaged Document Text Retrieval Without OCR," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 838-844, June 2002. ISSN 0162-8828. DOI: http://doi.ieeecomputersociety.org/10.1109/TPAMI.2002.1008389. IEEE Computer Society, Los Alamitos, CA, USA.
Keywords: document image analysis, document vector, text similarity, text retrieval.
We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects. Image features, namely, the Vertical Traverse Density (VTD) and Horizontal Traverse Density (HTD), are extracted. An n-gram based document vector is constructed for each document based on these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese, as well as images from the UW1 database, confirms the validity of the proposed method.
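The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: a traverse density is taken to be the number of background-to-ink transitions along each scan line of a binary character bitmap, each character object is reduced to a discrete code made of its two density profiles, and the n-gram counts are scaled to unit length so that the dot product behaves like a cosine similarity. The names `traverse_density`, `character_code`, `ngram_vector`, and `similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def traverse_density(bitmap, axis):
    """Count background-to-ink transitions along each scan line.

    bitmap is a list of rows of 0/1 pixels (1 = ink).
    axis="vertical" scans each column from top to bottom (assumed VTD);
    axis="horizontal" scans each row from left to right (assumed HTD).
    """
    lines = list(zip(*bitmap)) if axis == "vertical" else bitmap
    densities = []
    for line in lines:
        crossings = 0
        previous = 0
        for pixel in line:
            if pixel and not previous:  # entering a stroke
                crossings += 1
            previous = pixel
        densities.append(crossings)
    return densities


def character_code(bitmap):
    """Assumed simplification: use the two profiles directly as a label."""
    return (tuple(traverse_density(bitmap, "vertical")),
            tuple(traverse_density(bitmap, "horizontal")))


def ngram_vector(codes, n=3):
    """Build a sparse, unit-length vector of n-gram counts over the codes."""
    counts = Counter(tuple(codes[i:i + n]) for i in range(len(codes) - n + 1))
    norm = sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()} if norm else {}


def similarity(u, v):
    """Dot product of two sparse document vectors."""
    return sum(weight * v.get(gram, 0.0) for gram, weight in u.items())


# A 3x3 bitmap shaped roughly like the letter "H".
H = [[1, 0, 1],
     [1, 1, 1],
     [1, 0, 1]]
vtd = traverse_density(H, "vertical")
htd = traverse_density(H, "horizontal")
```

In this sketch a document would be represented by calling `character_code` on each segmented character object in reading order, passing the resulting sequence to `ngram_vector`, and comparing two documents with `similarity`. A real system would need size normalization and a tolerant grouping of similar profiles, both of which are omitted here.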

