Noisy Text Categorization
December 2005 (vol. 27 no. 12)
pp. 1882-1895
This work presents categorization experiments performed over noisy texts. By noisy, we mean any text obtained through an error-prone extraction process from media other than digital texts (e.g., transcriptions of speech recordings produced by a recognition system). The performance of a categorization system over the clean and noisy versions of the same documents is compared, with the Word Error Rate of the noisy versions ranging from approximately 10 to approximately 50 percent. The noisy texts are obtained through handwriting recognition and simulation of optical character recognition. The results show that the performance loss is acceptable for Recall values up to 60-70 percent, depending on the noise source. New measures of the extraction process performance, allowing a better explanation of the categorization results, are proposed.
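
For readers unfamiliar with the Word Error Rate mentioned above, the following is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, insertions, and deletions) between a clean reference and a noisy transcription, divided by the number of reference words. The abstract does not state the exact definition or tokenization used in the article, so the function name `word_error_rate`, the whitespace tokenization, and the sample sentences are assumptions made purely for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are separated by whitespace and compared
    case-sensitively; the article may normalize text differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical clean text and noisy transcription of the same sentence.
clean = "the cat sat on the mat"
noisy = "the cat sit on mat"
print(f"WER = {word_error_rate(clean, noisy):.2%}")
```

Under this definition the value can exceed 100 percent when the transcription contains many inserted words, which is one reason it is usually reported alongside other measures.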

Index Terms:
Text categorization, noisy text, indexing, offline cursive handwriting recognition, optical character recognition.
Citation:
Alessandro Vinciarelli, "Noisy Text Categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 12, pp. 1882-1895, Dec. 2005, doi:10.1109/TPAMI.2005.248