Automatic Text Categorization and Its Application to Text Retrieval
November/December 1999 (vol. 11 no. 6)
pp. 865-879

Abstract: We develop an automatic text categorization approach and investigate its application to text retrieval. The categorization approach combines a learning paradigm known as instance-based learning with an advanced document retrieval technique known as retrieval feedback. We demonstrate the effectiveness of the approach on two real-world document collections from the MEDLINE database. We then investigate the application of automatic categorization to text retrieval. Our experiments clearly indicate that automatic categorization improves retrieval performance compared with no categorization, and that retrieval using automatic categorization achieves the same quality as retrieval using manual categorization. We also provide a detailed analysis of retrieval performance on each individual test query.

Index Terms:
Text categorization, automatic classification, text retrieval, instance-based learning, query processing.
Citation:
Wai Lam, Miguel Ruiz, Padmini Srinivasan, "Automatic Text Categorization and Its Application to Text Retrieval," IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 6, pp. 865-879, Nov.-Dec. 1999, doi:10.1109/69.824599