On Using Partial Supervision for Text Categorization
February 2004 (vol. 16 no. 2)
pp. 245-255

Abstract: In this paper, we discuss the merits of building text categorization systems by using supervised clustering techniques. Traditional approaches to document classification on a predefined set of classes are often unable to provide sufficient accuracy because it is difficult to fit a manually categorized collection of documents to a given classification model. This is especially the case for heterogeneous collections of Web documents, which vary in style, vocabulary, and authorship. Hence, this paper investigates the use of clustering to create the set of categories, and the use of those categories for classifying documents. Completely unsupervised clustering has the disadvantage that it has difficulty isolating sufficiently fine-grained classes of documents relating to a coherent subject matter. In this paper, we use the information from a preexisting taxonomy to supervise the creation of a set of related clusters, though with some freedom in defining and creating the classes. We show that the advantage of partially supervised clustering is that it gives some control over the range of subjects that the categorization system should address, while still yielding a precise mathematical definition of each category. An extremely effective way to categorize documents is then to use this a priori knowledge of the definition of each category. We also discuss a new technique that helps the classifier distinguish better among closely related clusters.
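
The general idea of taxonomy-seeded clustering followed by classification against the resulting cluster definitions can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a plain seeded k-means variant over term-frequency vectors with cosine similarity, and the function names (`vectorize`, `seeded_clusters`, `classify`) and the toy documents are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def normalize(weights):
    """Scale a sparse vector (dict of term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in weights.items()}


def vectorize(text):
    """Unit-length term-frequency vector for a whitespace-tokenized document."""
    return normalize(Counter(text.lower().split()))


def dot(a, b):
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(vectors):
    """Unit-length sum of a group of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return normalize(total)


def seeded_clusters(labeled_docs, iterations=5):
    """Start from taxonomy labels, then let documents move to the closest centroid.

    labeled_docs is a list of (text, taxonomy_label) pairs. The taxonomy only
    provides the starting assignment (the "partial supervision" in this sketch);
    later iterations are free to reassign documents.
    """
    vectors = [vectorize(text) for text, _ in labeled_docs]
    assignment = [label for _, label in labeled_docs]
    centroids = {}
    for _ in range(iterations):
        groups = {}
        for v, label in zip(vectors, assignment):
            groups.setdefault(label, []).append(v)
        centroids = {label: centroid(vs) for label, vs in groups.items()}
        assignment = [
            max(centroids, key=lambda c: dot(v, centroids[c])) for v in vectors
        ]
    return centroids


def classify(text, centroids):
    """Assign a new document to the cluster whose centroid is most similar."""
    v = vectorize(text)
    return max(centroids, key=lambda c: dot(v, centroids[c]))


# Toy data (hypothetical), labeled with two taxonomy classes.
docs = [
    ("stock market shares trading", "finance"),
    ("bank interest rates loan", "finance"),
    ("football match goal league", "sports"),
    ("tennis match serve court", "sports"),
]
centroids = seeded_clusters(docs)
print(classify("loan interest bank", centroids))
```

In this sketch each category is defined by its centroid vector, so classifying a new document reduces to a similarity comparison against those known definitions.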

Index Terms:
Clustering, categorization, supervision, taxonomy, text.
Charu C. Aggarwal, Stephen C. Gates, Philip S. Yu, "On Using Partial Supervision for Text Categorization," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 2, pp. 245-255, Feb. 2004, doi:10.1109/TKDE.2004.1269601