Multitype Features Coselection for Web Document Clustering
April 2006 (vol. 18 no. 4)
pp. 448-459
Feature selection has been widely applied in text categorization and clustering. Compared to unsupervised selection, supervised feature selection is more successful in filtering out noise in most cases. However, because label information is lacking, clustering can hardly exploit supervised selection. Some studies have proposed solving this problem with a "pseudoclass" approach. As empirical results show, this method is sensitive to the selection criteria and data sets. In this paper, we propose a novel feature coselection method for Web document clustering, called Multitype Features Coselection for Clustering (MFCC). MFCC uses intermediate clustering results in one type of feature space to help the selection in other types of feature spaces. Our experiments show that, for most selection criteria, MFCC effectively reduces the noise introduced by the "pseudoclass" approach and further improves clustering performance.
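
The core idea of using cluster assignments from one feature space as pseudoclass labels for supervised selection in another can be illustrated with a minimal sketch. This is not the authors' implementation: the chi-square criterion, the "term" and "link" feature spaces, and the names `chi_square_scores` and `coselect` are all assumptions made for illustration, and the sketch shows a single selection step rather than the full iterative procedure.

```python
from collections import Counter

def chi_square_scores(docs, labels):
    """Score each feature by its maximum chi-square statistic over pseudoclasses.

    docs   : list of sets, the features present in each document (one feature space)
    labels : list of cluster ids obtained by clustering in a *different* feature space
    """
    n = len(docs)
    class_sizes = Counter(labels)
    feature_df = Counter()   # number of documents containing feature f
    joint = Counter()        # number of documents in class c containing feature f
    for feats, c in zip(docs, labels):
        for f in feats:
            feature_df[f] += 1
            joint[(f, c)] += 1

    scores = {}
    for f, df in feature_df.items():
        best = 0.0
        for c, size in class_sizes.items():
            a = joint[(f, c)]   # in class, has feature
            b = df - a          # outside class, has feature
            m = size - a        # in class, lacks feature
            d = n - size - b    # outside class, lacks feature
            denom = (a + m) * (b + d) * (a + b) * (m + d)
            if denom:           # skip degenerate 2x2 tables
                best = max(best, n * (a * d - m * b) ** 2 / denom)
        scores[f] = best
    return scores

def coselect(docs, labels_from_other_space, k):
    """Keep the k highest-scoring features, using labels from another space."""
    scores = chi_square_scores(docs, labels_from_other_space)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]

# Hypothetical toy data: term features per document, plus cluster ids that
# are assumed to come from clustering the same documents by their links.
terms = [
    {"python", "code", "the"},
    {"python", "the"},
    {"soccer", "goal", "the"},
    {"soccer", "the"},
]
link_clusters = [0, 0, 1, 1]

selected_terms = coselect(terms, link_clusters, k=2)
print(selected_terms)
```

In this toy example the term "the" occurs in every document, so it carries no information about the link-based clusters and receives a score of zero, while terms aligned with the clusters rank highest. A fuller version would presumably recluster in the reduced term space and feed those assignments back to the other feature spaces.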

Index Terms:
Web mining, clustering, feature evaluation and selection.
Citation:
Shen Huang, Zheng Chen, Yong Yu, Wei-Ying Ma, "Multitype Features Coselection for Web Document Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 4, pp. 448-459, April 2006, doi:10.1109/TKDE.2006.63