Machine Learning and Applications, Fourth International Conference on (2006)
Dec. 14, 2006 to Dec. 16, 2006
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICMLA.2006.50
Joel W. Reed, Oak Ridge National Laboratory, USA
Yu Jiao, Oak Ridge National Laboratory, USA
Thomas E. Potok, Oak Ridge National Laboratory, USA
Brian A. Klump, Oak Ridge National Laboratory, USA
Mark T. Elmore, Oak Ridge National Laboratory, USA
Ali R. Hurson, The Pennsylvania State University, USA
In this paper, we propose a new term weighting scheme called Term Frequency-Inverse Corpus Frequency (TF-ICF). It does not require term frequency information from other documents within the document collection, and thus enables us to generate the document vectors of N streaming documents in linear time. We evaluated the effectiveness of the proposed approach in the context of a machine learning application, unsupervised document clustering, by comparing it against five widely used term weighting schemes through extensive experimentation. Our results show that TF-ICF produces document clusters comparable in quality to those generated by the widely recognized term weighting schemes, and that it is significantly faster than those methods.
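The abstract does not state the weighting formula, so the following is only a minimal sketch of the idea it describes: each document is weighted from its own term counts plus statistics from a fixed, precomputed reference corpus, so no other document in the stream needs to be consulted. The log-scaled form of both factors, the function name `tf_icf_vector`, and the sample corpus counts are illustrative assumptions, not details taken from the abstract.

```python
import math
from collections import Counter


def tf_icf_vector(tokens, corpus_doc_freq, corpus_size):
    """Weight one document using its own counts and fixed corpus statistics.

    ASSUMPTION: weight = log(1 + tf) * log((corpus_size + 1) / (n_j + 1)),
    where n_j is the number of reference-corpus documents containing the term.
    The abstract does not give the formula; this form is illustrative.
    """
    counts = Counter(tokens)
    weights = {}
    for term, tf in counts.items():
        n_j = corpus_doc_freq.get(term, 0)  # unseen terms get the largest ICF
        icf = math.log((corpus_size + 1) / (n_j + 1))
        weights[term] = math.log(1 + tf) * icf

    # Normalize to unit length so vectors are comparable by cosine similarity.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights
    return {term: w / norm for term, w in weights.items()}


# Hypothetical reference-corpus statistics (document frequencies per term).
corpus_doc_freq = {"the": 1000, "stream": 40, "cluster": 25}
corpus_size = 1000

vec = tf_icf_vector(["the", "stream", "stream", "cluster"],
                    corpus_doc_freq, corpus_size)
```

Because `corpus_doc_freq` and `corpus_size` never change as documents arrive, each vector costs time proportional to the length of its own document, which is consistent with the linear-time claim for N streaming documents.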
J. W. Reed, M. T. Elmore, T. E. Potok, A. R. Hurson, Y. Jiao and B. A. Klump, "TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams," 2006 International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, 2006, pp. 258-263.