Seventh IEEE International Conference on Data Mining (ICDM 2007) (2007)
Omaha, Nebraska, USA
Oct. 28, 2007 to Oct. 31, 2007
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDM.2007.69
Many text processing applications have adopted the Bag of Words (BOW) model for representing documents, in which each document is represented as a vector of weighted terms or n-grams, and the cosine similarity between two vectors is used as the similarity measure. Despite its great success in information retrieval and text categorization, the conventional BOW model ignores detailed local text information, i.e., the co-occurrence patterns of words at the sentence or paragraph level. In this paper, we propose a novel approach that represents a document as a set of local tf-idf vectors, which we call local word bags (LWBs). By encapsulating the local information distributed throughout a document into multiple LWBs, we can measure the similarity of two documents via a partial match between their corresponding local bags. To perform the matching efficiently, we introduce the Local Word Bag kernel (LWB kernel), a variant of the VG-Pyramid match kernel. The new kernel enables discriminative machine learning methods such as SVMs to compute the partial match between two sets of LWBs in linear time, after a one-time hierarchical clustering procedure over all local bags at the initialization stage. Experiments on real-world datasets demonstrate the effectiveness of our new approach.
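As a rough illustration of the idea described in the abstract, the following Python sketch splits a document into sentence-level word bags, weights them with tf-idf, and scores two documents by a naive partial match between their bags. It is not the LWB kernel from the paper: the sentence splitter, the idf formula, the best-match averaging, and all function names (`local_word_bags`, `idf_weights`, `partial_match_similarity`) are assumptions made for this example, and the matching here is quadratic in the number of bags rather than linear-time.

```python
import math
import re
from collections import Counter


def local_word_bags(doc):
    """Split a document into sentences; return one term-count bag per sentence.

    Assumption: a local bag is a sentence. The paper also mentions
    paragraph-level locality.
    """
    sentences = [s for s in re.split(r"[.!?]+", doc) if s.strip()]
    return [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]


def idf_weights(all_bags):
    """Smoothed idf computed over local bags (assumed formula, not the paper's)."""
    n = len(all_bags)
    df = Counter(term for bag in all_bags for term in bag)
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(bag, idf):
    """Turn a term-count bag into a sparse tf-idf vector (dict of term -> weight)."""
    return {t: tf * idf.get(t, 0.0) for t, tf in bag.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def partial_match_similarity(vecs_a, vecs_b):
    """Naive partial match: average best cosine match for each bag in the smaller set.

    This brute-force comparison only illustrates partial matching between
    two sets of local bags; it does not use hierarchical clustering.
    """
    if not vecs_a or not vecs_b:
        return 0.0
    small, large = (vecs_a, vecs_b) if len(vecs_a) <= len(vecs_b) else (vecs_b, vecs_a)
    return sum(max(cosine(u, v) for v in large) for u in small) / len(small)
```

In this sketch, two documents that share one identical sentence out of two receive a score of one half, because one local bag matches perfectly and the other finds no overlapping terms. A single whole-document vector would not show which local unit produced the match.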
Z. Chen, K. Xie, J. Yan, S. Yan, N. Liu and W. Pu, "Local Word Bag Model for Text Categorization," Seventh IEEE International Conference on Data Mining (ICDM 2007), Omaha, Nebraska, USA, 2007, pp. 625-630.