2011 IEEE 11th International Conference on Data Mining (2011)
Dec. 11, 2011 to Dec. 14, 2011
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDM.2011.59
The Vector Space Model (VSM) is widely used to represent documents and web pages. It is simple and computationally easy to work with, but it oversimplifies a document into a single vector, is susceptible to noise, and cannot explicitly represent the underlying topics of a document. This paper proposes a matrix representation of documents in which rows represent distinct terms and columns represent cohesive segments. The matrix model views a document as a set of segments, and each segment is a probability distribution over a limited number of latent topics, which can be mapped to clustering structures. Latent topic extraction based on the matrix representation is formulated as a constrained optimization problem in which each matrix (i.e., a document) A_i is factorized into a common base determined by non-negative matrices L and R^\top, and a non-negative weight matrix M_i, such that the sum of the reconstruction errors over all documents is minimized. Empirical evaluation demonstrates that it is feasible to use the matrix model for document clustering: (1) compared with the vector representation, the matrix representation consistently improves clustering quality, and the proposed approach achieves a relative accuracy improvement of up to 66% on the studied datasets; and (2) the proposed method outperforms baseline methods such as k-means and NMF, and complements state-of-the-art methods such as LDA and PLSI. Furthermore, the proposed matrix model allows more refined information retrieval at the segment level instead of the document level, which enables the return of more relevant documents in information retrieval tasks.
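The factorization described in the abstract, A_i ≈ L M_i R^\top with all factors non-negative and the summed reconstruction error minimized, can be illustrated with a minimal NumPy sketch. This is not the paper's algorithm: it swaps in generic multiplicative updates of the kind commonly used for non-negative matrix factorization, and it assumes that every document matrix has the same shape (terms x segments). The function names and the rank parameters k_terms and k_segs are hypothetical, introduced only for illustration.

```python
import numpy as np


def reconstruction_error(docs, L, R, Ms):
    """Sum over documents of the squared Frobenius error ||A_i - L M_i R^T||^2."""
    return sum(np.linalg.norm(A - L @ M @ R.T) ** 2 for A, M in zip(docs, Ms))


def factorize_documents(docs, k_terms, k_segs, n_iter=200, eps=1e-9, seed=0):
    """Approximate each non-negative A_i by L @ M_i @ R.T with shared L and R.

    Assumption: all matrices in `docs` have the same shape (terms x segments).
    The updates are generic NMF-style multiplicative rules, not the paper's.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_segs = docs[0].shape

    # Random non-negative initialization; multiplicative updates keep signs.
    L = rng.random((n_terms, k_terms))
    R = rng.random((n_segs, k_segs))
    Ms = [rng.random((k_terms, k_segs)) for _ in docs]

    errors = [reconstruction_error(docs, L, R, Ms)]
    for _ in range(n_iter):
        # Shared left factor L: accumulate contributions from all documents.
        num = sum(A @ R @ M.T for A, M in zip(docs, Ms))
        den = sum(L @ M @ (R.T @ R) @ M.T for M in Ms) + eps
        L *= num / den

        # Shared right factor R.
        num = sum(A.T @ L @ M for A, M in zip(docs, Ms))
        den = sum(R @ M.T @ (L.T @ L) @ M for M in Ms) + eps
        R *= num / den

        # Per-document weight matrices M_i.
        LtL, RtR = L.T @ L, R.T @ R
        for i, A in enumerate(docs):
            Ms[i] *= (L.T @ A @ R) / (LtL @ Ms[i] @ RtR + eps)

        errors.append(reconstruction_error(docs, L, R, Ms))
    return L, R, Ms, errors
```

Calling factorize_documents(docs, k_terms=3, k_segs=2) on a list of equally shaped non-negative arrays returns the shared factors, one weight matrix per document, and the error after each iteration. How the weights are then mapped to clusters is left to the paper itself.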
Document Clustering, Document Representation, Matrix Representation, Non-Negative Matrix Approximation
X. Wang, J. Tang and H. Liu, "Document Clustering via Matrix Representation," 2011 IEEE 11th International Conference on Data Mining (ICDM), Vancouver, Canada, 2011, pp. 804-813.