2011 IEEE 11th International Conference on Data Mining
Document Clustering via Matrix Representation
Vancouver, Canada
December 11-December 14
ISBN: 978-0-7695-4408-3
ASCII Text:
Xufei Wang, Jiliang Tang, Huan Liu, "Document Clustering via Matrix Representation," Data Mining, IEEE International Conference on, pp. 804-813, 2011 IEEE 11th International Conference on Data Mining, 2011.

BibTex:
@article{ 10.1109/ICDM.2011.59,
  author = {Xufei Wang and Jiliang Tang and Huan Liu},
  title = {Document Clustering via Matrix Representation},
  journal = {Data Mining, IEEE International Conference on},
  volume = {0},
  year = {2011},
  issn = {1550-4786},
  pages = {804-813},
  doi = {http://doi.ieeecomputersociety.org/10.1109/ICDM.2011.59},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
}

RefWorks Procite/RefMan/Endnote:
TY - CONF
JO - Data Mining, IEEE International Conference on
TI - Document Clustering via Matrix Representation
SN - 1550-4786
SP - 804
EP - 813
A1 - Xufei Wang
A1 - Jiliang Tang
A1 - Huan Liu
PY - 2011
KW - Document Clustering
KW - Document Representation
KW - Matrix Representation
KW - Non-Negative Matrix Approximation
VL - 0
JA - Data Mining, IEEE International Conference on
ER -
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDM.2011.59
The Vector Space Model (VSM) is widely used to represent documents and web pages. It is simple and computationally easy to handle, but it oversimplifies a document into a single vector, is susceptible to noise, and cannot explicitly represent the underlying topics of a document. This paper proposes a matrix representation of documents: rows represent distinct terms and columns represent cohesive segments. The matrix model views a document as a set of segments, and each segment is a probability distribution over a limited number of latent topics, which can be mapped to clustering structures. Latent topic extraction based on the matrix representation is formulated as a constrained optimization problem in which each matrix (i.e., a document) A_i is factorized into a common base, determined by non-negative matrices L and R^\top, and a non-negative weight matrix M_i, such that the sum of the reconstruction errors over all documents is minimized. Empirical evaluation demonstrates that it is feasible to use the matrix model for document clustering: (1) compared with the vector representation, the matrix representation improves clustering quality consistently, and the proposed approach achieves a relative accuracy improvement of up to 66% on the studied datasets, and (2) the proposed method outperforms baseline methods such as k-means and NMF, and complements state-of-the-art methods such as LDA and PLSI. Furthermore, the proposed matrix model allows more refined information retrieval at the segment level instead of the document level, which enables more relevant documents to be returned in information retrieval tasks.
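To make the factorization concrete, the following is a minimal sketch of one way to approximate each A_i by L M_i R^\top with all factors non-negative, minimizing the summed squared (Frobenius) reconstruction error. It is not the authors' algorithm: the abstract does not specify the solver, the error measure, or any further constraints, so this sketch swaps in standard NMF-style multiplicative updates. The function names `factorize` and `total_error`, the ranks `k1` and `k2`, and the requirement that every document matrix has the same shape (same vocabulary, same number of segments) are all assumptions made for illustration.

```python
import numpy as np

def factorize(docs, k1, k2, n_iter=100, eps=1e-9, seed=0):
    """Approximate each A_i in docs by L @ M_i @ R.T with non-negative factors.

    Assumption: all A_i are non-negative arrays of the same shape (m, n).
    Uses NMF-style multiplicative updates, not the paper's own procedure.
    """
    rng = np.random.default_rng(seed)
    m, n = docs[0].shape
    L = rng.random((m, k1))                         # shared left base
    R = rng.random((n, k2))                         # shared right base
    Ms = [rng.random((k1, k2)) for _ in docs]       # per-document weights
    for _ in range(n_iter):
        # Update every M_i with L and R held fixed.
        LtL, RtR = L.T @ L, R.T @ R
        Ms = [M * (L.T @ A @ R) / (LtL @ M @ RtR + eps)
              for A, M in zip(docs, Ms)]
        # Update L with R and all M_i held fixed.
        num = sum(A @ R @ M.T for A, M in zip(docs, Ms))
        den = L @ sum(M @ RtR @ M.T for M in Ms) + eps
        L = L * num / den
        # Update R with the new L and all M_i held fixed.
        LtL = L.T @ L
        num = sum(A.T @ L @ M for A, M in zip(docs, Ms))
        den = R @ sum(M.T @ LtL @ M for M in Ms) + eps
        R = R * num / den
    return L, Ms, R

def total_error(docs, L, Ms, R):
    """Sum of squared Frobenius reconstruction errors over all documents."""
    return sum(np.linalg.norm(A - L @ M @ R.T) ** 2 for A, M in zip(docs, Ms))
```

Under these assumptions, each M_i is a small k1-by-k2 matrix, so one plausible way to use the result is to flatten M_i into a feature vector per document and pass those vectors to any standard clustering routine. How the paper itself maps latent topics to clusters is not stated in the abstract.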
Index Terms:
Document Clustering, Document Representation, Matrix Representation, Non-Negative Matrix Approximation
