Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06)
A Hybrid Strategy for Clustering Data Mining Documents
Hong Kong, China
December 18-December 22
ISBN: 0-7695-2702-7
Yi Peng, University of Nebraska at Omaha
Gang Kou, Thomson Legal & Regulatory, R&D, 610 Opperman Drive, Eagan
With the increase in the number of electronic documents, it is hard to manually organize, analyze, and present these documents efficiently. Document clustering, which automatically groups similar or related documents together, has been used in practical applications to understand the contents and structures of documents. Although a variety of methods and algorithms have been proposed, generating meaningful document clusters remains a challenging task. This paper uses an approach that combines quantitative and qualitative methods to create high-quality clusters for a collection of data mining and knowledge discovery (DMKD) publications. The quantitative method extracts a list of nouns and noun phrases from the DMKD documents and uses an optimization procedure from the CLUTO toolkit to assign documents to clusters. The qualitative method uses grounded theory to identify major categories of the documents and so improve the comprehensibility of the resulting clusters. The results demonstrate that the strategy produces more meaningful clusters than a single-term k-way clustering algorithm, as judged by internal metrics and human assessment.
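The quantitative step can be illustrated with a minimal sketch. It is not the paper's implementation and does not call CLUTO. It assumes that noun phrases have already been extracted for each document, weights them with a smoothed TF-IDF scheme, and substitutes a simple spherical k-means (assigning each document to the centroid with the highest cosine similarity) for CLUTO's optimization procedure. The function names, the seeding rule, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn lists of phrases into unit-length sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    # Document frequency: in how many documents each phrase occurs.
    df = Counter(p for doc in docs for p in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        vec = {p: c * (math.log((1 + n) / (1 + df[p])) + 1.0) for p, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({p: w / norm for p, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(p, 0.0) for p, w in u.items())


def centroid(vectors):
    """Unit-length sum of the member vectors."""
    total = Counter()
    for v in vectors:
        for p, w in v.items():
            total[p] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {p: w / norm for p, w in total.items()}


def cluster(vectors, k, iterations=20):
    """Spherical k-means; seeds with the first k documents (sketch assumption)."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [None] * len(vectors)
    for _ in range(iterations):
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = centroid(members)
    return labels


# Toy input: each document is its list of extracted noun phrases (hypothetical).
docs = [
    ["association rule", "frequent itemset", "association rule"],
    ["decision tree", "classification accuracy"],
    ["frequent itemset", "association rule"],
    ["decision tree", "classification accuracy", "decision tree"],
]
labels = cluster(tfidf_vectors(docs), k=2)
print(labels)
```

Using whole phrases rather than single terms as features is the point of contrast with the single-term baseline mentioned in the abstract; the clustering routine itself is interchangeable.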
Index Terms:
Document clustering, Hard clustering, Soft clustering, Optimization algorithm, Data mining, Grounded theory
Citation:
Yi Peng, Gang Kou, Yong Shi, Zhengxin Chen, "A Hybrid Strategy for Clustering Data Mining Documents," icdmw, pp.838-842, Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), 2006