2016 International Conference on Big Data and Smart Computing (BigComp) (2016)
Hong Kong, China
Jan. 18, 2016 to Jan. 20, 2016
Heeryon Cho , School of Computer Science, Kookmin University, Seoul, South Korea
Jong-Seok Lee , School of Integrated Technology & Yonsei Institute of Convergence Technology, Yonsei University, Incheon, South Korea
Popular news articles attract thousands of online comments, making manual review tedious and time-consuming. Automatically clustering similar comments can reduce the burden of manual analysis, but appropriate feature words must be selected for the clustering to succeed. In this paper, we present a data-driven feature word selection method that yields structurally superior clusters of online comments. The 1,000 most frequent nouns across the entire collection of 7.44 million Korean online comments are selected to construct an overall noun set. Frequent nouns in the online comments of each news article are selected to construct that article's local noun set. The intersection of the local and overall noun sets forms the global noun set. Removing the global noun set from the corresponding local noun set yields the distinct noun set. For each of the local, global, and distinct noun sets, the 250 most frequent nouns are selected as features for K-means clustering. The clustering results are evaluated using three internal cluster validation indices: Dunn, PBM, and Silhouette. Online comments clustered using distinct nouns produced structurally superior clusters compared to those clustered using local or global nouns.
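The noun set construction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes comments have already been reduced to lists of nouns (the noun extraction step for Korean text is outside the sketch), and the names `top_nouns`, `build_noun_sets`, and the `min_count` threshold are hypothetical: the abstract does not state how "frequent" is defined for the local noun set.

```python
from collections import Counter


def top_nouns(noun_lists, k):
    """Return the set of the k most frequent nouns across tokenized comments."""
    counts = Counter(noun for nouns in noun_lists for noun in nouns)
    return {noun for noun, _ in counts.most_common(k)}


def build_noun_sets(corpus_comments, article_comments,
                    overall_k=1000, min_count=2, top_k=250):
    """Build local, global, and distinct noun lists for one article.

    corpus_comments: noun lists for every comment in the whole collection.
    article_comments: noun lists for the comments of a single article.
    min_count: assumed frequency threshold for the local noun set.
    """
    # Overall noun set: most frequent nouns across the whole collection.
    overall = top_nouns(corpus_comments, overall_k)

    # Local noun set: frequent nouns within this article, most frequent first.
    local_counts = Counter(n for nouns in article_comments for n in nouns)
    local = [n for n, c in local_counts.most_common() if c >= min_count]

    # Global noun set: local nouns that also appear in the overall set.
    global_nouns = [n for n in local if n in overall]

    # Distinct noun set: local nouns left after removing the global nouns.
    distinct = [n for n in local if n not in overall]

    # Keep only the most frequent nouns of each type as clustering features.
    return {
        "local": local[:top_k],
        "global": global_nouns[:top_k],
        "distinct": distinct[:top_k],
    }
```

Each returned list could then serve as the vocabulary for a bag-of-words representation of the article's comments before K-means clustering; that step, and the Dunn, PBM, and Silhouette evaluation, are not covered by this sketch.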
Senior citizens, Indexes, Media, Government, Education, Employment, Manuals
Heeryon Cho and Jong-Seok Lee, "Data-driven feature word selection for clustering online news comments," 2016 International Conference on Big Data and Smart Computing (BigComp), Hong Kong, China, 2016, pp. 494-497.