Issue No.06 - June (2012 vol.24)

pp: 988-1001

Chee Keong Chan, Div. of Inf. Eng., Nanyang Technol. Univ., Singapore, Singapore

Lihui Chen, Div. of Inf. Eng., Nanyang Technol. Univ., Singapore, Singapore

Duc Thang Nguyen, Div. of Inf. Eng., Nanyang Technol. Univ., Singapore, Singapore

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.86

ABSTRACT

All clustering methods have to assume some cluster relationship among the data objects that they are applied to. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multiviewpoint-based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter uses many different viewpoints, which are objects assumed not to be in the same cluster as the two objects being measured. Using multiple viewpoints, a more informative assessment of similarity can be achieved. Theoretical analysis and an empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.
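The contrast between a single viewpoint and many viewpoints can be illustrated with a small sketch. This is not the paper's implementation: the function names are hypothetical, and averaging the dot products of the difference vectors over all viewpoints is an assumption about how the viewpoints are combined, since the abstract does not give a formula. With the origin as the only viewpoint and unit-length vectors, the sketch reduces to cosine similarity.

```python
import numpy as np


def cosine_similarity(a, b):
    """Single-viewpoint similarity: a and b are compared as seen from the origin."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def multiviewpoint_similarity(a, b, viewpoints):
    """Illustrative multiviewpoint similarity (assumed form, not from the abstract).

    For each viewpoint h, take the dot product of (a - h) and (b - h),
    then average over all viewpoints. `viewpoints` is a 2-D array with one
    viewpoint per row; the rows are meant to be objects assumed to lie
    outside the cluster that contains a and b.
    """
    diffs_a = a - viewpoints  # broadcasting: one row per viewpoint
    diffs_b = b - viewpoints
    return float(np.mean(np.sum(diffs_a * diffs_b, axis=1)))


if __name__ == "__main__":
    a = np.array([0.6, 0.8])
    b = np.array([1.0, 0.0])
    origin_only = np.array([[0.0, 0.0]])
    others = np.array([[0.0, 0.0], [-1.0, -1.0]])
    print(cosine_similarity(a, b))
    print(multiviewpoint_similarity(a, b, origin_only))
    print(multiviewpoint_similarity(a, b, others))
```

In this sketch the choice of viewpoints changes the score for the same pair of objects, which is the sense in which several viewpoints may carry more information than the origin alone.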

INDEX TERMS

pattern clustering, document handling, clustering algorithm, multiviewpoint-based similarity measure, data objects, dissimilarity measure, informative assessment, document clustering, Clustering algorithms, Strontium, Euclidean distance, Current measurement, Proposals, Partitioning algorithms, Algorithm design and analysis, similarity measure, Document clustering, text mining

CITATION

Chee Keong Chan, Lihui Chen, Duc Thang Nguyen, "Clustering with Multiviewpoint-Based Similarity Measure",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 24, no. 6, pp. 988-1001, June 2012, doi:10.1109/TKDE.2011.86