Issue No.09 - September (2008 vol.20)
pp: 1217-1229
Hung Chim , City University of Hong Kong
Xiaotie Deng , City University of Hong Kong
ABSTRACT
In this paper, we propose a phrase-based document similarity that computes the pairwise similarities of documents based on the Suffix Tree Document (STD) model. By mapping each node in the suffix tree of the STD model to a unique feature term in the Vector Space Document (VSD) model, the phrase-based document similarity naturally inherits the tf-idf term weighting scheme when computing document similarity with phrases. We apply the phrase-based document similarity to the group-average Hierarchical Agglomerative Clustering (HAC) algorithm and develop a new document clustering approach. Our evaluation experiments indicate that the new clustering approach is very effective at clustering the documents of two standard document benchmark corpora, OHSUMED and RCV1. The quality of the clustering results significantly surpasses that obtained with the traditional single-word tf-idf similarity measure in the same HAC algorithm, especially on large document data sets. Furthermore, by studying the properties of the STD model, we conclude that the feature vector of phrase terms in the STD model can be considered an expanded feature vector of the traditional single-word terms in the VSD model. This conclusion explains why the phrase-based document similarity works much better than the single-word tf-idf similarity measure.
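As a rough illustration of the idea in the abstract, the following Python sketch treats every word phrase of a document as a feature term, weights it with tf-idf, and compares documents by cosine similarity. It is not the authors' implementation: it enumerates word n-grams directly instead of building a suffix tree, and the function names, the `max_len` cap on phrase length, and the particular tf-idf variant `(1 + log tf) * log(1 + N / df)` are all assumptions made for this example.

```python
import math
from collections import Counter


def phrases(text, max_len=3):
    """Return all word phrases of up to max_len words (assumed stand-in
    for the phrases represented by suffix tree nodes)."""
    words = text.lower().split()
    return [
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, min(i + max_len, len(words)) + 1)
    ]


def tfidf_vectors(docs, max_len=3):
    """Build one sparse tf-idf vector (dict: phrase -> weight) per document."""
    counts = [Counter(phrases(d, max_len)) for d in docs]
    n = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each phrase counted once per document
    return [
        # Assumed weighting variant; other tf-idf formulas would also work.
        {t: (1 + math.log(tf)) * math.log(1 + n / df[t]) for t, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
    vecs = tfidf_vectors(docs)
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            print(i, j, round(cosine(vecs[i], vecs[j]), 3))
```

Because single words are phrases of length one, each vector here contains the usual single-word features plus the longer phrases, which mirrors the abstract's remark that the phrase feature vector can be viewed as an expanded single-word feature vector. The resulting pairwise similarities could then be passed to a group-average HAC routine.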
INDEX TERMS
Clustering, Linguistic processing, Trees
CITATION
Hung Chim, Xiaotie Deng, "Efficient Phrase-Based Document Similarity for Clustering", IEEE Transactions on Knowledge & Data Engineering, vol.20, no. 9, pp. 1217-1229, September 2008, doi:10.1109/TKDE.2008.50