Efficient Phrase-Based Document Indexing for Web Document Clustering
October 2004 (vol. 16 no. 10)
pp. 1279-1296
Most document clustering techniques rely on single-term analysis of the document data set, as in the Vector Space Model. More accurate clustering calls for more informative features, in particular phrases and their weights. Document clustering is useful in many applications, such as automatic categorization of documents, grouping of search engine results, and building a taxonomy of documents. This paper presents two key parts of successful document clustering. The first is a novel phrase-based document index model, the Document Index Graph, which allows incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching, which is used to judge the similarity between documents. The model is flexible in that it can revert to a compact representation of the vector space model if phrases are not indexed. The second part is an incremental document clustering algorithm that maximizes the tightness of clusters by carefully watching the pairwise document similarity distribution inside clusters. Together, these two components form an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
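
The abstract does not spell out the internals of the Document Index Graph, so the Python sketch below is only an illustration of the general idea under stated assumptions, not the authors' implementation. It assumes that nodes are words, that an edge links two consecutive words in a sentence, and that each edge records where it occurs. The names `PhraseIndexGraph` and `add_document`, the whitespace tokenizer, and the sample sentences are all hypothetical. Adding a document first matches its sentences against the existing edges to find phrases shared with earlier documents, then indexes the new document, which is what makes the construction incremental.

```python
from collections import defaultdict


class PhraseIndexGraph:
    """Toy word graph (assumed design): edges link consecutive words."""

    def __init__(self):
        # edges[(w1, w2)][doc_id] -> set of (sentence_no, position_of_w1)
        self.edges = defaultdict(lambda: defaultdict(set))

    def _places(self, w1, w2, doc_id):
        # Read-only lookup; .get() avoids creating empty entries.
        return self.edges.get((w1, w2), {}).get(doc_id, ())

    def add_document(self, doc_id, sentences):
        """Index a document; return phrases it shares with earlier ones.

        The result maps each earlier doc_id to a set of word tuples.
        """
        tokenized = [s.lower().split() for s in sentences]
        shared = defaultdict(set)

        # Step 1: match the new sentences against the existing graph.
        for words in tokenized:
            for i in range(len(words) - 1):
                first_edge = self.edges.get((words[i], words[i + 1]), {})
                for other, places in first_edge.items():
                    for (s, p) in places:
                        # Skip matches that could be extended to the left;
                        # they are reported from the earlier start position.
                        if i > 0 and (s, p - 1) in self._places(
                                words[i - 1], words[i], other):
                            continue
                        # Extend to the right while the next edge continues
                        # the same sentence of the other document.
                        j = i + 1
                        while (j + 1 < len(words)
                               and (s, p + (j - i)) in self._places(
                                   words[j], words[j + 1], other)):
                            j += 1
                        shared[other].add(tuple(words[i:j + 1]))

        # Step 2: insert the new document's edges into the graph.
        for s_no, words in enumerate(tokenized):
            for p, pair in enumerate(zip(words, words[1:])):
                self.edges[pair][doc_id].add((s_no, p))

        return dict(shared)


graph = PhraseIndexGraph()
first = graph.add_document("d1", ["river rafting trips", "mild river rafting"])
second = graph.add_document("d2", ["wild river adventures",
                                   "river rafting vacation plan"])
third = graph.add_document("d3", ["booking river rafting trips"])
```

In this made-up example, `first` is empty because the graph starts empty, `second` reports the phrase `("river", "rafting")` as shared with `d1`, and `third` additionally finds the longer phrase `("river", "rafting", "trips")` in `d1`. A similarity measure could then weight such shared phrases by length and frequency; that step is left out here because the abstract does not describe it.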

Index Terms:
Web mining, document similarity, phrase-based indexing, document clustering, document structure, document index graph, phrase matching.
Khaled M. Hammouda, Mohamed S. Kamel, "Efficient Phrase-Based Document Indexing for Web Document Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 10, pp. 1279-1296, Oct. 2004, doi:10.1109/TKDE.2004.58