Issue No.05 - May (2009 vol.21)
pp: 681-698
Khaled M. Hammouda, University of Waterloo, Waterloo
Mohamed S. Kamel, University of Waterloo, Waterloo
ABSTRACT
In distributed data mining, adopting a flat node distribution model can limit scalability. To address the problems of modularity, flexibility, and scalability, we propose a Hierarchically-distributed Peer-to-Peer (HP2PC) architecture and clustering algorithm. The architecture is based on a multi-layer overlay network of peer neighborhoods. Supernodes, which act as representatives of neighborhoods, are recursively grouped to form higher-level neighborhoods. Within a given level of the hierarchy, peers cooperate within their respective neighborhoods to perform P2P clustering. Using this model, we can partition the clustering problem in a modular way across neighborhoods, solve each part individually using a distributed K-means variant, and then successively combine clusterings up the hierarchy, where increasingly global solutions are computed. In addition, for document clustering applications, we summarize the distributed document clusters using a distributed keyphrase extraction algorithm, thus providing an interpretation of the clusters. Results show good speedup, reaching 165 times faster than centralized clustering on a 250-node simulated network, with clustering quality comparable to the centralized approach. We also compare against the P2P K-means algorithm and show that HP2PC accuracy is better for typical hierarchy heights. Results for distributed cluster summarization match those of their centralized counterparts with up to 88% accuracy.
INDEX TERMS
Clustering, Text mining, Data mining, Abstracting methods, Distributed systems
CITATION
Khaled M. Hammouda, Mohamed S. Kamel, "Hierarchically Distributed Peer-to-Peer Document Clustering and Cluster Summarization", IEEE Transactions on Knowledge & Data Engineering, vol. 21, no. 5, pp. 681-698, May 2009, doi:10.1109/TKDE.2008.189