Issue No. 05 - May (2009 vol. 21)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.189
Khaled M. Hammouda , University of Waterloo, Waterloo
Mohamed S. Kamel , University of Waterloo, Waterloo
In distributed data mining, adopting a flat node distribution model can affect scalability. To address the problem of modularity, flexibility and scalability, we propose a Hierarchically-distributed Peer-to-Peer (HP2PC) architecture and clustering algorithm. The architecture is based on a multi-layer overlay network of peer neighborhoods. Supernodes, which act as representatives of neighborhoods, are recursively grouped to form higher level neighborhoods. Within a certain level of the hierarchy, peers cooperate within their respective neighborhoods to perform P2P clustering. Using this model, we can partition the clustering problem in a modular way across neighborhoods, solve each part individually using a distributed K-means variant, then successively combine clusterings up the hierarchy where increasingly more global solutions are computed. In addition, for document clustering applications, we summarize the distributed document clusters using a distributed keyphrase extraction algorithm, thus providing interpretation of the clusters. Results show decent speedup, reaching 165 times faster than centralized clustering for a 250-node simulated network, with comparable clustering quality to the centralized approach. We also provide comparison to the P2P K-means algorithm and show that HP2PC accuracy is better for typical hierarchy heights. Results for distributed cluster summarization match those of their centralized counterparts with up to 88% accuracy.
Clustering, Text mining, Data mining, Abstracting methods, Distributed systems
K. M. Hammouda and M. S. Kamel, "Hierarchically Distributed Peer-to-Peer Document Clustering and Cluster Summarization," in IEEE Transactions on Knowledge & Data Engineering, vol. 21, no. , pp. 681-698, 2008.