Issue No.10 - October (2009 vol.21)
Souptik Datta, University of Maryland, Baltimore
Chris R. Giannella, Loyola College, Baltimore
Hillol Kargupta, University of Maryland, Baltimore
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.222
Data-intensive Peer-to-Peer (P2P) networks are finding an increasing number of applications. Data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in such environments, since they typically require centralizing the distributed data, which is usually not practical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternative. This paper considers the distributed K-means clustering problem, where the data and computing resources are distributed over a large P2P network. It offers two algorithms that produce an approximation of the result produced by the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network and produces clusterings using “local” synchronization only. The second uses uniformly sampled peers and provides analytical guarantees regarding the accuracy of clustering on a P2P network. Empirical results show that both algorithms perform well compared with their centralized counterparts at a modest communication cost.
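The sampling idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm and carries none of its analytical guarantees. It assumes that each peer holds a list of points, that peers are simulated in a single process with no real networking, and that every iteration draws a fresh uniform sample of peers. The names `local_stats`, `sampled_kmeans`, and `sample_size` are hypothetical. Each sampled peer reports only per-cluster sums and counts, so raw data is never centralized.

```python
import random


def local_stats(points, centroids):
    """Per-peer step: assign each local point to its nearest centroid and
    return per-cluster coordinate sums and point counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # Index of the nearest centroid by squared Euclidean distance.
        j = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
        )
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def sampled_kmeans(peers, centroids, sample_size, iterations=10, seed=0):
    """Approximate K-means: in each iteration, update the centroids from the
    aggregated statistics of a uniform sample of peers (assumed scheme)."""
    rng = random.Random(seed)
    centroids = [list(c) for c in centroids]
    k, dim = len(centroids), len(centroids[0])
    for _ in range(iterations):
        sample = rng.sample(peers, sample_size)  # uniform, without replacement
        total_sums = [[0.0] * dim for _ in range(k)]
        total_counts = [0] * k
        for peer_points in sample:
            sums, counts = local_stats(peer_points, centroids)
            for j in range(k):
                total_counts[j] += counts[j]
                for d in range(dim):
                    total_sums[j][d] += sums[j][d]
        for j in range(k):
            # Keep the old centroid if no sampled point was assigned to it.
            if total_counts[j] > 0:
                centroids[j] = [s / total_counts[j] for s in total_sums[j]]
    return centroids
```

In this sketch the per-iteration communication grows with `sample_size` times the number of clusters rather than with the total data size, which is the kind of trade-off between accuracy and communication cost the abstract refers to.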
Peer-to-peer data mining, distributed K-means clustering.
Souptik Datta, Chris R. Giannella, Hillol Kargupta, "Approximate Distributed K-Means Clustering over a Peer-to-Peer Network", IEEE Transactions on Knowledge & Data Engineering, vol. 21, no. 10, pp. 1372-1388, October 2009, doi:10.1109/TKDE.2008.222