Approximate Distributed K-Means Clustering over a Peer-to-Peer Network
October 2009 (vol. 21 no. 10)
pp. 1372-1388
Souptik Datta, University of Maryland, Baltimore
Chris R. Giannella, Loyola College, Baltimore
Hillol Kargupta, University of Maryland, Baltimore
Data-intensive Peer-to-Peer (P2P) networks are finding an increasing number of applications, and data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in these environments, since they typically require centralizing the distributed data, which is usually impractical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternative. This paper considers the distributed K-means clustering problem, in which the data and computing resources are distributed over a large P2P network. It offers two algorithms that approximate the result produced by the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network and produces clusterings using only “local” synchronization. The second uses uniformly sampled peers and provides analytical guarantees on the accuracy of the clustering over a P2P network. Empirical results show that both algorithms perform well compared with their centralized counterparts, at a modest communication cost.
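
To make the sampling idea concrete, here is a minimal Python sketch of K-means in which each iteration aggregates statistics from a uniformly sampled subset of peers instead of from all the data. It is an illustration under stated assumptions, not the paper's algorithm: the abstract gives no pseudocode, so the function names (`nearest`, `local_stats`, `sampled_kmeans`), the parameters (`sample_size`, `iterations`, `seed`), and the modeling of a peer as an in-memory list of points are all hypothetical. Network communication, peer churn, and the paper's accuracy guarantees are not modeled.

```python
import random


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def local_stats(points, centroids):
    """Per-cluster coordinate sums and counts for one peer's local points."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for pt in points:
        j = nearest(pt, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += pt[d]
    return sums, counts


def sampled_kmeans(peers, centroids, sample_size, iterations, seed=0):
    """Approximate K-means using a fresh uniform sample of peers per iteration.

    peers is a list in which each element is one peer's local list of points
    (an assumption made for this sketch; no real network is involved).
    """
    rng = random.Random(seed)
    centroids = [list(c) for c in centroids]
    dim = len(centroids[0])
    for _ in range(iterations):
        sample = rng.sample(peers, sample_size)  # uniform, without replacement
        total_sums = [[0.0] * dim for _ in centroids]
        total_counts = [0] * len(centroids)
        # Only small summaries (sums and counts) are combined, never raw data.
        for peer_points in sample:
            sums, counts = local_stats(peer_points, centroids)
            for j in range(len(centroids)):
                total_counts[j] += counts[j]
                for d in range(dim):
                    total_sums[j][d] += sums[j][d]
        # Keep the old centroid when the sample assigned no points to it.
        for j in range(len(centroids)):
            if total_counts[j] > 0:
                centroids[j] = [s / total_counts[j] for s in total_sums[j]]
    return centroids
```

When `sample_size` equals the number of peers, each iteration reduces to a standard centralized K-means update computed from per-peer summaries; smaller samples trade accuracy for lower communication, which is the trade-off the abstract describes.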

Index Terms:
Peer-to-peer data mining, distributed K-means clustering.
Citation:
Souptik Datta, Chris R. Giannella, Hillol Kargupta, "Approximate Distributed K-Means Clustering over a Peer-to-Peer Network," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, pp. 1372-1388, Oct. 2009, doi:10.1109/TKDE.2008.222