Issue No.04 - April (2009 vol.21)
pp: 465-478
Ran Wolff, Haifa University, Haifa
Kanishka Bhaduri, University of Maryland, Baltimore County, Baltimore
Hillol Kargupta, University of Maryland, Baltimore County, Baltimore
ABSTRACT
In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval, and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up to date. Computing global data mining models (e.g., decision trees, k-means clustering) in large distributed systems may be very costly due to the scale of the system and the potentially high communication cost. The cost increases further in a dynamic scenario where the data changes rapidly. In this paper, we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm that can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data, such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.
INDEX TERMS
Data mining, Mining methods and algorithms, Distributed databases, Peer to Peer Data Mining, Distributed systems, Systems and Software, Information Storage and Retrieval, Information Technology
CITATION
Ran Wolff, Kanishka Bhaduri, Hillol Kargupta, "A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems", IEEE Transactions on Knowledge & Data Engineering, vol.21, no. 4, pp. 465-478, April 2009, doi:10.1109/TKDE.2008.169