2007 IEEE 23rd International Conference on Data Engineering
Conquering the Divide: Continuous Clustering of Distributed Data Streams
Istanbul, Turkey
April 15-April 20
ISBN: 1-4244-0802-4
Graham Cormode, AT&T Labs-Research, graham@research.att.com
S. Muthukrishnan, Rutgers University, muthu@cs.rutgers.edu
Wei Zhuang, Rutgers University, weiz@cs.rutgers.edu
Data is often collected over a distributed network, but in many cases it is so voluminous that collecting it in a central location is impractical and undesirable. Instead, we must perform distributed computations over the data, guaranteeing high-quality answers even as new data arrives. In this paper, we formalize and study the problem of maintaining a clustering of such distributed, continuously evolving data. In particular, our goal is to minimize the communication and computational cost while still providing guaranteed accuracy of the clustering. We focus on k-center clustering and provide a suite of algorithms that vary in which centralized algorithm they derive from, and in whether they maintain a single global clustering or many local clusterings that can be merged together. We show that these algorithms can be designed to give accuracy guarantees that are close to the best possible even in the centralized case. In our experiments, we see clear trends among these algorithms, showing that the choice of algorithm is crucial, and that we can achieve a clustering that is as good as the best centralized clustering with only a small fraction of the communication required to collect all the data in a single location.
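For readers unfamiliar with the k-center objective (choose k centers so that the largest distance from any point to its nearest center is as small as possible), the following is a minimal sketch of one well-known centralized heuristic, the greedy farthest-point method usually attributed to Gonzalez, which is a 2-approximation. The abstract does not say which centralized algorithms the paper builds on, so this is an illustration of the objective and not a reproduction of the authors' distributed algorithms; the function name and the tuple-based point representation are assumptions made for this example.

```python
import math


def farthest_point_k_center(points, k):
    """Greedy farthest-point heuristic for k-center (illustrative sketch).

    points: non-empty list of equal-length numeric tuples (assumed format).
    Returns (centers, radius), where radius is the largest distance from
    any point to its nearest chosen center.
    """
    if not points or k <= 0:
        return [], 0.0

    # Start from an arbitrary point; here, simply the first one.
    centers = [points[0]]
    # dist[i] holds the distance from points[i] to its nearest center so far.
    dist = [math.dist(p, centers[0]) for p in points]

    while len(centers) < min(k, len(points)):
        # Pick the point that is currently farthest from all chosen centers.
        i = max(range(len(points)), key=dist.__getitem__)
        if dist[i] == 0.0:
            break  # every point already coincides with some center
        centers.append(points[i])
        # Update each point's nearest-center distance with the new center.
        dist = [min(d, math.dist(p, points[i])) for p, d in zip(points, dist)]

    return centers, max(dist)


if __name__ == "__main__":
    sample = [(0, 0), (1, 0), (10, 0), (11, 0)]
    print(farthest_point_k_center(sample, 2))
```

In a distributed setting such as the one the abstract describes, the cost of interest is how much of this computation can be done without shipping every point to one site; the sketch above deliberately ignores that and assumes all points are available locally.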
Citation:
Graham Cormode, S. Muthukrishnan, Wei Zhuang, "Conquering the Divide: Continuous Clustering of Distributed Data Streams," ICDE, pp. 1036-1045, 2007 IEEE 23rd International Conference on Data Engineering, 2007