loading...
 This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
21st International Conference on Data Engineering (ICDE'05)
Finding (Recently) Frequent Items in Distributed Data Streams
Tokyo, Japan
April 05-April 08
ISBN: 0-7695-2285-8
Amit Manjhi, Carnegie Mellon University
Vladislav Shkapenyuk, Carnegie Mellon University
Kedar Dhamdhere, Carnegie Mellon University
Christopher Olston, Carnegie Mellon University
We consider the problem of maintaining frequency counts for items occurring frequently in the union of multiple distributed data streams. Na??ve methods of combining approximate frequency counts from multiple nodes tend to result in excessively large data structures that are costly to transfer among nodes. To minimize communication requirements, the degree of precision maintained by each node while counting item frequencies must be managed carefully. We introduce the concept of a precision gradient for managing precision when nodes are arranged in a hierarchical communication structure. We then study the optimization problem of how to set the precision gradient so as to minimize communication, and provide optimal solutions that minimize worst-case communication load over all possible inputs. We then introduce a variant designed to perform well in practice, with input data that does not conform to worst-case characteristics. We verify the effectiveness of our approach empirically using real-world data, and show that our methods incur substantially less communication than na??ve approaches while providing the same error guarantees on answers.
Citation:
Amit Manjhi, Vladislav Shkapenyuk, Kedar Dhamdhere, Christopher Olston, "Finding (Recently) Frequent Items in Distributed Data Streams," icde, pp.767-778, 21st International Conference on Data Engineering (ICDE'05), 2005
Usage of this product signifies your acceptance of the Terms of Use.