2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (2014)
Chicago, IL, USA
May 26, 2014 to May 29, 2014
ISBN: 978-1-4799-2784-5
pp: 927-932
ABSTRACT
In this paper we present the scaling of BTWorld, our MapReduce-based approach to observing and analyzing the global BitTorrent network, which we have been monitoring for the past 4 years. BTWorld currently provides a comprehensive and complex set of queries implemented in Pig Latin, with data dependencies between them. These queries translate to several MapReduce jobs whose execution times and input sizes both follow heavy-tailed distributions. Processing more than 1 TB of BitTorrent data with our BTWorld workflow required an in-depth analysis of the entire software stack and the design of a complete optimization cycle. We analyze our system from both theoretical and experimental perspectives, and we show how we attained a scale of data processing 15 times larger than in our previous results.
INDEX TERMS
Runtime, Optimization, Peer-to-peer computing, Software, Data mining, Big data, Monitoring
CITATION

B. Ghit, M. Capota, T. Hegeman, J. Hidders, D. Epema and A. Iosup, "V for Vicissitude: The Challenge of Scaling Complex Big Data Workflows," 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Chicago, IL, USA, 2014, pp. 927-932.
doi:10.1109/CCGrid.2014.97