27th IEEE International Conference on Distributed Computing Systems (ICDCS 2007)
Toronto, Canada
June 25, 2007 to June 27, 2007
ISBN: 0-7695-2837-3
pp: 58
Arthur B. Maccabe, University of New Mexico
Patrick G. Bridges, University of New Mexico
Wenbin Zhu, University of New Mexico
<p>For large-scale parallel applications, lightweight online monitoring can enable a wide range of online adaptations, including load balancing, power management, and progress monitoring. The processing and monitoring overhead of centralized global tracing techniques makes them unsuitable for such tasks. Purely local tools, on the other hand, fail to provide the global information necessary for many desirable online adaptations of large-scale applications.</p> <p>In this paper, we describe a novel distributed online measurement method for large-scale applications called Embedded Gossip (EG). EG works by piggybacking performance information about application behavior on existing application messages and merging received information with previously known data in a fashion customized to the needs of a particular monitoring task. EG thus provides each process with both local and global views of application behavior with low overhead.</p> <p>To illustrate the capabilities of Embedded Gossip, we also show that it disseminates global information in a timely fashion for a variety of monitoring tasks, including critical path profiling, workload imbalance monitoring, and progress monitoring. This global information has many potential uses, including imbalance detection for load balancing and energy management tools, progress monitoring for batch schedulers, and other performance debugging and optimization techniques.</p>
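The piggyback-and-merge idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Process`, `record`, `send`, `receive`, and `newest_wins` are hypothetical, the "messages" are plain tuples rather than real MPI traffic, and the versioned "newest entry wins" merge is just one assumed policy standing in for the task-specific merge the abstract mentions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# All names here are hypothetical; the abstract does not specify an API.
Entry = Tuple[int, float]   # (logical version, measured value)
View = Dict[int, Entry]     # rank -> newest entry known for that rank


def newest_wins(mine: View, theirs: View) -> View:
    """Assumed merge policy: per rank, keep the entry with the higher version."""
    merged = dict(mine)
    for rank, entry in theirs.items():
        if rank not in merged or entry[0] > merged[rank][0]:
            merged[rank] = entry
    return merged


@dataclass
class Process:
    rank: int
    merge: Callable[[View, View], View] = newest_wins  # swappable per monitoring task
    view: View = field(default_factory=dict)
    version: int = 0

    def record(self, value: float) -> None:
        """Update this process's own measurement (its local view)."""
        self.version += 1
        self.view[self.rank] = (self.version, value)

    def send(self, payload: object) -> Tuple[object, View]:
        """Attach a copy of the current view to an outgoing application message."""
        return payload, dict(self.view)

    def receive(self, message: Tuple[object, View]) -> object:
        """Merge the piggybacked view, then hand the payload to the application."""
        payload, piggyback = message
        self.view = self.merge(self.view, piggyback)
        return payload


# Three processes exchanging ordinary application messages.
p0, p1, p2 = Process(0), Process(1), Process(2)
p0.record(3.5)                      # e.g. seconds of compute in the last phase
p1.record(1.0)
p2.record(2.0)
p1.receive(p0.send("halo data"))    # p1 learns about p0
p2.receive(p1.send("halo data"))    # p2 learns about p0 and p1 transitively

values = [value for _, value in p2.view.values()]
imbalance = max(values) - min(values)   # a simple workload-imbalance estimate at p2
```

In this sketch no extra messages are sent: global information spreads only as a side effect of communication the application already performs, which is why `p0` still knows only its own measurement while `p2` has entries for all three ranks. A different monitoring task would pass a different `merge` function, for example one that keeps a running maximum for critical-path estimation.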
Arthur B. Maccabe, Patrick G. Bridges, Wenbin Zhu, "Embedded Gossip: Lightweight Online Measurement for Large-Scale Applications", 27th IEEE International Conference on Distributed Computing Systems (ICDCS 2007), p. 58, 2007, doi:10.1109/ICDCS.2007.107