High-level parallel programming models supporting dynamic fine-grained threads in a global object space, are becoming increasingly popular for expressing irregular applications based on sophisticated adaptive algorithms and pointer-based data structures. However, implementing these multithreaded computations on scalable parallel machines poses significant challenges, particularly with respect to load-balancing. Load-balancing techniques must simultaneously incur low overhead to support fine-grained threads as well as be sophisticated enough to preserve data locality and thread execution priority.
This paper presents a hierarchical framework which addresses these conflicting goals by viewing the computation as being made up of different thread subsets, each of which are load-balanced independently. In contrast to previous processor-centric approaches that have advocated the use of a uniform policy for load-balancing all threads in a computation, our framework allows each thread subset to be load-balanced using a policy most suited to its characteristics (e.g., locality or priority sensitivity). The framework consists of two parts: (i) language support which permits a programmer to tag different thread subsets with appropriate policies, and (ii) run-time support which synthesizes overall application load-balance by composing these individual policies.
This framework has been implemented in the Illinois Concert runtime system, an execution platform for fine-grained concurrent object-oriented languages. Results for four large irregular applications on the Cray T3D and the SGI Origin 2000 demonstrate advantages of the hierarchical framework: performance improves by up to an order of magnitude as compared to using a uniform load-balancing policy.