2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) (2016)
Sept. 11, 2016 to Sept. 15, 2016
Miquel Pericas , Chalmers University of Technology, SE-412 96 Göteborg, Sweden
We have analyzed the ξ-TAO model and runtime with three benchmarks: a parallel hybrid quicksort/mergesort, a 2D Jacobi stencil, and the Unbalanced Tree Search (UTS) benchmark. We run ξ-TAO implementations of these benchmarks on a Dell PowerEdge R815 server with four AMD Opteron 6348 processors, totalling 8 NUMA nodes and 48 cores. Figure 2 shows the scalability of UTS+ξ-TAO compared to thread-centric runtimes based on work stealing (MassiveThreads , Intel TBB) and hierarchical WS+PDF (Qthreads ). UTS was implemented in ξ-TAO by grouping sibling nodes into a TAO and attaching a static scheduler. UTS has a very small working set, hence the best performance is achieved when each TAO is mapped to a single core (ξ-TAO-w1). The combination of tight reuse, pre-built task groups and static scheduling results in high scalability for UTS+ξ-TAO. Unlike UTS, the parallel sorting and 2D Jacobi benchmarks are memory intensive benchmarks. By selecting assemblies of width two (i.e., core-width of the L2 caches) and six (i.e., core-width of the L3 cache) ξ-TAO is able to outperform competing approaches thanks to better management of available memory bandwidth and shared cache capacity.
Processor scheduling, Runtime, Bandwidth, Scheduling, Multicore processing, Benchmark testing
M. Pericas, "POSTER: ξ-TAO: A cache-centric execution model and runtime for deep parallel multicore topologies," 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT), Haifa, Israel, 2016, pp. 429-431.