Issue No.12 - Dec. (2013 vol.24)
pp: 2334-2343
Quan Chen, Shanghai Jiao Tong University, Shanghai
Minyi Guo, Shanghai Jiao Tong University, Shanghai
Zhiyi Huang, University of Otago, Dunedin
ABSTRACT
Modern multicore computers often adopt a multisocket multicore architecture with shared caches in each socket. However, traditional work-stealing schedulers steal tasks at random, which tends to pollute the shared caches and incur additional cache misses. To mitigate this problem, this paper proposes an Adaptive Cache-Aware Bi-tier work-stealing (A-CAB) scheduler. A-CAB improves the performance of memory-bound applications by reducing the memory footprint and cache misses of tasks running inside the same CPU socket. A-CAB adaptively uses a DAG partitioner to divide an execution Directed Acyclic Graph (DAG) into an intersocket tier and an intrasocket tier. Tasks in the intersocket tier are scheduled across sockets, while tasks in the intrasocket tier are scheduled within the same socket. Experimental results show that A-CAB can improve the performance of memory-bound applications by up to 74.4 percent compared with traditional work-stealing.
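The bi-tier idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Task`, `tier`, `may_steal`, `boundary_level`, and `cores_per_socket` are hypothetical, the partition is assumed to be a simple cut at a fixed depth of a divide-and-conquer execution DAG, and the adaptive choice of that cut is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    level: int  # depth in the execution DAG (root = 0); hypothetical field


def tier(task: Task, boundary_level: int) -> str:
    """Classify a task: levels above the cut are 'inter', the rest 'intra'.

    Assumption: the DAG partitioner is reduced to a single depth threshold.
    """
    return "inter" if task.level < boundary_level else "intra"


def socket_of(core: int, cores_per_socket: int) -> int:
    """Map a core id to its socket id, assuming cores are numbered socket by socket."""
    return core // cores_per_socket


def may_steal(thief_core: int, victim_core: int, task: Task,
              boundary_level: int, cores_per_socket: int) -> bool:
    """Return True if the thief may take `task` from the victim.

    Intersocket-tier tasks may move to any socket; intrasocket-tier tasks
    stay within the victim's socket so they keep sharing its cache.
    """
    if tier(task, boundary_level) == "inter":
        return True
    return (socket_of(thief_core, cores_per_socket)
            == socket_of(victim_core, cores_per_socket))


if __name__ == "__main__":
    # Two sockets with four cores each; the cut is placed at level 2.
    coarse = Task("coarse", level=1)
    fine = Task("fine", level=3)
    for task in (coarse, fine):
        print(task.name, tier(task, 2),
              "cross-socket steal allowed:", may_steal(0, 5, task, 2, 4))
```

In this sketch, a thief on core 0 may take the coarse task from core 5 on the other socket, but the fine-grained task can only be taken by cores 4 to 7, which share a socket with core 5.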
INDEX TERMS
Cache memory, Multicore processing, Iterative methods, Instruction sets, Optimal scheduling, divide-and-conquer programs, Cache aware, work-stealing, multisocket multicore architectures
CITATION
Quan Chen, Minyi Guo, Zhiyi Huang, "Adaptive Cache Aware Bitier Work-Stealing in Multisocket Multicore Architectures", IEEE Transactions on Parallel & Distributed Systems, vol. 24, no. 12, pp. 2334-2343, Dec. 2013, doi:10.1109/TPDS.2012.322