The Community for Technology Leaders
RSS Icon
Issue No.01 - January/February (2011 vol.28)
pp: 39-50
Ryan Newton , Intel
Mark Hampton , Intel
Geoff Lowney , Intel
In the era of multicores, many applications that require substantial computing power and data crunching can now run on desktop PCs. However, to achieve the best possible performance, developers must write applications in a way that exploits both parallelism and cache locality. This article proposes one such approach for x86-based architectures that uses cache-oblivious techniques to divide a large problem into smaller subproblems, which are mapped to different cores or threads. The authors then use the compiler to exploit SIMD parallelism within each subproblem. Finally, they use autotuning to pick the best parameter values throughout the optimization process. The authors have implemented this approach with the Intel compiler and the newly developed Intel Software Autotuning Tool. Experimental results collected on a dual-socket quad-core Nehalem show that the approach achieves an average speed up of almost 20x over the best serial cases for an important set of computational kernels.
multicore, throughput computing, cache-oblivious algorithms, parallelization, simdization, vectorization, autotuning, Intel Nehalem
Chi-Keung Luk, Ryan Newton, William Hasenplaugh, Mark Hampton, Geoff Lowney, "A Synergetic Approach to Throughput Computing on x86-Based Multicore Desktops", IEEE Software, vol.28, no. 1, pp. 39-50, January/February 2011, doi:10.1109/MS.2011.2
1. M. Frigo et al., "Cache Oblivious Algorithms," Proc. 40th Ann. Symp. Foundations of Computer Science, ACM Press, 1999, pp. 285–298.
2. P. Kumar, Cache Oblivious Algorithms, LNCS 2625, Springer, 2003, pp. 193–212.
3. H. Prokop, "Cache-Oblivious Algorithms," master's thesis, Laboratory for Computer Science, Massachusetts Inst. of Technology, June 1999.
4. A.J. Bik, The Software Vectorization Handbook, Intel Press, 2006.
5. D. Bailey et al., "PERI Auto-Tuning," J. Physics: Conference Series (SciDAC 2008), vol. 125, no. 1, 2008;
6. R.C. Whaley, A. Petitet, and J.J. Dongarra, "Automated Empirical Optimization of Software and the ATLAS Project," Parallel Computing, vol. 27, nos. 1–2, 2001, pp. 3–35.
7. V. Strassen, "Gaussian Elimination Is Not Optimal," Numerical Mathematics, vol. 13, 1969, pp. 354–356.
8. "Using Parallelism: (CEAN) C/C++ Extension for Array Notation," Intel, Mar. 2010.
9. J. Reinders, Intel Threading Building Blocks, O'Reilly, 2007.
10. R. Chandra et al., Parallel Programming in OpenMP, Morgan Kaufmann, 2001.
11. R.D. Blumofe et al., "Cilk: An Efficient Multithreaded Runtime System," Proc. 5th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP), ACM Press, 1995, pp. 207–216.
12. M. Hall et al, "Autotuning and Specialization: Speeding up Nek5000 with Compiler Technology," Proc. Int'l Conf. Supercomputing, ACM Press, 2010, pp. 253–262.
13. M. Frigo and V. Strumpen, "Cache Oblivious Stencil Computations," Proc. 2005 Int'l Conf. Supercomputing, ACM Press, 2005, pp. 361–366.
14. J.L. Henning, "SPEC CPU2006 Benchmark Descriptions," Computer Architecture News, vol. 34, no. 4, 2006, pp. 1–17.
15. M.A. Bender, E.D. Demaine, and M. Farach-Colton, "Cache-Oblivious B-trees," Proc. 41st Ann. Symp. Foundations of Computer Science, ACM Press, 2000, pp. 399–409.
16. J. Chhugani et al., "Efficient Implementation of Sorting on Multi-core SIMD CPU Architecture," Proc. VLDB Endowment, vol. 1, no. 2, 2008, pp. 1313–1324.
17. S.G. Akl and N. Santoro, "Optimal Parallel Merging and Sorting without Memory Conflicts," IEEE Trans. Computers, vol. 36, no. 11, 1987, pp. 1367–1369.
18. V.W. Lee et al., "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA), ACM Press, 2010, pp. 451–460.
16 ms
(Ver 2.0)

Marketing Automation Platform Marketing Automation Tool