A Synergetic Approach to Throughput Computing on x86-Based Multicore Desktops
January/February 2011 (vol. 28, no. 1), pp. 39–50
In the multicore era, many applications that require substantial computing power and heavy data processing can now run on desktop PCs. However, to achieve the best possible performance, developers must write applications in a way that exploits both parallelism and cache locality. This article proposes one such approach for x86-based architectures. It uses cache-oblivious techniques to divide a large problem into smaller subproblems, which are mapped to different cores or threads; the compiler then exploits SIMD parallelism within each subproblem; and autotuning picks the best parameter values throughout the optimization process. The authors implemented this approach with the Intel compiler and the newly developed Intel Software Autotuning Tool. Experimental results collected on a dual-socket, quad-core Nehalem system show that the approach achieves an average speedup of almost 20x over the best serial versions of an important set of computational kernels.
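The general pattern behind the approach can be illustrated with a minimal sketch. The following C code is not taken from the article; it shows a cache-oblivious, divide-and-conquer matrix transpose in which the recursion splits the problem until a subproblem is small enough to be cache-resident, OpenMP tasks map independent subproblems to cores, and the simple base-case loops are left to the compiler's auto-vectorizer for SIMD execution. The CUTOFF constant and all function names are illustrative stand-ins; in the article's setting, such a base-case size would be one of the parameters chosen by the Intel Software Autotuning Tool.

/* Minimal sketch (not the authors' code): a cache-oblivious, divide-and-conquer
 * matrix transpose.  The recursion splits the larger dimension until the block
 * fits in cache at every level of the hierarchy; OpenMP tasks map independent
 * subproblems to cores, and the plain base-case loops are left to the
 * compiler's auto-vectorizer.  CUTOFF is a hypothetical tuning knob. */
#include <omp.h>

#define CUTOFF 64   /* base-case block size; candidate for autotuning */

/* Transpose the rb-by-cb block of A starting at (r0, c0) into B (B = A^T).
 * A and B are row-major with leading dimensions lda and ldb. */
static void co_transpose(const double *A, double *B, int lda, int ldb,
                         int r0, int c0, int rb, int cb)
{
    if (rb <= CUTOFF && cb <= CUTOFF) {
        /* Base case: small, cache-resident block; the contiguous inner loop
         * is a candidate for compiler SIMDization. */
        for (int i = r0; i < r0 + rb; i++)
            for (int j = c0; j < c0 + cb; j++)
                B[j * ldb + i] = A[i * lda + j];
    } else if (rb >= cb) {
        /* Split the row range; the two halves touch disjoint data, so they
         * can run as independent tasks on different cores. */
        #pragma omp task
        co_transpose(A, B, lda, ldb, r0, c0, rb / 2, cb);
        co_transpose(A, B, lda, ldb, r0 + rb / 2, c0, rb - rb / 2, cb);
        #pragma omp taskwait
    } else {
        /* Split the column range. */
        #pragma omp task
        co_transpose(A, B, lda, ldb, r0, c0, rb, cb / 2);
        co_transpose(A, B, lda, ldb, r0, c0 + cb / 2, rb, cb - cb / 2);
        #pragma omp taskwait
    }
}

void transpose(const double *A, double *B, int n)   /* B = A^T for n-by-n A */
{
    #pragma omp parallel
    #pragma omp single
    co_transpose(A, B, n, n, 0, 0, n, n);
}

Because the recursion blocks the computation for every cache level at once, the only machine-specific choice left in this sketch is the base-case size, which is precisely where empirical autotuning pays off.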

1. M. Frigo et al., "Cache Oblivious Algorithms," Proc. 40th Ann. Symp. Foundations of Computer Science, IEEE CS Press, 1999, pp. 285–298.
2. P. Kumar, "Cache Oblivious Algorithms," LNCS 2625, Springer, 2003, pp. 193–212.
3. H. Prokop, "Cache-Oblivious Algorithms," master's thesis, Laboratory for Computer Science, Massachusetts Inst. of Technology, June 1999.
4. A.J. Bik, The Software Vectorization Handbook, Intel Press, 2006.
5. D. Bailey et al., "PERI Auto-Tuning," J. Physics: Conference Series (SciDAC 2008), vol. 125, no. 1, 2008; www.mcs.anl.gov/uploads/cels/papersP1517.pdf.
6. R.C. Whaley, A. Petitet, and J.J. Dongarra, "Automated Empirical Optimization of Software and the ATLAS Project," Parallel Computing, vol. 27, nos. 1–2, 2001, pp. 3–35.
7. V. Strassen, "Gaussian Elimination Is Not Optimal," Numerische Mathematik, vol. 13, 1969, pp. 354–356.
8. "Using Parallelism: (CEAN) C/C++ Extension for Array Notation," Intel, Mar. 2010.
9. J. Reinders, Intel Threading Building Blocks, O'Reilly, 2007.
10. R. Chandra et al., Parallel Programming in OpenMP, Morgan Kaufmann, 2001.
11. R.D. Blumofe et al., "Cilk: An Efficient Multithreaded Runtime System," Proc. 5th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP), ACM Press, 1995, pp. 207–216.
12. M. Hall et al., "Autotuning and Specialization: Speeding up Nek5000 with Compiler Technology," Proc. Int'l Conf. Supercomputing, ACM Press, 2010, pp. 253–262.
13. M. Frigo and V. Strumpen, "Cache Oblivious Stencil Computations," Proc. 2005 Int'l Conf. Supercomputing, ACM Press, 2005, pp. 361–366.
14. J.L. Henning, "SPEC CPU2006 Benchmark Descriptions," Computer Architecture News, vol. 34, no. 4, 2006, pp. 1–17.
15. M.A. Bender, E.D. Demaine, and M. Farach-Colton, "Cache-Oblivious B-trees," Proc. 41st Ann. Symp. Foundations of Computer Science, IEEE CS Press, 2000, pp. 399–409.
16. J. Chhugani et al., "Efficient Implementation of Sorting on Multi-core SIMD CPU Architecture," Proc. VLDB Endowment, vol. 1, no. 2, 2008, pp. 1313–1324.
17. S.G. Akl and N. Santoro, "Optimal Parallel Merging and Sorting without Memory Conflicts," IEEE Trans. Computers, vol. 36, no. 11, 1987, pp. 1367–1369.
18. V.W. Lee et al., "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA), ACM Press, 2010, pp. 451–460.
19. NVIDIA, "CUDA SDK," www.nvidia.com/object/cuda_get.html.

Index Terms:
multicore, throughput computing, cache-oblivious algorithms, parallelization, simdization, vectorization, autotuning, Intel Nehalem
Citation:
Chi-Keung Luk, Ryan Newton, William Hasenplaugh, Mark Hampton, Geoff Lowney, "A Synergetic Approach to Throughput Computing on x86-Based Multicore Desktops," IEEE Software, vol. 28, no. 1, pp. 39-50, Jan.-Feb. 2011, doi:10.1109/MS.2011.2