Issue No. 01 - January/February (2011 vol. 28)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MS.2011.2
Chi-Keung Luk, Intel
Ryan Newton, Intel
William Hasenplaugh, Intel
Mark Hampton, Intel
Geoff Lowney, Intel
In the era of multicores, many applications that require substantial computing power and data crunching can now run on desktop PCs. However, to achieve the best possible performance, developers must write applications in a way that exploits both parallelism and cache locality. This article proposes one such approach for x86-based architectures. It uses cache-oblivious techniques to divide a large problem into smaller subproblems, which are mapped to different cores or threads. The authors then use the compiler to exploit SIMD parallelism within each subproblem. Finally, they use autotuning to pick the best parameter values throughout the optimization process. The authors have implemented this approach with the Intel compiler and the newly developed Intel Software Autotuning Tool. Experimental results collected on a dual-socket quad-core Nehalem show that the approach achieves an average speedup of almost 20x over the best serial cases for an important set of computational kernels.
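The divide-and-conquer step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: dense matrix multiplication is assumed here only as a familiar kernel, and the names `co_matmul`, `base_kernel`, and `base_case` are hypothetical. The recursion halves the largest dimension until the subproblem is small, the base case uses a unit-stride inner loop that a compiler may vectorize, and `base_case` stands in for the kind of parameter an autotuner could search over.

```c
#include <assert.h>
#include <stddef.h>

/* Base case: C += A * B on a small block (row-major, with leading
 * dimensions lda/ldb/ldc). The inner j loop walks b and c with unit
 * stride, which is the shape a vectorizing compiler can turn into SIMD. */
static void base_kernel(size_t m, size_t n, size_t k,
                        const double *a, const double *b, double *c,
                        size_t lda, size_t ldb, size_t ldc)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t p = 0; p < k; p++) {
            const double aip = a[i * lda + p];
            for (size_t j = 0; j < n; j++) {
                c[i * ldc + j] += aip * b[p * ldb + j];
            }
        }
    }
}

/* Cache-oblivious recursion: split the largest of m, n, k in half until
 * all three are at most base_case. No cache size appears in the code;
 * base_case (hypothetical tunable) only bounds the recursion overhead. */
void co_matmul(size_t m, size_t n, size_t k,
               const double *a, const double *b, double *c,
               size_t lda, size_t ldb, size_t ldc, size_t base_case)
{
    assert(base_case >= 1);

    if (m <= base_case && n <= base_case && k <= base_case) {
        base_kernel(m, n, k, a, b, c, lda, ldb, ldc);
        return;
    }

    if (m >= n && m >= k) {
        /* Split rows of A and C. The halves write disjoint rows of C,
         * so they are independent and could go to different threads. */
        const size_t h = m / 2;
        co_matmul(h, n, k, a, b, c, lda, ldb, ldc, base_case);
        co_matmul(m - h, n, k, a + h * lda, b, c + h * ldc,
                  lda, ldb, ldc, base_case);
    } else if (n >= k) {
        /* Split columns of B and C. Also independent subproblems. */
        const size_t h = n / 2;
        co_matmul(m, h, k, a, b, c, lda, ldb, ldc, base_case);
        co_matmul(m, n - h, k, a, b + h, c + h, lda, ldb, ldc, base_case);
    } else {
        /* Split the shared dimension. Both halves accumulate into the
         * same C, so this sketch runs them one after the other. */
        const size_t h = k / 2;
        co_matmul(m, n, h, a, b, c, lda, ldb, ldc, base_case);
        co_matmul(m, n, k - h, a + h, b + h * ldb, c, lda, ldb, ldc, base_case);
    }
}
```

For an M-by-K matrix `a`, a K-by-N matrix `b`, and a zero-initialized M-by-N matrix `c`, all stored contiguously in row-major order, a call would look like `co_matmul(M, N, K, a, b, c, K, N, N, 32)`. The value 32 is an arbitrary placeholder; in the spirit of the abstract, one would time several values of `base_case` on the target machine and keep the fastest.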
multicore, throughput computing, cache-oblivious algorithms, parallelization, simdization, vectorization, autotuning, Intel Nehalem
G. Lowney, R. Newton, M. Hampton, C. Luk and W. Hasenplaugh, "A Synergetic Approach to Throughput Computing on x86-Based Multicore Desktops," in IEEE Software, vol. 28, no. 1, pp. 39-50, 2011.