2014 23rd International Conference on Parallel Architecture and Compilation (PACT) (2014)
Edmonton, Canada
Aug. 23, 2014 to Aug. 27, 2014
ISBN: 978-1-5090-6607-0
pp: 127-138
Wookeun Jung , Department of Computer Science and Engineering, Seoul National University, Seoul 151-744, Korea
Jongsoo Park , Parallel Computing Lab, Intel Corporation, 2200 Mission College Blvd., Santa Clara, California 95054, USA
Jaejin Lee , Department of Computer Science and Engineering, Seoul National University, Seoul 151-744, Korea
ABSTRACT
Histograms are used in various fields to quickly profile the distribution of a large amount of data. However, it is challenging to efficiently utilize the abundant parallel resources of modern processors for histogram construction. To make matters worse, the most efficient implementation varies depending on input parameters (e.g., input distribution, number of bins, and data type) and architecture parameters (e.g., cache capacity and SIMD width). This paper presents versatile histogram methods that achieve competitive performance across a wide range of input types and target architectures. Our open source implementations are highly optimized for various cases and scale to more threads and wider SIMD units. We also show that histogram construction can be significantly accelerated by Intel® Xeon Phi coprocessors for common input data sets because of their compute power from many cores and instructions for efficient vectorization, such as gather-scatter. For histograms with 256 fixed-width bins, a dual-socket 8-core Intel® Xeon® E5-2690 achieves 13 billion bin updates per second (GUPS), while a 60-core Intel® Xeon Phi 5110P coprocessor achieves 18 GUPS for a skewed input. For histograms with 256 variable-width bins, the Xeon processor achieves 4.7 GUPS, while the Xeon Phi coprocessor achieves 9.7 GUPS for a skewed input. For text histograms, or word count, the Xeon processor achieves 342.4 million words per second (MWPS), which is 4.12× and 3.46× faster than PHOENIX and TBB, respectively. The Xeon Phi coprocessor achieves 401.4 MWPS, which is 1.17× faster than the Xeon processor. Since histogram construction captures essential characteristics of more general reduction-heavy operations, our approach can be extended to other settings.
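To make the terms in the abstract concrete, the following is a minimal sketch of fixed-width binning, variable-width binning, and a reduction over per-worker private histograms. It is not the authors' implementation: the abstract does not describe their algorithms, and every name here (`fixed_width_bin`, `variable_width_bin`, `private_histogram`, `parallel_histogram`) is hypothetical. Privatization followed by a reduction is a common general approach and is assumed here only for illustration. Python threads also do not deliver the multi-core or SIMD speedups the paper measures, so the sketch shows structure, not performance.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def fixed_width_bin(x, lo, hi, num_bins):
    """Map x in [lo, hi) to a bin index by direct arithmetic, no search."""
    idx = int((x - lo) * num_bins / (hi - lo))
    return min(max(idx, 0), num_bins - 1)  # clamp out-of-range values


def variable_width_bin(x, edges):
    """Map x to a bin by binary search over sorted bin edges.

    edges has num_bins + 1 entries; bin i covers [edges[i], edges[i + 1]).
    """
    idx = bisect_right(edges, x) - 1
    return min(max(idx, 0), len(edges) - 2)  # clamp out-of-range values


def private_histogram(chunk, bin_of, num_bins):
    """Build a histogram that only one worker writes to (no shared updates)."""
    hist = [0] * num_bins
    for x in chunk:
        hist[bin_of(x)] += 1
    return hist


def parallel_histogram(data, bin_of, num_bins, num_workers=4):
    """Split data into chunks, histogram each privately, then reduce."""
    step = max(1, -(-len(data) // num_workers))  # ceiling division
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(
            pool.map(lambda c: private_histogram(c, bin_of, num_bins), chunks)
        )
    # Reduction step: sum the private histograms bin by bin.
    total = [0] * num_bins
    for hist in partials:
        for i, count in enumerate(hist):
            total[i] += count
    return total
```

In this sketch, fixed-width bins need only arithmetic per element, while variable-width bins need a search over the bin edges. That extra per-element work is one plausible reason the variable-width throughput figures in the abstract are lower than the fixed-width ones.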
INDEX TERMS
Histograms, Instruction sets, Computer architecture, Hardware, Parallel processing, Coprocessors, Multi-core, Histogram, Algorithms, Performance, SIMD
CITATION
Wookeun Jung, Jongsoo Park, Jaejin Lee, "Versatile and scalable parallel histogram construction", 2014 23rd International Conference on Parallel Architecture and Compilation (PACT), vol. 00, no. , pp. 127-138, 2014, doi:10.1145/2628071.2628108