Issue No.03 - March (2012 vol.23)
pp: 397-404
Nan Zhang, Xi'an Jiaotong-Liverpool University, Suzhou
ABSTRACT
We present a novel parallel algorithm for computing scan operations on x86 multicore processors. The best known existing parallel scan for the same platform requires the number of processors to be a power of two; the proposed method removes this constraint. The design takes architectural features of x86 multicore processors into account, so that the rate of cache misses is reduced and the cost of thread synchronization and management is minimized. Tests on a machine with dual-socket × quad-core Intel Xeon E5405 processors showed that the proposed solution outperformed the best known parallel reference. A novel approach to sparse matrix-vector multiplication (SpMV) based on the proposed scan is then explained. Unlike existing approaches, which use backward segmented operations, it uses forward ones for more efficient caching. An implementation of the proposed SpMV was tested against the SpMV in Intel's Math Kernel Library (MKL), and merits were found in the proposed approach.
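For readers unfamiliar with the two building blocks named in the abstract, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes a textbook three-phase blocked scan (local scan, scan of chunk totals, offset add), which works for any number of workers rather than only powers of two, and a CSR-format SpMV that sums each row with a forward segmented scan. All function and parameter names (`blocked_inclusive_scan`, `spmv_forward_segmented`, `row_ptr`, `col_idx`) are hypothetical, and the code makes no attempt to reproduce the cache or synchronization optimizations the paper describes.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def chunk_bounds(n, p):
    """Split range(n) into p nearly equal contiguous chunks (p is arbitrary)."""
    base, extra = divmod(n, p)
    bounds, start = [], 0
    for i in range(p):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def blocked_inclusive_scan(xs, p):
    """Inclusive prefix sum of xs using p workers, for any p >= 1."""
    out = list(xs)
    bounds = chunk_bounds(len(out), p)

    def local_scan(bound):
        lo, hi = bound
        out[lo:hi] = accumulate(out[lo:hi])
        return out[hi - 1] if hi > lo else 0  # chunk total (0 if empty)

    def add_offset(args):
        (lo, hi), offset = args
        for i in range(lo, hi):
            out[i] += offset

    with ThreadPoolExecutor(max_workers=p) as pool:
        # Phase 1: each worker scans its own chunk and reports the chunk total.
        totals = list(pool.map(local_scan, bounds))
        # Phase 2: short serial exclusive scan over the p chunk totals.
        offsets = [0] + list(accumulate(totals))[:-1]
        # Phase 3: each worker adds the sum of all preceding chunks.
        list(pool.map(add_offset, zip(bounds, offsets)))
    return out


def spmv_forward_segmented(row_ptr, col_idx, vals, x):
    """y = A @ x for a CSR matrix, summing rows with a forward segmented scan."""
    n_rows = len(row_ptr) - 1
    products = [v * x[c] for v, c in zip(vals, col_idx)]
    # Head flags mark the first nonzero of every non-empty row (segment start).
    head = [False] * len(products)
    for r in range(n_rows):
        if row_ptr[r] < row_ptr[r + 1]:
            head[row_ptr[r]] = True
    # Forward segmented inclusive scan: running sums restart at each head flag.
    for k in range(1, len(products)):
        if not head[k]:
            products[k] += products[k - 1]
    # The last element of each segment now holds that row's dot product.
    return [
        products[row_ptr[r + 1] - 1] if row_ptr[r] < row_ptr[r + 1] else 0.0
        for r in range(n_rows)
    ]
```

In this sketch the segmented scan is written serially for clarity; in a parallel setting it could be chunked in the same three-phase way as `blocked_inclusive_scan`, with each chunk's offset applied only up to the first head flag in that chunk.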
INDEX TERMS
Parallel algorithms, parallel scan, prefix sum, multicore computing, sparse matrix-vector multiplication.
CITATION
Nan Zhang, "A Novel Parallel Scan for Multicore Processors and Its Application in Sparse Matrix-Vector Multiplication", IEEE Transactions on Parallel & Distributed Systems, vol.23, no. 3, pp. 397-404, March 2012, doi:10.1109/TPDS.2011.174