Issue No. 2 - February 2009 (vol. 20)
pp. 261-274
Guangming Tan, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Ninghui Sun, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Guang R. Gao, University of Delaware, Newark
ABSTRACT
Dynamic programming (DP) is a popular technique used to solve combinatorial search and optimization problems. This paper focuses on one type of DP called nonserial polyadic dynamic programming (NPDP). Owing to the nonuniform data dependencies of NPDP, it is difficult to exploit either parallelism or locality. Worse still, emerging multi/many-core architectures with small on-chip memory make these issues more challenging. In this paper, we address the challenges of exploiting the fine-grain parallelism and locality of NPDP on multicore architectures. We describe a latency-tolerant model and a percolation technique for programming on multicore architectures. At the algorithmic level, both parallelism and locality benefit from a specific data dependence transformation of NPDP. We then propose a parallel pipelining algorithm that decomposes computation operators and percolates data through the memory hierarchy to create just-in-time locality. To predict execution time, we formulate an analytical performance model of the parallel algorithm. The parallel pipelining algorithm achieves not only high scalability on the 160-core IBM Cyclops64 but also portable performance across the 8-core Sun Niagara and quad-core Intel Clovertown.
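For context, the class of recurrences the abstract refers to can be illustrated by a minimal serial sketch of an NPDP computation over a triangular table, filled diagonal by diagonal. This is only an assumed, generic formulation (the cost term w(i, j) and the problem size N below are illustrative placeholders), not the paper's actual algorithm or its parallel pipelining scheme.

/* Minimal serial sketch of a nonserial polyadic DP recurrence of the
 * assumed form:  c[i][j] = min over i <= k < j of (c[i][k] + c[k+1][j]) + w(i, j)
 * Each entry depends non-uniformly on a row segment and a column segment,
 * which is what makes parallelism and locality hard to exploit. */
#include <stdio.h>
#include <limits.h>

#define N 8                      /* illustrative problem size */

static long w(int i, int j)      /* placeholder cost term */
{
    return (long)(j - i);
}

int main(void)
{
    static long c[N][N];

    /* subproblems of length 1 (main diagonal) cost nothing */
    for (int i = 0; i < N; i++)
        c[i][i] = 0;

    /* fill the upper triangle diagonal by diagonal, so every
     * dependency of c[i][j] has already been computed */
    for (int len = 1; len < N; len++) {
        for (int i = 0; i + len < N; i++) {
            int j = i + len;
            long best = LONG_MAX;
            for (int k = i; k < j; k++) {
                long v = c[i][k] + c[k + 1][j];
                if (v < best)
                    best = v;
            }
            c[i][j] = best + w(i, j);
        }
    }

    printf("c[0][%d] = %ld\n", N - 1, c[0][N - 1]);
    return 0;
}

The diagonal-by-diagonal order above is the natural serial schedule; the paper's contribution lies in transforming these nonuniform dependencies and pipelining blocks of the table through the on-chip memory hierarchy, which this sketch does not attempt to reproduce.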
INDEX TERMS
Dynamic programming, memory hierarchy, latency tolerant, percolation, multicore.
CITATION
Guangming Tan, Ninghui Sun, Guang R. Gao, "Improving Performance of Dynamic Programming via Parallelism and Locality on Multicore Architectures," IEEE Transactions on Parallel & Distributed Systems, vol. 20, no. 2, pp. 261-274, February 2009, doi:10.1109/TPDS.2008.78