Issue No.07 - July (2011 vol.22)
pp: 1178-1191
Shannon Kuntz, University of Notre Dame, Notre Dame
Sheng Li, Hewlett-Packard Labs and University of Notre Dame, Notre Dame
Peter M. Kogge, University of Notre Dame, Notre Dame
Irregular and dynamic applications, such as graph problems and agent-based simulations, often require fine-grained parallelism to achieve good performance. However, current multicore processors provide architectural support only for coarse-grained parallelism, making it necessary to use software-based multithreading environments to implement fine-grained parallelism effectively. Although these software-based environments have demonstrated superior performance over heavyweight, OS-level threads, they are still limited by the significant overhead involved in thread management and synchronization. To address this overhead, we propose a Lightweight Chip Multi-Threaded (LCMT) architecture that further exploits thread-level parallelism (TLP) by incorporating direct architectural support for an “unlimited” number of dynamically created lightweight threads with very low thread management and synchronization overhead. The LCMT architecture can be implemented atop a mainstream architecture with minimal extra hardware to leverage existing legacy software environments. We compare the LCMT architecture with a Niagara-like baseline architecture. For irregular and dynamic benchmarks, our results show up to 1.8X better scalability, 1.91X better performance, and, more importantly, 1.74X better performance per watt with the LCMT architecture than with the baseline architecture. For regular benchmarks, the LCMT architecture delivers performance similar to that of the baseline architecture.
Multithreaded processors, multicore processors, unlimited multithreading, irregular applications.
Shannon Kuntz, Sheng Li, Peter M. Kogge, "Lightweight Chip Multi-Threading (LCMT): Maximizing Fine-Grained Parallelism On-Chip", IEEE Transactions on Parallel & Distributed Systems, vol.22, no. 7, pp. 1178-1191, July 2011, doi:10.1109/TPDS.2010.169