CSMT: Simultaneous Multithreading for Clustered VLIW Processors
March 2010 (vol. 59 no. 3)
pp. 385-399
Manoj Gupta, Universitat Politècnica de Catalunya, Barcelona
Fermín Sánchez, Universitat Politècnica de Catalunya, Barcelona
Josep Llosa, Universitat Politècnica de Catalunya, Barcelona
Simultaneous MultiThreading (SMT) is a well-known technique that improves resource utilization by exploiting thread-level parallelism at the instruction grain level. However, implementing SMT for VLIWs requires complex structures, which is contrary to the VLIW philosophy of hardware simplicity. In this paper, we propose Cluster-level Simultaneous MultiThreading (CSMT) to allow some degree of SMT in clustered VLIW processors with low hardware cost and complexity. CSMT considers the set of operations that execute simultaneously in a given cluster as the assignment unit. To minimize cluster conflicts between threads, a very simple hardware-based cluster renaming mechanism is proposed. The hardware required to implement CSMT is cheap, realistic, and practical for a clustered VLIW processor. An analysis of the hardware required to implement CSMT shows that it is quite scalable, with up to eight threads easily supported at low hardware cost. The experimental results show that CSMT significantly improves performance when compared with other multithreading approaches suited for VLIW. For instance, with four threads, CSMT shows an average speedup of 110 percent over a single-thread VLIW architecture and 40 percent over Interleaved MultiThreading (IMT). In some cases, speedup can be as high as 225 percent over single-thread architecture and 84 percent over IMT.
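
The cluster-level merging idea described in the abstract can be illustrated with a small sketch. It is not the authors' hardware design: the abstract does not specify how renaming or thread selection is implemented, so the rotation-based renaming, the fixed thread priority order, and the names `rename`, `merge_cycle`, and `NUM_CLUSTERS` are all assumptions made for illustration. Each thread's current VLIW instruction is modeled as a bitmask of the clusters it uses, and threads are merged into one issue cycle only when their (renamed) cluster sets do not overlap.

```python
NUM_CLUSTERS = 4  # assumed machine width in clusters (illustrative only)


def rename(mask, offset, n=NUM_CLUSTERS):
    """Rotate an n-bit cluster-usage mask left by `offset` positions.

    Hypothetical renaming scheme: logical cluster c of a thread maps to
    physical cluster (c + offset) mod n. The paper's actual mechanism
    may differ.
    """
    offset %= n
    full = (1 << n) - 1
    return ((mask << offset) | (mask >> (n - offset))) & full


def merge_cycle(masks, offsets, n=NUM_CLUSTERS):
    """Select the threads that can issue together in one cycle.

    masks[i]   : logical cluster-usage bitmask of thread i's next
                 instruction, or None if the thread is stalled.
    offsets[i] : renaming offset assigned to thread i.
    Threads are considered in index order (an assumed fixed priority).
    Returns (list of issued thread ids, bitmask of occupied clusters).
    """
    occupied = 0
    issued = []
    for tid, (mask, off) in enumerate(zip(masks, offsets)):
        if mask is None:
            continue  # stalled thread contributes nothing this cycle
        physical = rename(mask, off, n)
        if physical & occupied == 0:  # no cluster conflict
            issued.append(tid)
            occupied |= physical
    return issued, occupied


if __name__ == "__main__":
    # Four threads whose compilers all favored low-numbered clusters.
    masks = [0b0001, 0b0001, 0b0011, 0b0001]
    print(merge_cycle(masks, [0, 0, 0, 0]))  # no renaming
    print(merge_cycle(masks, [0, 1, 2, 3]))  # per-thread rotation
```

In this toy input, every thread uses cluster 0, so without renaming only the highest-priority thread issues. Giving each thread a different rotation offset spreads the same logical cluster sets over different physical clusters, which is the kind of conflict reduction the abstract attributes to cluster renaming.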

Index Terms:
ILP, VLIW architectures, clustered VLIW architectures, multithreaded processors, simultaneous multithreading.
Manoj Gupta, Fermín Sánchez, Josep Llosa, "CSMT: Simultaneous Multithreading for Clustered VLIW Processors," IEEE Transactions on Computers, vol. 59, no. 3, pp. 385-399, March 2010, doi:10.1109/TC.2009.96