Issue No.02 - Feb. (2013 vol.24)
pp: 392-405
Davor Capalija , University of Toronto, Toronto
Tarek S. Abdelrahman , University of Toronto, Toronto
ABSTRACT
We explore the design, implementation, and evaluation of a coarse-grain superscalar processor in the context of the microarchitecture of the Control Processor (CP) of the Multilevel Computing Architecture (MLCA), a novel architecture targeted at multimedia multicore systems. The MLCA augments a traditional multicore architecture (called the lower level) with a CP (called the top level), which automatically extracts parallelism among coarse-grain units of computation (tasks), synchronizes these tasks, and schedules them for execution on processors. It does so in a fashion similar to how superscalar processors extract instruction-level parallelism, i.e., using register renaming, Out-of-Order Execution (OoOE), and scheduling. The coarse-grain nature of tasks imposes challenging constraints on the direct use of these techniques, but also offers opportunities for simpler designs. We analyze the impact of these constraints and opportunities and present novel microarchitectural mechanisms for coarse-grain superscalar execution, including register renaming, a task queue, dynamic out-of-order scheduling, and task issue. We design an MLCA system around our CP microarchitecture and implement it on an FPGA. We evaluate the system using multimedia applications and show good scalability up to eight processors, limited by the memory bandwidth of the FPGA platform. Furthermore, we show that the CP introduces little overhead in terms of resource usage. Finally, we show scalability beyond eight processors using cycle-accurate RTL simulation with an idealized memory subsystem. We demonstrate that the CP poses no performance bottlenecks and is scalable up to 32 processors.
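The idea of applying register renaming and out-of-order issue to whole tasks can be illustrated with a small software model. The sketch below is not the CP microarchitecture described in the paper; it is a minimal analogy under stated assumptions: an unlimited pool of physical registers, integer task durations, and oldest-first issue. The names `Task`, `rename`, and `schedule` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Task:
    """A coarse-grain unit of work (hypothetical model, not the paper's format)."""
    name: str
    reads: tuple        # architectural registers read by the task
    writes: tuple       # architectural registers written by the task
    duration: int = 1   # execution time in abstract time steps


def rename(tasks):
    """Give every write a fresh physical register so only true (RAW)
    dependences remain. Assumes an unlimited physical register pool."""
    fresh = count()
    mapping = {}      # architectural register -> current physical register
    live_in = set()   # physical registers holding values produced by no task
    renamed = []
    for t in tasks:
        for r in t.reads:
            if r not in mapping:          # value exists before the program starts
                mapping[r] = next(fresh)
                live_in.add(mapping[r])
        srcs = tuple(mapping[r] for r in t.reads)
        dsts = []
        for r in t.writes:
            mapping[r] = next(fresh)      # new version removes WAR/WAW hazards
            dsts.append(mapping[r])
        renamed.append((t, srcs, tuple(dsts)))
    return renamed, live_in


def schedule(tasks, num_procs):
    """Issue tasks out of program order as soon as their sources are ready
    and a processor is free. Returns {task name: start time}."""
    waiting, ready = rename(tasks)
    running, start, now = [], {}, 0
    while waiting or running:
        # Complete finished tasks and mark their results as available.
        for finish, dsts in running:
            if finish <= now:
                ready.update(dsts)
        running = [(f, d) for f, d in running if f > now]
        # Oldest-first issue among tasks whose source operands are ready.
        still_waiting = []
        for t, srcs, dsts in waiting:
            if len(running) < num_procs and all(s in ready for s in srcs):
                running.append((now + t.duration, dsts))
                start[t.name] = now
            else:
                still_waiting.append((t, srcs, dsts))
        waiting = still_waiting
        now += 1
    return start


program = [
    Task("A", reads=(), writes=("r1",), duration=2),
    Task("B", reads=("r1",), writes=("r2",)),
    Task("C", reads=(), writes=("r1",)),   # reuses r1: WAW with A, WAR with B
    Task("D", reads=("r1",), writes=("r3",)),
]
print(schedule(program, num_procs=2))
```

In this model, task C reuses `r1` but receives its own physical register, so with two processors it can start alongside A instead of waiting for B to read the old value. With a single processor the same program runs in program order.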
INDEX TERMS
Registers, Microarchitecture, Parallel processing, Throughput, Programming, Clocks, Dynamic scheduling, register renaming, Coarse-grain parallelism, task-level superscalar execution, out-of-order execution
CITATION
Davor Capalija, Tarek S. Abdelrahman, "Microarchitecture of a Coarse-Grain Out-of-Order Superscalar Processor", IEEE Transactions on Parallel & Distributed Systems, vol.24, no. 2, pp. 392-405, Feb. 2013, doi:10.1109/TPDS.2012.135