Issue No.04 - April (2008 vol.19)
pp: 529-544
ABSTRACT
The growth in complexity of a wide-issue processor with its pipeline width is one of the primary concerns of processor designers. In the conventional design, hardware in the processor core is laid out to handle multiple instructions with two source operands in each pipeline stage. However, analysis of SPEC2000 programs reveals that an integer program consists on average of 25.2% two-op integer instructions (both source registers in use) and 72.5% one-op/zero-op integer instructions. Floating-point (FP) programs consist on average of 15.8% two-op integer instructions and 44.1% one-op/zero-op integer instructions. The analysis shows that hardware laid out for worst-case requirements in the integer pipeline is highly under-utilized for a significant portion of the time. To alleviate these complexity issues, we propose the split pipeline architecture, a novel technique that distinguishes and processes instructions based on their source operand requirements. The conventional pipeline is split into two after the decode stage, and the two pipelines converge again at the execution stage. This makes it possible to process instructions at a higher clock rate and at almost the same IPC as a conventional processor. Various flavors of the proposed architecture are simulated and analyzed in this paper, with a circuit-level analysis to determine the impact on the critical path delays. Results show that a processor that can fetch, decode, and commit eight instructions each cycle, with split pipelines of two two-source integer instructions and six zero/one-source integer instructions, can achieve a clock rate that is 15.8% faster than an 8-wide conventional processor while losing only 0.7% of IPC throughput for SPEC2000 benchmarks. Similarly, a 4-wide processor with split pipelines of one two-source integer instruction and three zero/one-source integer instructions can achieve a clock rate that is 19.69% faster than a 4-wide conventional processor while losing only 1.9% of IPC throughput.
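The abstract reports clock-rate gains and IPC losses separately. As a rough illustration of how the two combine, the sketch below assumes that performance scales as IPC times clock frequency for a fixed instruction count, and that "X% faster clock rate" means a frequency ratio of 1 + X/100. The function name `net_speedup` and the `CONFIGS` table are hypothetical helpers introduced here; the combined figures are derived from the abstract's numbers and are not results stated by the authors.

```python
def net_speedup(clock_gain_pct: float, ipc_loss_pct: float) -> float:
    """Relative performance versus the conventional baseline.

    Assumption (not from the paper): performance ~ IPC x clock frequency,
    with the dynamic instruction count unchanged.
    """
    clock_ratio = 1.0 + clock_gain_pct / 100.0  # split-pipeline clock / baseline clock
    ipc_ratio = 1.0 - ipc_loss_pct / 100.0      # split-pipeline IPC / baseline IPC
    return clock_ratio * ipc_ratio


# (clock-rate gain %, IPC loss %) as quoted in the abstract.
CONFIGS = {
    "8-wide, 2 two-source + 6 zero/one-source": (15.8, 0.7),
    "4-wide, 1 two-source + 3 zero/one-source": (19.69, 1.9),
}

if __name__ == "__main__":
    for name, (gain, loss) in CONFIGS.items():
        speedup = net_speedup(gain, loss)
        print(f"{name}: about {(speedup - 1.0) * 100.0:.1f}% net speedup")
```

Under these assumptions the 8-wide configuration comes out at roughly 15.0% net speedup and the 4-wide configuration at roughly 17.4%, which suggests the small IPC loss takes only a small share of the clock-rate gain.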
CITATION
Rama Sangireddy, Jatan Shah, "Operand-Load-Based Split Pipeline Architecture for High Clock Rate and Commensurable IPC", IEEE Transactions on Parallel & Distributed Systems, vol.19, no. 4, pp. 529-544, April 2008, doi:10.1109/TPDS.2007.70742