Custom Wide Counterflow Pipelines for High-Performance Embedded Applications
February 2004 (vol. 53 no. 2)
pp. 141-158
Abstract: Application-specific instruction set processor (ASIP) design is a promising technique to meet the performance and cost goals of high-performance systems. ASIPs are especially valuable for embedded computing applications (e.g., digital cameras, color printers, and cellular phones), where a small increase in performance and decrease in cost can have a large impact on a product's viability. Sutherland, Sproull, and Molnar originally proposed a processor organization called the counterflow pipeline (CFP) as a general-purpose architecture. We observed that the CFP is appropriate for ASIP design due to its simple and regular structure, local control and communication, and high degree of modularity. This paper describes a new CFP architecture, called the wide counterflow pipeline (WCFP), that extends the original proposal to be better suited for custom embedded instruction-level parallel processors. This work presents a novel and practical application of the CFP to automatic and quick-turnaround design of ASIPs. The paper introduces the WCFP architecture and describes several microarchitecture capabilities needed to get good performance from custom WCFPs. We demonstrate that custom WCFPs have performance that is up to four times better than that of ASIPs based on the CFP. Using an analytic cost model, we show that custom WCFPs do not unduly increase the cost of the original counterflow pipeline architecture, yet they retain the simplicity of the CFP. We also compare custom WCFPs to custom VLIW architectures and demonstrate that the WCFP is performance-competitive with traditional VLIWs without requiring complicated global interconnection of functional devices.
Index Terms:
Counterflow pipelines, application-specific processors, automatic architectural synthesis.
Citation:
Bruce R. Childers, Jack W. Davidson, "Custom Wide Counterflow Pipelines for High-Performance Embedded Applications," IEEE Transactions on Computers, vol. 53, no. 2, pp. 141-158, Feb. 2004, doi:10.1109/TC.2004.1261825