Issue No. 02, February 2008 (vol. 57)
pp. 200-214
ABSTRACT
Looping operations are a significant bottleneck to achieving better computational efficiency in embedded applications. In this paper, a novel zero-overhead loop controller (ZOLC) supporting arbitrary loop structures with multiple-entry and multiple-exit nodes is described and used to enhance embedded RISC processors. A graph formalism is introduced for representing the loop structure of application programs, which can assist in ZOLC code synthesis. A portable description of the ZOLC component is also given in detail; it can be used directly for RTL synthesis. This description is designed to be easily retargetable to single-issue RISC processors, requiring only minimal effort. The ZOLC unit has been incorporated into different RISC processor models and research ASIPs at different abstraction levels (RTL VHDL and ArchC), providing an effective means for low-overhead looping without a negative impact on the processor cycle time. Average performance improvements of 25.5% and 44% are feasible for a set of kernel benchmarks on an embedded RISC and an application-specific processor, respectively. A corresponding 10% speedup is achieved on the same RISC for a subset of MiBench applications, which do not necessarily feature the examined performance-critical kernels.
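The abstract does not describe the controller's internals, so the following is only a rough Python sketch of the general zero-overhead looping idea, not the authors' ZOLC design. It assumes a table of per-loop descriptors (start address, end address, index bounds, step) and a next-PC function that, when fetch reaches a loop's end address, updates the loop index and redirects the PC in "hardware" instead of executing increment, compare, and branch instructions. The names `LoopDesc`, `zolc_next_pc`, and `run_nest` and all descriptor fields are hypothetical. The sketch handles only properly nested, single-entry/single-exit loops that execute at least once, with a positive step; it does not cover the multiple-entry/multiple-exit structures the paper's ZOLC supports.

```python
from dataclasses import dataclass


@dataclass
class LoopDesc:
    """Hypothetical loop descriptor (one table entry per loop)."""
    start_pc: int   # address of the first instruction of the loop body
    end_pc: int     # address of the last instruction of the loop body
    initial: int    # initial index value
    final: int      # final index value (inclusive)
    step: int = 1   # index increment (assumed positive)
    index: int = 0  # current index, readable by the loop body

    def __post_init__(self) -> None:
        self.index = self.initial


def zolc_next_pc(loops: list[LoopDesc], pc: int) -> int:
    """Select the next PC without executing loop-overhead instructions.

    `loops` is ordered innermost first, so loops that share an end
    address are resolved inner-to-outer (an exit cascade).
    Assumes every loop body executes at least once.
    """
    for loop in loops:
        if loop.end_pc != pc:
            continue
        if loop.index + loop.step <= loop.final:
            loop.index += loop.step
            return loop.start_pc      # take the back-edge
        loop.index = loop.initial     # loop finished: re-arm for next entry
        # fall through: an enclosing loop may end at the same address
    return pc + 1                     # ordinary sequential fetch


def run_nest(n_outer: int, n_inner: int) -> tuple[int, int, int]:
    """Toy program: pc 0 = outer-loop prologue, pc 1 = inner-loop body,
    pc 2 = halt. Returns (outer iterations, inner iterations, sum of i*j)."""
    inner = LoopDesc(start_pc=1, end_pc=1, initial=0, final=n_inner - 1)
    outer = LoopDesc(start_pc=0, end_pc=1, initial=0, final=n_outer - 1)
    loops = [inner, outer]
    outer_count = inner_count = acc = 0
    pc = 0
    while pc != 2:
        if pc == 0:
            outer_count += 1
        elif pc == 1:
            inner_count += 1
            acc += outer.index * inner.index  # body reads i and j directly
        pc = zolc_next_pc(loops, pc)
    return outer_count, inner_count, acc


print(run_nest(3, 4))  # (3, 12, 18)
```

In this toy model the loop body contains no counter updates or branches at all; in a conventional software loop, each iteration would typically also execute an index update and a conditional branch, which is the kind of overhead the abstract refers to.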
INDEX TERMS
Microprocessors, Control design, Pipeline processors, Optimization, Real-time and embedded systems, Hardware description languages
CITATION
Nikolaos Kavvadias, Spiridon Nikolaidis, "Elimination of Overhead Operations in Complex Loop Structures for Embedded Microprocessors", IEEE Transactions on Computers, vol. 57, no. 2, pp. 200-214, February 2008, doi:10.1109/TC.2007.70790