Clustered Loop Buffer Organization for Low Energy VLIW Embedded Processors
June 2005 (vol. 54 no. 6)
pp. 672-683
Current loop buffer organizations for very long instruction word (VLIW) processors are essentially centralized. As a consequence, they are energy inefficient and their scalability is limited. To alleviate this problem, we propose a clustered loop buffer organization, in which the loop buffers are partitioned and functional units are logically grouped to form clusters, along with two buffer control schemes that regulate the activity in each cluster. Furthermore, we propose a design-time scheme that generates clusters by analyzing an application profile and grouping closely related functional units. The simulation results indicate that the energy consumed in the clustered loop buffers is, on average, 63 percent lower than the energy consumed in an uncompressed centralized loop buffer scheme, 35 percent lower than in a centralized compressed loop buffer scheme, and 22 percent lower than in a randomly clustered loop buffer scheme.
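The abstract does not spell out the clustering algorithm, so the following is only an illustrative sketch of what profile-driven grouping of functional units might look like, not the authors' method. It assumes the profile is a list of per-cycle sets of active functional units, treats "closely related" as high Jaccard similarity of activity, and merges clusters greedily by average similarity. The names `coactivity`, `cluster_units`, and `n_clusters` are hypothetical.

```python
from itertools import combinations


def coactivity(profile, units):
    """Jaccard similarity of activity for every pair of functional units.

    profile: list of sets, each holding the units active in one cycle
             (an assumed profile format, not taken from the paper).
    """
    sim = {}
    for a, b in combinations(units, 2):
        both = sum(1 for active in profile if a in active and b in active)
        either = sum(1 for active in profile if a in active or b in active)
        sim[frozenset((a, b))] = both / either if either else 0.0
    return sim


def cluster_units(profile, units, n_clusters):
    """Greedily merge the most similar clusters until n_clusters remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    sim = coactivity(profile, units)
    clusters = [frozenset([u]) for u in units]

    def linkage(c1, c2):
        # Average pairwise similarity between two clusters.
        pairs = [sim[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > n_clusters:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)
```

Called with a profile in which two ALUs are usually active together and a multiplier is usually active with a load/store unit, this sketch places each pair in its own cluster, so a loop that exercises only one pair would need to activate only that cluster's buffer.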

Index Terms:
RISC/CISC, VLIW architectures, real-time and embedded systems, memory management, memory design, low-power design.
Murali Jayapala, Francisco Barat, Tom Vander Aa, Francky Catthoor, Henk Corporaal, Geert Deconinck, "Clustered Loop Buffer Organization for Low Energy VLIW Embedded Processors," IEEE Transactions on Computers, vol. 54, no. 6, pp. 672-683, June 2005, doi:10.1109/TC.2005.92