Distributed Data Cache Designs for Clustered VLIW Processors
October 2005 (vol. 54 no. 10)
pp. 1227-1241
Wire delays are a major concern for current and forthcoming processors. One approach to deal with this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the L1 data cache typically remains centralized in what we call partially distributed architectures. However, as technology evolves, the relative latency of such a centralized cache will increase, leading to an important impact on performance. In this paper, we propose partitioning the L1 data cache among clusters for clustered VLIW processors. We refer to this kind of design as fully distributed processors. In particular, we propose and evaluate three different configurations: a snoop-based cache coherence scheme, a word-interleaved cache, and flexible L0 buffers managed by the compiler. For each alternative, instruction scheduling techniques targeted to cyclic code are developed. Results for the Mediabench suite show that the performance of such fully distributed architectures is always better than the performance of a partially distributed one with the same amount of resources. In addition, the key aspects of each fully distributed configuration are explored.
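
As a rough illustration of the word-interleaved configuration, the sketch below assigns consecutive words to consecutive clusters in round-robin order. The abstract does not specify the mapping function, so the modulo mapping, the 4-byte word size, the four-cluster count, and the function names are all assumptions made here for illustration.

```python
# Illustrative sketch only: a simple round-robin word-interleaved mapping.
# WORD_BYTES, NUM_CLUSTERS, and both function names are hypothetical;
# the paper's actual mapping may differ.
WORD_BYTES = 4     # assumed word size in bytes
NUM_CLUSTERS = 4   # assumed number of clusters


def home_cluster(addr: int,
                 word_bytes: int = WORD_BYTES,
                 num_clusters: int = NUM_CLUSTERS) -> int:
    """Return the cluster whose cache module would hold the word at addr."""
    return (addr // word_bytes) % num_clusters


def is_local_access(addr: int, issuing_cluster: int) -> bool:
    """True if a memory operation issued in issuing_cluster hits its own module."""
    return home_cluster(addr) == issuing_cluster


if __name__ == "__main__":
    # Unit-stride word accesses rotate through all clusters.
    print([home_cluster(a) for a in range(0, 32, 4)])
    # A stride of WORD_BYTES * NUM_CLUSTERS always maps to the same cluster.
    print([home_cluster(a) for a in range(0, 64, 16)])
```

Under such a mapping, a scheduler would prefer to place each memory instruction in the cluster that owns the words it touches, which is one reason the access stride matters when scheduling cyclic code.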

Index Terms:
Single data stream architectures, design styles.
Citation:
Enric Gibert, Jesús Sánchez, Antonio González, "Distributed Data Cache Designs for Clustered VLIW Processors," IEEE Transactions on Computers, vol. 54, no. 10, pp. 1227-1241, Oct. 2005, doi:10.1109/TC.2005.163