Cost-Conscious Strategies to Increase Performance of Numerical Programs on Aggressive VLIW Architectures
October 2001 (vol. 50 no. 10)
pp. 1033-1051

Abstract: Loops are the main time-consuming part of numerical applications. The performance of a loop is limited either by the resources offered by the architecture or by recurrences in the computation. To execute more operations per cycle, current processors are designed with growing degrees of resource replication (the replication technique) for memory ports and functional units. However, the high cost of this technique in area and cycle time precludes high degrees of replication: a longer cycle time may offset any gain in the number of execution cycles, and a larger area may lead to an unimplementable configuration. An alternative to resource replication is resource widening (the widening technique), used in some recent designs, in which the width of the resources is increased so that a single operation is performed over multiple data. Moreover, several general-purpose superscalar microprocessors have been implemented with fused multiply-add floating-point units (the fusion technique), which reduce the latency of the combined operation and the number of resources used. In this paper, we evaluate a broad set of VLIW processor design alternatives that combine the three techniques. We perform a technological projection for the next processor generations in order to foresee which alternatives may be implementable. From this study, we conclude that, when cost is taken into account, combining certain degrees of replication and widening in the hardware resources is more effective than applying replication alone. We also confirm that fused multiply-add units will have a significant impact in raising the performance of future processor architectures with a reasonable increase in cost.
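The resource limit mentioned in the abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not the paper's evaluation method; it is a minimal illustration under strong assumptions: every operation can be packed into wide operations, recurrences and register pressure are ignored, and fractional cycle counts are reachable (for example through unrolling). The function name `cycles_per_iteration`, its parameters, and the example loop are hypothetical.

```python
def cycles_per_iteration(mem_ops, fmul, fadd, ports, fpus, width=1, fused=False):
    """Resource-bound lower limit on cycles per source-loop iteration.

    Idealized model (an assumption, not the paper's): no recurrences,
    perfect packing into wide operations, fractional bounds allowed.
    """
    # Fusion: each multiply pairs with one add, so the pair costs one issue slot.
    fp_ops = max(fmul, fadd) if fused else fmul + fadd
    # Replication divides the work among more ports and units.
    bound = max(mem_ops / ports, fp_ops / fpus)
    # Widening: one wide operation covers `width` source iterations.
    return bound / width


# Hypothetical loop body y[i] = a * x[i] + y[i]:
# 2 loads + 1 store, 1 multiply, 1 add per iteration.
base = cycles_per_iteration(3, 1, 1, ports=1, fpus=1)
replicated = cycles_per_iteration(3, 1, 1, ports=2, fpus=2)
widened = cycles_per_iteration(3, 1, 1, ports=1, fpus=1, width=2)

# Fusion helps only when floating-point units are the bottleneck.
unfused_fp_bound = cycles_per_iteration(3, 1, 1, ports=2, fpus=1)
fused_fp_bound = cycles_per_iteration(3, 1, 1, ports=2, fpus=1, fused=True)

print(base, replicated, widened, unfused_fp_bound, fused_fp_bound)
```

In this idealized model, doubling the number of resources and doubling their width give the same cycle bound, so the model cannot tell the two apart. The difference lies in area and cycle time, which this sketch does not model.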

Index Terms:
VLIW processors, instruction level parallelism, software pipelining, numerical applications, performance/cost trade-off.
David López, Josep Llosa, Mateo Valero, Eduard Ayguadé, "Cost-Conscious Strategies to Increase Performance of Numerical Programs on Aggressive VLIW Architectures," IEEE Transactions on Computers, vol. 50, no. 10, pp. 1033-1051, Oct. 2001, doi:10.1109/12.956090