Evaluating Performance Tradeoffs Between Fine-Grained and Coarse-Grained Alternatives
January 1995 (vol. 6 no. 1)
pp. 17-27

Abstract—Recent simulation-based studies suggest that while superpipelines and superscalars are equally capable of exploiting fine-grained concurrency, multiprocessors are better at exploiting coarse-grained parallelism. An analytical model that is more flexible, and less costly in run time, than simulation is proposed as a tool for analyzing the tradeoffs among superpipelined processors, superscalar processors, and multiprocessors. The duality of superpipelines and superscalars is examined in detail. A performance limit for these systems is derived, and it supports the fetch-bottleneck observation of previous researchers. Common characteristics of the utilization curves for such systems are examined. Combined systems, such as superpipelined multiprocessors and superscalar multiprocessors, are also analyzed.
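The duality and fetch-bottleneck claims above can be illustrated with a deliberately simplified sketch. The function names and the notion of a single scalar "fetch rate" are illustrative assumptions for this sketch, not the paper's actual model:

```python
# Toy illustration of superpipeline/superscalar duality and the fetch
# bottleneck. All names and parameters are illustrative assumptions.

def peak_issue_rate(issue_width, pipeline_boost):
    # A superscalar of degree m (issue_width=m, pipeline_boost=1) and a
    # superpipeline of degree n (issue_width=1, pipeline_boost=n) have
    # the same peak rate in instructions per base machine cycle: this
    # is the duality of the two organizations.
    return issue_width * pipeline_boost

def sustained_rate(issue_width, pipeline_boost, fetch_rate):
    # The fetch stage caps sustained throughput: however wide or deeply
    # pipelined the execution core, instructions cannot complete faster
    # than they are fetched.
    return min(peak_issue_rate(issue_width, pipeline_boost), fetch_rate)
```

In this sketch a 4-issue superscalar and a 4x superpipeline have identical peaks, and both are held to the same sustained rate once fetch bandwidth is the binding constraint.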

The model shows that, as memory access time increases, the number of pipelines (or processors) at which maximum throughput is obtained becomes increasingly sensitive to the ratio of memory access time to network access delay. Further, optimum throughput is shown to vary nonlinearly with interiteration dependence distance, whereas the corresponding optimum number of processors varies linearly. The predictions of the analytical model agree with published simulation-based results.
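The linear relationship between dependence distance and optimum processor count can be seen in a toy DOACROSS-style model. Everything below is an assumption for illustration (per-iteration compute time, a flat synchronization cost, and the `min(p, d)` concurrency bound); it omits the memory- and network-delay terms that produce the nonlinear throughput behavior reported above:

```python
# Toy DOACROSS model: iterations i and i + d are dependent, so at most
# d iterations can be in flight at once regardless of processor count.
# Parameters are illustrative assumptions, not the paper's equations.

def doacross_throughput(p, d, t_comp, t_sync):
    # Iterations completed per unit time with p processors, dependence
    # distance d, compute time t_comp per iteration, and a cross-
    # processor synchronization delay t_sync (paid only when p > 1).
    concurrency = min(p, d)
    per_iter = t_comp + (t_sync if p > 1 else 0.0)
    return concurrency / per_iter

def optimum_processors(d, t_comp, t_sync, p_max=64):
    # Smallest processor count that achieves the peak throughput:
    # rank by throughput, break ties in favor of fewer processors.
    return max(range(1, p_max + 1),
               key=lambda p: (doacross_throughput(p, d, t_comp, t_sync), -p))
```

Under these assumptions the optimum processor count simply tracks the dependence distance d (doubling d doubles the optimum), and adding processors beyond d buys nothing.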

Index Terms—Fine-grain parallelism, coarse-grain parallelism, superscalar, superpipelined, multiprocessor, performance.

[1] R.D. Acosta et al., "An Instruction Issuing Approach to Enhancing Performance in Multiple Functional Unit Processors," IEEE Trans. Computers, Vol. C-35, No. 9, Sept. 1986, pp. 815-828.
[2] D.W. Anderson, F.J. Sparacio, and R.M. Tomasulo, "The IBM System/360 Model 91: Machine Philosophy and Instruction Handling," IBM J. Research and Development, Vol. 11, Jan. 1967, pp. 8-24.
[3] M. Butler et al., "Single Instruction Stream Parallelism Is Greater than Two," Proc. 18th Int'l Symp. Computer Architecture, May 1991, pp. 276-286.
[4] R. Cytron, "Doacross: Beyond Vectorization for Multiprocessors," Proc. 1986 Int'l Conf. Parallel Processing, 1986, pp. 836-844.
[5] P.K. Dubey and M.J. Flynn, "Optimal Pipelining," J. Parallel and Distributed Computing, Vol. 8, No. 1, Jan. 1990, pp. 10-19.
[6] P.K. Dubey, G.B. Adams III, and M.J. Flynn, "Spectrum of Choices: Superpipelined, Superscalar, or Multiprocessor?" Proc. Third IEEE Symp. Parallel and Distributed Processing, Dec. 1991, pp. 233-240.
[7] M.J. Flynn, "Some Computer Organizations and Their Effectiveness," IEEE Trans. Computers, Vol. C-21, No. 9, Sept. 1972, pp. 948-960.
[8] T.R. Gross and J. Hennessy, "Optimizing Delayed Branches," Proc. 15th Workshop on Microprogramming, 1982.
[9] N.P. Jouppi and D.W. Wall, "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines," Proc. Third Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), Apr. 1989, pp. 272-282.
[10] D. Kuck, Y. Muraoka, and S. Chen, "On the Number of Operations Simultaneously Executable in FORTRAN-Like Programs and Their Resulting Speedup," IEEE Trans. Computers, Vol. C-21, Dec. 1972, pp. 1293-1310.
[11] M.S. Lam and R.P. Wilson, "Limits of Control Flow on Parallelism," Proc. 19th Int'l Symp. Computer Architecture, May 1992, pp. 46-57.
[12] D.J. Lilja and P.C. Yew, "The Performance Potential of Fine-Grain and Coarse-Grain Parallel Architectures," Proc. 24th Hawaii Int'l Conf. System Sciences, Vol. 1, Architecture, Jan. 1991, pp. 324-333.
[13] A. Nicolau and J. Fisher, "Measuring the Parallelism Available for Very Long Instruction Word Architectures," IEEE Trans. Computers, Vol. C-33, Nov. 1984, pp. 968-976.
[14] A. Pleszkun and G.S. Sohi, "The Performance Potential of Multiple Functional Unit Processors," Proc. 15th Int'l Symp. Computer Architecture, June 1988, pp. 37-44.
[15] C.D. Polychronopoulos, "On Program Restructuring, Scheduling and Communication for Parallel Processor Systems," PhD dissertation, Dept. of Computer Science, Univ. of Illinois, Aug. 1986.
[16] B.R. Rau and C.D. Glaeser, "Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing," Proc. 14th Ann. Workshop Microprogramming, Oct. 1981, pp. 183-198.
[17] M.D. Smith, M. Johnson, and M. Horowitz, "Limits on Multiple Instruction Issue," Proc. Third Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), Apr. 1989, pp. 290-302.
[18] G.S. Tjaden and M.J. Flynn, "Detection and Parallel Execution of Independent Instructions," IEEE Trans. Computers, Vol. C-19, Oct. 1970, pp. 889-895.
[19] D.W. Wall, "Limits of Instruction-Level Parallelism," Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), Apr. 1991, pp. 176-188.

Pradeep K. Dubey, George B. Adams, Michael J. Flynn, "Evaluating Performance Tradeoffs Between Fine-Grained and Coarse-Grained Alternatives," IEEE Transactions on Parallel and Distributed Systems, vol. 6, no. 1, pp. 17-27, Jan. 1995, doi:10.1109/71.363414