Scheduled Dataflow: Execution Paradigm, Architecture, and Performance Evaluation
August 2001 (vol. 50 no. 8)
pp. 834-846

Abstract—In this paper, the Scheduled Dataflow (SDF) architecture—a decoupled memory/execution, multithreaded architecture using nonblocking threads—is presented in detail and evaluated against a superscalar architecture. Recent focus in the field of new processor architectures is mainly on VLIW (e.g., IA-64), superscalar, and superspeculative designs. This trend allows for better performance, but at the expense of increased hardware complexity and, possibly, higher power expenditures resulting from dynamic instruction scheduling. Our research deviates from this trend by exploring a simpler, yet powerful execution paradigm that is based on dataflow and multithreading. A program is partitioned into nonblocking execution threads. In addition, all memory accesses are decoupled from the thread's execution: data is preloaded into the thread's context (registers) and all results are poststored after the completion of the thread's execution. While multithreading and decoupling are possible with control-flow architectures, SDF makes it easier to coordinate the memory accesses and execution of a thread, as well as to eliminate unnecessary dependencies among instructions. We have compared the execution cycles required for programs on SDF with the execution cycles required by programs on SimpleScalar (a superscalar simulator), considering the essential aspects of these architectures in order to have a fair comparison. The results show that the SDF architecture can outperform the superscalar. SDF performance scales better with the number of functional units and allows for good exploitation of Thread Level Parallelism (TLP) and available chip area.
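The decoupled execution model described in the abstract—each nonblocking thread split into a preload phase (memory to registers), a pure execute phase (registers only), and a poststore phase (registers to memory)—can be illustrated with a small sketch. This is a hypothetical toy model, not the authors' simulator: the `Thread` class, the queue names, and the single-step scheduler are all illustrative assumptions. It shows only the key property that the execute phase never touches memory.

```python
# Toy sketch (NOT the authors' SDF simulator) of the preload/execute/poststore
# decoupling described in the abstract. Names and structure are assumptions.
from collections import deque

MEMORY = {"a": 2, "b": 3, "result": None}

class Thread:
    def __init__(self, name, loads, compute, stores):
        self.name = name
        self.loads = loads        # memory locations to preload into registers
        self.compute = compute    # pure function over the register context
        self.stores = stores      # register -> memory location map for poststore
        self.registers = {}

def run(threads):
    # In SDF terms: a synchronization processor performs all memory accesses
    # (preload and poststore), while the execution processor runs a thread only
    # after its full context is loaded, so execution never blocks on memory.
    preload_q = deque(threads)
    execute_q = deque()
    poststore_q = deque()
    while preload_q or execute_q or poststore_q:
        if preload_q:
            t = preload_q.popleft()
            t.registers = {k: MEMORY[k] for k in t.loads}   # preload phase
            execute_q.append(t)
        if execute_q:
            t = execute_q.popleft()
            t.registers.update(t.compute(t.registers))      # execute: no memory ops
            poststore_q.append(t)
        if poststore_q:
            t = poststore_q.popleft()
            for reg, addr in t.stores.items():              # poststore phase
                MEMORY[addr] = t.registers[reg]

run([Thread("t0", ["a", "b"],
            lambda r: {"sum": r["a"] + r["b"]},
            {"sum": "result"})])
print(MEMORY["result"])  # 5
```

Because a thread is nonblocking, it reaches the execute queue only with its complete register context, which is what lets SDF schedule execution without the dynamic dependency tracking a superscalar pipeline needs.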

[1] A. Agarwal, R. Bianchini, D. Chaiken, K. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung, “The MIT Alewife Architecture and Performance,” Proc. 22nd Int'l Symp. Computer Architecture, pp. 2-13, June 1995.
[2] A. Agarwal, J. Kubiatowicz, D. Kranz, B.-H. Lim, D. Yeung, G. D'Souza, and M. Parkin, “Sparcle: An Evolutionary Processor Design for Multiprocessors,” IEEE Micro, pp. 48-61, June 1993.
[3] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, “The TERA Computer System,” Proc. 1990 Int'l Conf. Supercomputing, pp. 1-6, July 1990.
[4] B.S. Ang, Arvind, and D. Chiou, “StarT, the Next Generation: Integrating Global Caches and Dataflow Architecture,” Technical Report 354, Laboratory for Computer Science, Massachusetts Inst. of Technology, Aug. 1994.
[5] Arvind and K.S. Pingali, “A Dataflow Architecture with Tagged Tokens,” Technical Memo 174, Laboratory for Computer Science, Massachusetts Inst. of Technology, Sept. 1980.
[6] Arvind, R.S. Nikhil, and K.S. Pingali, “I-Structures: Data Structures for Parallel Computing,” ACM Trans. Programming Languages and Systems, vol. 11, no. 4, pp. 598-632, Oct. 1989.
[7] Arvind and R.S. Nikhil, “Executing a Program on the MIT Tagged-Token Dataflow Architecture,” IEEE Trans. Computers, vol. 39, no. 3, pp. 300-318, Mar. 1990.
[8] R.D. Blumofe, C.F. Joerg, B.C. Kuszmaul, C.E. Leiserson, K.H. Randall, and Y. Zhou, “Cilk: An Efficient Multithreaded Runtime System,” Proc. Fifth ACM Symp. Principles and Practice of Parallel Programming (PPoPP), pp. 206-215, July 1995.
[9] A.D.W. Bohm, D.C. Cann, J.T. Feo, and R.R. Oldehoeft, “SISAL Reference Manual: Language Version 2.0,” Technical Report CS91-118, Computer Science Dept., Colorado State Univ., 1991.
[10] D. Burger and T.M. Austin, “The SimpleScalar Tool Set Version 2.0,” Technical Report #1342, Dept. of Computer Science, Univ. of Wisconsin, Madison, 1997.
[11] M. Butler et al., “Single Instruction Stream Parallelism Is Greater than Two,” Proc. 18th Int'l Symp. Computer Architecture (ISCA-18), pp. 276-286, May 1991.
[12] D.E. Culler and G.M. Papadopoulos, “The Explicit Token Store,” J. Parallel and Distributed Computing, vol. 10, no. 4, pp. 289-308, 1990.
[13] R. Cytron, J. Ferrante, B.K. Rosen, M.N. Wegman, and F.K. Zadeck, “Efficiently Computing Static Single Assignment Form and the Control Dependence Graph,” ACM Trans. Programming Languages and Systems, vol. 13, no. 4, pp. 451-490, Oct. 1991.
[14] J.B. Dennis, “Dataflow Supercomputers,” Computer, pp. 48-56, Nov. 1980.
[15] M. Edahiro, S. Matsushita, M. Yamashina, and N. Nishi, “Single-Chip Multiprocessor for Smart Terminals,” IEEE Micro, vol. 20, no. 4, pp. 12-20, July 2000.
[16] R. Govindarajan, S.S. Nemawarkar, and P. LeNir, “Design and Performance Evaluation of a Multithreaded Architecture,” Proc. First High Performance Computer Architecture (HPCA-1), pp. 298-307, Jan. 1995.
[17] W. Grunewald and T. Ungerer, “A Multithreaded Processor Design for Distributed Shared Memory System,” Proc. Int'l Conf. Advances in Parallel and Distributed Computing, pp. 206-213, Mar. 1997.
[18] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, second ed. Morgan Kaufmann, 1996.
[19] H.H.-J. Hum et al., “A Design Study of the EARTH Multiprocessor,” Proc. Conf. Parallel Architectures and Compilation Techniques (PACT), pp. 59-68, June 1995.
[20] R.A. Iannucci, “Toward a Dataflow/Von Neumann Hybrid Architecture,” Proc. 15th Int'l Symp. Computer Architecture (ISCA-15), pp. 131-140, May 1988.
[21] K.M. Kavi, H.S. Kim, and A.R. Hurson, “Scheduled Dataflow Architecture: A Synchronous Execution Paradigm for Dataflow,” IASTED J. Computers and Applications, vol. 21, no. 3, pp. 114-124, Oct. 1999.
[22] K.M. Kavi, J. Arul, and R. Giorgi, “Execution and Cache Performance of the Scheduled Dataflow Architecture,” J. Universal Computer Science, special issue on multithreaded and chip multiprocessors, vol. 6, no. 10, pp. 948-967, Oct. 2000.
[23] V. Krishnan and J. Torrellas, “Chip-Multiprocessor Architecture with Speculative Multithreading,” IEEE Trans. Computers, vol. 48, no. 9, pp. 866-880, Sept. 1999.
[24] M. Lam and R.P. Wilson, “Limits of Control Flow on Parallelism,” Proc. 19th Int'l Symp. Computer Architecture (ISCA-19), pp. 46-57, May 1992.
[25] J.L. Lo et al., “Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading,” ACM Trans. Computer Systems, pp. 322-354, Aug. 1997.
[26] N. Mitchell, L. Carter, J. Ferrante, and D. Tullsen, “Instruction Level Parallelism vs. Thread Level Parallelism on Simultaneous Multi-Threading Processors,” Proc. Supercomputing '99, 1999.
[27] S. Onder and R. Gupta, “Superscalar Execution with Direct Data Forwarding,” Proc. Int'l Conf. Parallel Architectures and Compilation Techniques (PACT-98), pp. 130-135, Oct. 1998.
[28] G.M. Papadopoulos, “Implementation of a General Purpose Dataflow Multiprocessor,” Technical Report TR-432, Laboratory for Computer Science, Massachusetts Inst. of Technology, Aug. 1988.
[29] G.M. Papadopoulos and D.E. Culler, “Monsoon: An Explicit Token-Store Architecture,” Proc. 17th Int'l Symp. Computer Architecture (ISCA-17), pp. 82-91, May 1990.
[30] G.M. Papadopoulos and K.R. Traub, “Multithreading: A Revisionist View of Dataflow Architectures,” Proc. 18th Int'l Symp. Computer Architecture (ISCA-18), pp. 342-351, June 1991.
[31] S. Sakai et al., “Super-Threading: Architectural and Software Mechanisms for Optimizing Parallel Computations,” Proc. 1993 Int'l Conf. Supercomputing, pp. 251-260, July 1993.
[32] B. Shankar, L. Roh, W. Bohm, and W. Najjar, “Control of Parallelism in Multithreaded Code,” Proc. Int'l Conf. Parallel Architectures and Compilation Techniques (PACT-95), pp. 131-139, June 1995.
[33] B. Shankar and L. Roh, “MIDC Language Manual,” technical report, CS Dept., Colorado State Univ., July 1996.
[34] J.E. Smith, “Decoupled Access/Execute Computer Architectures,” Proc. Ninth Ann. Symp. Computer Architecture, pp. 112-119, May 1982.
[35] M. Takesue, “A Unified Resource Management and Execution Control Mechanism for Dataflow Machines,” Proc. 14th Int'l Symp. Computer Architecture (ISCA-14), pp. 90-97, June 1987.
[36] H. Terada, S. Miyata, and M. Iwata, “DDMP's: Self-Timed Super-Pipelined Data-Driven Multimedia Processor,” Proc. IEEE, pp. 282-296, Feb. 1999.
[37] R. Thekkath and S.J. Eggers, “The Effectiveness of Multiple Hardware Contexts,” Proc. Sixth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, pp. 328-337, Oct. 1994.
[38] S.A. Thoreson and A.N. Long, “A Feasibility Study of a Memory Hierarchy in Data Flow Environment,” Proc. Int'l Conf. Parallel Processing, pp. 356-360, June 1987.
[39] M. Tokoro, J.R. Jagannathan, and H. Sunahara, “On the Working Set Concept for Data-Flow Machines,” Proc. 10th Ann. Symp. Computer Architecture (ISCA-10), pp. 90-97, July 1983.
[40] J.Y. Tsai, J. Huang, C. Amlo, D. Lilja, and P.C. Yew, “The Superthreaded Processor Architecture,” IEEE Trans. Computers, vol. 48, no. 9, pp. 881-902, Sept. 1999.
[41] D.M. Tullsen et al., “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” Proc. 22nd Int'l Symp. Computer Architecture, pp. 392-403, 1995.
[42] D.W. Wall, “Limits on Instruction-Level Parallelism,” Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-4), pp. 176-188, Apr. 1991.
[43] K. Wilcox and S. Manne, “Alpha Processors: A History of Power Issue and a Look at the Future,” Cool Chips Tutorial in Conjunction with MICRO-32, Dec. 1999.
[44] V. Milutinovic, Microprocessor and Multiprocessor Systems. Wiley Int'l, 2000.

Index Terms:
Multithreaded architectures, dataflow architectures, superscalar, decoupled architectures, Thread Level Parallelism.
K.M. Kavi, R. Giorgi, J. Arul, "Scheduled Dataflow: Execution Paradigm, Architecture, and Performance Evaluation," IEEE Transactions on Computers, vol. 50, no. 8, pp. 834-846, Aug. 2001, doi:10.1109/12.947003