SimpleFit: A Framework for Analyzing Design Trade-Offs in Raw Architectures
July 2001 (vol. 12 no. 7)
pp. 730-742

Abstract: The semiconductor industry roadmap projects that advances in VLSI technology will permit more than one billion transistors on a chip by the year 2010. The MIT Raw microprocessor is a proposed architecture that strives to exploit these chip-level resources by implementing thousands of tiles, each comprising a processing element and a small amount of memory, coupled by a static two-dimensional interconnect. A compiler partitions fine-grain instruction-level parallelism across the tiles and statically schedules intertile communication over the interconnect. Because Raw microprocessors fully expose their internal hardware structure to the software, they can be viewed as a gigantic FPGA with coarse-grained tiles in which software orchestrates communication over static interconnections. One open challenge in Raw architectures is to determine their optimal grain size and balance. The grain size is the area of each tile, and the balance is the proportion of area in each tile devoted to memory, processing, communication, and off-chip global I/O. If the total chip area is fixed, higher processing power per tile requires larger tiles and hence reduces the total number of tiles on the chip. This paper presents SimpleFit, a novel analytical framework that designers can use to reason about the design space of Raw microprocessors. Our model is also generalizable to multiprocessors on a chip. Based on an architectural model, an application model, and a VLSI cost analysis, the framework computes the performance of applications and uses an optimization process to identify designs that will execute these applications most cost-effectively. Although the optimal machine configurations obtained vary for different applications, problem sizes, and budgets, the general trends for various applications are similar.
Accordingly, for the applications studied, assuming a one billion logic transistor equivalent area, we recommend building a Raw chip with approximately 1,000 tiles, 30 words/cycle global I/O, 20 Kbytes of local memory per tile, three to four words/cycle local communication bandwidth, and single-issue processors. This configuration will give performance near the global optimum for most applications.
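
The grain-size trade-off and the recommended configuration can be checked with simple arithmetic. The sketch below is not the SimpleFit model itself, whose architectural, application, and cost models are not given in this abstract. It only divides a fixed area budget evenly among tiles, which is a simplifying assumption, and the function name `tiles_on_chip` is hypothetical. The constants are the approximate figures quoted in the abstract.

```python
def tiles_on_chip(total_area: int, tile_area: int) -> int:
    """Whole tiles that fit in a fixed area budget (assumes area is split evenly)."""
    if tile_area <= 0:
        raise ValueError("tile_area must be positive")
    return total_area // tile_area


# Approximate figures quoted in the abstract.
TOTAL_AREA = 1_000_000_000   # logic-transistor-equivalent area of the chip
NUM_TILES = 1_000            # recommended tile count
LOCAL_MEMORY_KBYTES = 20     # recommended local memory per tile

# Area budget available to each tile under the even-split assumption.
area_per_tile = TOTAL_AREA // NUM_TILES

# Aggregate on-chip local memory implied by the recommendation.
total_memory_kbytes = NUM_TILES * LOCAL_MEMORY_KBYTES

# Grain-size trade-off: doubling each tile's area halves the tile count.
tiles_if_doubled = tiles_on_chip(TOTAL_AREA, 2 * area_per_tile)

print(area_per_tile, total_memory_kbytes, tiles_if_doubled)
```

Under these assumptions each tile has a budget of about one million logic-transistor equivalents, the chip holds about 20,000 Kbytes of local memory in total, and a tile twice as large would leave room for only about 500 tiles.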

Index Terms:
Multiprocessors, microprocessors, modeling, architecture.
Citation:
Csaba Andras Moritz, Donald Yeung, Anant Agarwal, "SimpleFit: A Framework for Analyzing Design Trade-Offs in Raw Architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 7, pp. 730-742, July 2001, doi:10.1109/71.940747