
Issue No. 08, Aug. 2012 (vol. 61)

pp. 1110-1126

A. G. Andreou , Dept. of Electr. & Comput. Eng., Johns Hopkins Univ., Baltimore, MD, USA

DOI: http://doi.ieeecomputersociety.org/10.1109/TC.2011.169

ABSTRACT

Beginning with Amdahl's law, we derive a general objective function that links parallel processing performance gains at the system level to energy and delay in the subsystem microarchitecture structures. The objective function employs parameterized models of computation and communication to represent the characteristics of processors, memories, and communications networks. The interaction of these microarchitectural elements defines global system performance in terms of energy-delay cost. Following the derivation, we demonstrate the utility of the objective function by applying it to the problem of Chip Multiprocessor (CMP) architecture exploration. Given a set of application and architectural parameters, we solve for the optimal CMP architecture in six different architectural optimization examples. We find the parameters that minimize the total system cost, as defined by the objective function, under the area constraint of a single die. The analytical formulation presented in this paper is general and offers a foundation for the quantitative and rapid evaluation of computer architectures under different constraints, including the area of a single die.
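
The abstract takes Amdahl's law as its starting point and then optimizes a CMP under a single-die area budget. As a minimal sketch of that baseline only, and not the paper's energy-delay objective function, the Python below evaluates Amdahl's law and the symmetric-multicore extension from Hill and Marty's "Amdahl's Law in the Multicore Era" (Computer, 2008). That model assumes Pollack's rule (a core built from r base-core equivalents, or BCEs, performs roughly sqrt(r) times faster than one BCE) and splits a die of n BCEs into n/r identical cores. The function names and the example parameters (f = 0.975, n = 256) are illustrative assumptions; the sketch omits the energy, delay, memory, and network terms the paper adds.

```python
import math
from typing import Tuple

# Illustrative baseline only: Amdahl's law plus the Hill-Marty symmetric-CMP
# extension. This is NOT the paper's energy-delay objective function.


def amdahl_speedup(f: float, n: float) -> float:
    """Amdahl's law: speedup when a fraction f of the work runs n-way parallel."""
    if not 0.0 <= f <= 1.0 or n < 1:
        raise ValueError("need 0 <= f <= 1 and n >= 1")
    return 1.0 / ((1.0 - f) + f / n)


def core_perf(r: int) -> float:
    """Pollack's rule (assumed): a core of r BCEs is about sqrt(r) times faster than one BCE."""
    return math.sqrt(r)


def symmetric_speedup(f: float, n: int, r: int) -> float:
    """Hill-Marty symmetric CMP: a die of n BCEs holds n / r identical cores of r BCEs.

    The serial fraction (1 - f) runs on a single core; the parallel fraction f
    is spread over all n / r cores. Speedup is relative to one 1-BCE core.
    """
    perf = core_perf(r)
    return 1.0 / ((1.0 - f) / perf + f * r / (perf * n))


def best_symmetric_design(f: float, n: int) -> Tuple[int, float]:
    """Exhaustively try every core size r that evenly divides the n-BCE area budget."""
    candidates = [r for r in range(1, n + 1) if n % r == 0]
    best_r = max(candidates, key=lambda r: symmetric_speedup(f, n, r))
    return best_r, symmetric_speedup(f, n, best_r)


if __name__ == "__main__":
    # Hypothetical example: 97.5% parallel code on a 256-BCE die.
    f, n = 0.975, 256
    r, s = best_symmetric_design(f, n)
    print(f"Amdahl with {n} 1-BCE cores: {amdahl_speedup(f, n):.1f}x")
    print(f"Best symmetric design: {n // r} cores of {r} BCEs, {s:.1f}x")
```

With these illustrative parameters the search favors 32 cores of 8 BCEs each over 256 minimal cores, showing the core-size versus core-count trade-off under a fixed die area. The objective function described in the abstract generalizes this kind of search by pricing each configuration in energy-delay cost rather than speedup alone.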

INDEX TERMS

parallel processing, microprocessor chips, single die area, Amdahl law, general objective function, parallel processing performance, system level, subsystem microarchitecture structures, memories, communications networks, microarchitectural elements, global system performance, energy-delay cost, chip multiprocessor architecture exploration, architectural parameters, optimal CMP architecture, architectural optimization, computer architectures, delay, program processors, computer architecture, parallel processing, hidden Markov models, algorithm design and analysis, optimization, chip-multiprocessor architecture, modeling, evaluation, design exploration, and optimization of multiprocessor systems

CITATION

A. G. Andreou, "Beyond Amdahl's Law: An Objective Function That Links Multiprocessor Performance Gains to Delay and Energy",

*IEEE Transactions on Computers*, vol. 61, no. 8, pp. 1110-1126, Aug. 2012, doi:10.1109/TC.2011.169