Issue No. 12 - Dec. 2012 (vol. 61)

pp: 1724-1736

Ardavan Pedram, The University of Texas at Austin, Austin

Robert A. van de Geijn, The University of Texas at Austin, Austin

Andreas Gerstlauer, The University of Texas at Austin, Austin

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2012.132

ABSTRACT

As technology is reaching physical limits, reducing power consumption is a key issue on our path to sustained performance. In this paper, we study fundamental tradeoffs and limits in efficiency (as measured in energy per operation) that can be achieved for an important class of kernels, namely the level-3 Basic Linear Algebra Subprograms (BLAS). It is well-accepted that specialization is the key to efficiency. This paper establishes a baseline by studying GEneral Matrix-matrix Multiplication (GEMM) on a variety of custom and general-purpose CPU and GPU architectures. Our analysis shows that orders of magnitude improvements in efficiency are possible with relatively simple customizations and fine-tuning of memory hierarchy configurations. We argue that these customizations can be generalized to perform other representative linear algebra operations. In addition to exposing the sources of inefficiencies in current CPUs and GPUs, our results show our prototype Linear Algebra Processor (LAP) implementing Double-precision GEMM (DGEMM) can achieve 600 GFLOPS while consuming less than 25 Watts in standard 45 nm technology, which is up to 50 times more energy efficient than cutting-edge CPUs.
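The headline figures in the abstract can be turned into efficiency numbers with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 25 W figure is treated as an upper bound on power (the abstract says "less than 25 Watts"), and the CPU figure is merely what the "up to 50 times" claim would imply, not a value reported in the abstract.

```python
def gflops_per_watt(gflops, watts):
    """Throughput per unit power, in GFLOPS/W."""
    return gflops / watts


def picojoules_per_flop(gflops, watts):
    """Energy per floating-point operation, in pJ.

    W / (GFLOP/s) = J/GFLOP = nJ/FLOP; multiply by 1e3 to get pJ/FLOP.
    """
    return watts / gflops * 1e3


# Figures quoted in the abstract for the LAP running DGEMM.
LAP_GFLOPS = 600.0
LAP_WATTS_UPPER_BOUND = 25.0   # "less than 25 Watts"
MAX_GAIN_OVER_CPU = 50.0       # "up to 50 times"

# Because 25 W is an upper bound, this is a lower bound on efficiency.
lap_eff = gflops_per_watt(LAP_GFLOPS, LAP_WATTS_UPPER_BOUND)
# Likewise an upper bound on energy per operation.
lap_pj = picojoules_per_flop(LAP_GFLOPS, LAP_WATTS_UPPER_BOUND)
# Assumption: efficiency of the least efficient CPU in the comparison,
# inferred from the 50x ratio rather than taken from the paper.
implied_cpu_eff = lap_eff / MAX_GAIN_OVER_CPU

print(f"LAP efficiency: at least {lap_eff:.1f} GFLOPS/W")
print(f"LAP energy per flop: at most {lap_pj:.1f} pJ")
print(f"Implied CPU efficiency: about {implied_cpu_eff:.2f} GFLOPS/W")
```

Under these assumptions the LAP delivers at least 24 GFLOPS/W, or at most about 41.7 pJ per double-precision operation.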

INDEX TERMS

Bandwidth, System-on-a-chip, Linear algebra, Algorithm design and analysis, Field programmable gate arrays, Memory management, Energy efficiency, Energy management, Low power electronics, special-purpose hardware, Low-power design, energy-aware systems, performance analysis and design aids, matrix multiplication, memory hierarchy, level-3 BLAS

CITATION

Ardavan Pedram, Robert A. van de Geijn, Andreas Gerstlauer, "Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures",

*IEEE Transactions on Computers*, vol. 61, no. 12, pp. 1724-1736, Dec. 2012, doi:10.1109/TC.2012.132