Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT 2004)
Antibes Juan-les-Pins, France
Sept. 29, 2004 to Oct. 3, 2004
ISSN: 1089-795X
ISBN: 0-7695-2229-7
pp: 85-96
Leonardo Bachega , IBM T. J. Watson Research Center, Yorktown Heights, NY
Siddhartha Chatterjee , IBM T. J. Watson Research Center, Yorktown Heights, NY
Kenneth A. Dockser , IBM Corporation, Research Triangle Park, NC
John A. Gunnels , IBM T. J. Watson Research Center, Yorktown Heights, NY
Manish Gupta , IBM T. J. Watson Research Center, Yorktown Heights, NY
Fred G. Gustavson , IBM T. J. Watson Research Center, Yorktown Heights, NY
Christopher A. Lapkowski , IBM Corporation, Markham, ON, Canada
Gary K. Liu , IBM Corporation, Markham, ON, Canada
Mark P. Mendell , IBM Corporation, Markham, ON, Canada
Charles D. Wait , IBM Corporation, Rochester, MN
T. J. Chris Ward , IBM T. J. Watson Research Center, Yorktown Heights, NY
ABSTRACT
We describe the design, implementation, and evaluation of a dual-issue SIMD-like extension of the PowerPC 440 floating-point unit (FPU) core. This extended FPU is targeted both at IBM's massively parallel BlueGene/L machine and at more pervasive embedded platforms. It has several novel features, such as a computational crossbar and cross-load/store instructions, which enhance the performance of numerical codes. We further discuss the hardware-software co-design that was essential to fully realize the performance benefits of the FPU under the memory bandwidth limitations and high misaligned-access penalties imposed by the memory hierarchy of a BlueGene/L node. We describe several novel compiler and algorithmic techniques that take advantage of this architecture. Using both hand-optimized and compiled code for key linear algebra kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Preliminary performance data show that the algorithm-compiler-hardware combination delivers a significant fraction of peak floating-point performance for compute-bound kernels such as matrix multiplication, and a significant fraction of peak memory bandwidth for memory-bound kernels such as daxpy, while being largely insensitive to data alignment.
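To make the memory-bound daxpy kernel mentioned in the abstract concrete, the following is a minimal C sketch (not taken from the paper) of daxpy unrolled by two, the shape in which a 2-way SIMD FPU such as the extended PowerPC 440 unit could issue one paired fused multiply-add per iteration. The function name, signature, and scalar clean-up loop are illustrative assumptions, not the authors' implementation.

/* Illustrative sketch: daxpy (y = a*x + y) unrolled by two so that
 * the two multiply-adds in each iteration are independent and could
 * be paired on a dual-issue SIMD FPU.  Purely a hypothetical example. */
#include <stddef.h>

void daxpy(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;

    /* Main loop: two independent fused multiply-adds per iteration. */
    for (; i + 1 < n; i += 2) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
    }

    /* Scalar remainder for odd n. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}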
CITATION
Leonardo Bachega, Siddhartha Chatterjee, Kenneth A. Dockser, John A. Gunnels, Manish Gupta, Fred G. Gustavson, Christopher A. Lapkowski, Gary K. Liu, Mark P. Mendell, Charles D. Wait, T. J. Chris Ward, "A High-Performance SIMD Floating Point Unit for BlueGene/L: Architecture, Compilation, and Algorithm Design", Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT 2004), pp. 85-96, 2004, doi:10.1109/PACT.2004.10014