An Optimized Cell BE Special Function Library Generated by Coconut
August 2009 (vol. 58 no. 8)
pp. 1126-1138
Christopher Kumar Anand, McMaster University, Hamilton
Wolfram Kahl, McMaster University, Hamilton
Coconut, a tool for developing high-assurance, high-performance kernels for scientific computing, contains an extensible domain-specific language (DSL) embedded in Haskell. The DSL supports interactive prototyping and unit testing, simplifying the process of designing efficient implementations of common patterns. Unscheduled C and scheduled assembly language output are supported. Using the patterns, even nonexpert users can write efficient function implementations, leveraging special hardware features. A production-quality library of elementary functions for the Cell BE SPU compute engines has been developed. Coconut-generated and -scheduled vector functions were more than four times faster than commercially distributed functions written in C with intrinsics (a nicer syntax for in-line assembly), wrapped in loops and scheduled by spuxlc. All Coconut functions were faster, but the difference was larger for hard-to-approximate functions for which register-level SIMD lookups made a bigger difference. Other helpful features in the language include facilities for translating interval and polynomial descriptions between GHCi, a Haskell interpreter used to prototype in the DSL, and Maple, used for exploration and minimax polynomial generation. This makes it easier to match mathematical properties of the functions with efficient calculational patterns in the SPU ISA. By using single, literate source files, the resulting functions are remarkably readable.
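The abstract attributes the largest speedups to table lookups combined with polynomial approximation on hard-to-approximate functions. The following is a minimal scalar sketch of that general pattern in Python, not Coconut code and not the authors' implementation. It splits [1, 2) into segments, looks up per-segment coefficients by index, and evaluates a short polynomial. The segment count, the degree, and all function names are hypothetical choices made for illustration, and Chebyshev-node interpolation stands in for the Maple-generated minimax polynomials the abstract mentions.

```python
import math

SEGMENTS = 8   # number of lookup-table entries (hypothetical choice)
DEGREE = 3     # polynomial degree per segment (hypothetical choice)


def chebyshev_nodes(a, b, n):
    """Return n Chebyshev nodes on the interval [a, b]."""
    return [0.5 * (a + b) + 0.5 * (b - a) * math.cos((2 * k + 1) * math.pi / (2 * n))
            for k in range(n)]


def newton_coefficients(xs, ys):
    """Divided-difference coefficients of the polynomial interpolating (xs, ys)."""
    coeffs = list(ys)
    for j in range(1, len(xs)):
        for i in range(len(xs) - 1, j - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - j])
    return coeffs


def build_table(f, lo, hi):
    """Precompute one (nodes, coefficients) pair per segment of [lo, hi)."""
    width = (hi - lo) / SEGMENTS
    table = []
    for s in range(SEGMENTS):
        xs = chebyshev_nodes(lo + s * width, lo + (s + 1) * width, DEGREE + 1)
        table.append((xs, newton_coefficients(xs, [f(x) for x in xs])))
    return table


# Stand-in for near-minimax coefficients; a production library would
# generate true minimax polynomials offline.
LOG2_TABLE = build_table(math.log2, 1.0, 2.0)


def log2_mantissa(x):
    """Approximate log2(x) for 1 <= x < 2 by table lookup plus a polynomial."""
    index = min(int((x - 1.0) * SEGMENTS), SEGMENTS - 1)  # the "lookup" step
    xs, coeffs = LOG2_TABLE[index]
    acc = coeffs[-1]
    for k in range(len(coeffs) - 2, -1, -1):              # Newton-form Horner loop
        acc = acc * (x - xs[k]) + coeffs[k]
    return acc


def approx_log2(x):
    """Approximate log2(x) for x > 0 via range reduction to [1, 2)."""
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= mantissa < 1
    return (exponent - 1) + log2_mantissa(2.0 * mantissa)
```

On a SIMD machine the indexed lookup would be replaced by a register-level shuffle that selects coefficients for several inputs at once; this scalar version only shows the arithmetic structure.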

Index Terms:
Special function approximations, parallel and vector implementations, code generation, specialized application languages, SIMD processors, applicative (functional) programming.
Christopher Kumar Anand, Wolfram Kahl, "An Optimized Cell BE Special Function Library Generated by Coconut," IEEE Transactions on Computers, vol. 58, no. 8, pp. 1126-1138, Aug. 2009, doi:10.1109/TC.2008.223