
Issue No.12 - December (2011 vol.60)

pp: 1692-1703

Zhimin Chen, Virginia Polytechnic Institute and State University, Blacksburg

Patrick Schaumont, Virginia Polytechnic Institute and State University, Blacksburg

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2010.256

ABSTRACT

Montgomery multiplication is one of the cornerstones of public-key cryptography, with important applications in the RSA algorithm, in Elliptic-Curve Cryptography, and in the Digital Signature Standard. An efficient implementation of this long-word-length modular multiplication is crucial for the performance of public-key cryptography. Given the strong momentum of the shift from single-core to multicore systems, we present a parallel software implementation of Montgomery multiplication for multicore systems. Our analysis shows that the proposed scheme, pSHS, partitions the task in a balanced way, so that each core receives the same amount of work. We also analyze the impact of intercore communication overhead on the performance of pSHS. The analysis reveals that pSHS achieves high performance, scales across different numbers of cores, and remains stable when the communication latency changes. It also shows how to set the scheme's parameters to achieve optimal performance. We implemented pSHS on a prototype multicore architecture configured in a Field Programmable Gate Array (FPGA). Compared with the sequential implementation, pSHS accelerates 2,048-bit Montgomery multiplication by 1.97, 3.68, and 6.13 times on two-core, four-core, and eight-core architectures, respectively, with a communication latency of 100 clock cycles.
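For readers unfamiliar with the operation the abstract refers to, the following is a minimal sequential sketch of Montgomery multiplication in Python. It is not the pSHS scheme and does not reflect the authors' word-level partitioning; it only illustrates the arithmetic being accelerated. The function names (`montgomery_setup`, `mont_mul`, `to_mont`, `from_mont`) and the choice of R = 2^k are assumptions made for this illustration, and the code relies on Python's arbitrary-precision integers (Python 3.8 or later for the modular inverse via `pow`).

```python
def montgomery_setup(n: int, k: int) -> int:
    """Return n' with n * n' ≡ -1 (mod R), where R = 2**k.

    Assumes n is odd and n < R (hypothetical helper for this sketch).
    """
    if n % 2 == 0 or n >= (1 << k):
        raise ValueError("n must be odd and smaller than 2**k")
    r = 1 << k
    return (-pow(n, -1, r)) % r


def mont_mul(a: int, b: int, n: int, k: int, n_prime: int) -> int:
    """Return a * b * R^(-1) mod n for 0 <= a, b < n, without dividing by n."""
    mask = (1 << k) - 1
    t = a * b
    # m is chosen so that t + m*n is divisible by R = 2**k.
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> k          # exact division by R
    # u < 2n here, so one conditional subtraction suffices.
    return u - n if u >= n else u


def to_mont(x: int, n: int, k: int) -> int:
    """Map x to its Montgomery form x * R mod n."""
    return (x << k) % n


def from_mont(x: int, n: int, k: int, n_prime: int) -> int:
    """Map a Montgomery-form value back to the ordinary residue."""
    return mont_mul(x, 1, n, k, n_prime)


def modmul(a: int, b: int, n: int, k: int) -> int:
    """Ordinary a * b mod n computed through the Montgomery domain."""
    n_prime = montgomery_setup(n, k)
    am, bm = to_mont(a % n, n, k), to_mont(b % n, n, k)
    return from_mont(mont_mul(am, bm, n, k, n_prime), n, k, n_prime)
```

The conversions into and out of Montgomery form cost ordinary modular reductions, so the method pays off when many multiplications share one modulus, as in modular exponentiation. For a 2,048-bit operand one would call `modmul(a, b, n, 2048)` with an odd 2,048-bit modulus `n`.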

INDEX TERMS

Montgomery multiplication, public-key cryptography, parallel programming, tiled processor.

CITATION

Zhimin Chen, Patrick Schaumont, "A Parallel Implementation of Montgomery Multiplication on Multicore Systems: Algorithm, Analysis, and Prototype",

*IEEE Transactions on Computers*, vol. 60, no. 12, pp. 1692-1703, December 2011, doi:10.1109/TC.2010.256