Issue No. 03 - March (2017 vol. 66)
Wangchen Dai, Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong
Donald Donglong Chen, Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong
Ray C. C. Cheung, Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong
Cetin Kaya Koc, Department of Computer Science, University of California Santa Barbara, Santa Barbara, CA
Modular multiplication is the most time-consuming operation in number-theoretic cryptographic algorithms involving large integers, such as RSA and Diffie-Hellman. Implementations show that RSA with moduli larger than 1,024 bits spends more than 75 percent of its time in the modular multiplication function. Fast multiplier architectures use parallelism and pipelining to minimize delay and increase throughput. However, such designs are large in area and low in efficiency. In this paper, we integrate the fast Fourier transform (FFT) method into McLaughlin's framework and present an improved FFT-based Montgomery modular multiplication (MMM) algorithm that achieves high area-time efficiency. Compared to previous FFT-based designs, we avoid the zero-padding operation by computing the modular multiplication steps directly using cyclic and nega-cyclic convolutions. Thus, we reduce the convolution length by half. Furthermore, supported by the number-theoretic weighted transform, the FFT algorithm is used to provide fast convolution computation. We also introduce a general method for efficient parameter selection for the proposed algorithm. Architectures with single and double butterfly structures are designed to obtain low area-latency solutions, which we implemented on Xilinx Virtex-6 FPGAs. The results show that our work offers better area-latency efficiency than the state-of-the-art FFT-based MMM architectures for operand sizes of 1,024 bits and above. We have obtained area-latency efficiency improvements of up to 50.9 percent for 1,024-bit, 41.9 percent for 2,048-bit, 37.8 percent for 4,096-bit and 103.2 percent for 7,680-bit operands. Furthermore, for transforms of length 64 and above, our work also outperforms these architectures in operating latency while running at a high clock frequency.
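To illustrate why cyclic and nega-cyclic convolutions make zero-padding unnecessary, the following minimal Python sketch shows that a length-n cyclic convolution of digit vectors yields the product modulo B^n - 1, and a length-n nega-cyclic convolution yields the product modulo B^n + 1, where B is the digit base. This is not the architecture or algorithm of the paper: the convolutions are computed with direct quadratic loops rather than an FFT or number-theoretic weighted transform, and all function names and parameters (`to_digits`, `from_digits`, `cyclic_convolution`, `negacyclic_convolution`, `base_bits`, `n`) are hypothetical choices made for illustration.

```python
def to_digits(x, base_bits, n):
    """Split a non-negative integer into n digits of base_bits bits each (little-endian)."""
    mask = (1 << base_bits) - 1
    return [(x >> (base_bits * i)) & mask for i in range(n)]


def from_digits(digits, base_bits):
    """Recombine (possibly negative or oversized) digits into one integer."""
    return sum(d * (1 << (base_bits * i)) for i, d in enumerate(digits))


def cyclic_convolution(a, b):
    """Length-n cyclic convolution: index i + j wraps around modulo n."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] += a[i] * b[j]
    return c


def negacyclic_convolution(a, b):
    """Length-n nega-cyclic convolution: wrapped terms are subtracted."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - n] -= a[i] * b[j]
    return c


if __name__ == "__main__":
    base_bits, n = 8, 4            # illustrative sizes only: B = 2**8, B**n = 2**32
    x, y = 0xDEADBEEF, 0x12345678  # arbitrary 32-bit operands
    a, b = to_digits(x, base_bits, n), to_digits(y, base_bits, n)

    m_minus = (1 << (base_bits * n)) - 1   # B**n - 1
    m_plus = (1 << (base_bits * n)) + 1    # B**n + 1

    cyc = from_digits(cyclic_convolution(a, b), base_bits) % m_minus
    neg = from_digits(negacyclic_convolution(a, b), base_bits) % m_plus
    print(cyc == (x * y) % m_minus, neg == (x * y) % m_plus)
```

Both convolutions operate on vectors of length n, whereas recovering the full double-length product through a linear convolution would require padding each operand to length 2n. A practical design would replace the quadratic loops with a fast transform.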
Computer architecture, Hardware, Algorithm design and analysis, Spectral analysis, Periodic structures, Fast Fourier transforms
Wangchen Dai, Donald Donglong Chen, Ray C. C. Cheung, Cetin Kaya Koc, "Area-Time Efficient Architecture of FFT-Based Montgomery Multiplication", IEEE Transactions on Computers, vol. 66, no. 3, pp. 375-388, March 2017, doi:10.1109/TC.2016.2601334