Cache-Based Memory Copy Hardware Accelerator for Multicore Systems
November 2010 (vol. 59 no. 11)
pp. 1494-1507
Filipa Duarte, Holst Centre/IMEC, Eindhoven
Stephan Wong, Delft University of Technology, The Netherlands
In this paper, we present a new architecture for a cache-based memory copy hardware accelerator in a multicore system that supports message passing. The accelerator speeds up memory data movements, in particular memory copies. We use an analysis based on open-queuing theory to study the utilization of our accelerator in a multicore system. To model the system correctly, we gather the necessary information with a full-system simulator. We present both the simulation results and the analytical results. We demonstrate the advantages of our solution on a full-system simulator using several applications: the STREAM benchmark and the receiver side of the TCP/IP stack. Our accelerator provides speedups of 2.96 to 4.61 for the receiver side of the TCP/IP stack, reduces the number of instructions by 26 percent to 44 percent, and achieves a higher cache hit rate. According to the analytical model, our accelerator reduces the average number of cycles executed per instruction by up to 50 percent for one of the CPUs in the multicore system.
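As a rough illustration of what an open-queuing analysis of utilization looks like, the sketch below computes the standard M/M/1 quantities for a single shared server. It is not the model used in the paper: the function name `mm1_metrics`, the choice of an M/M/1 queue, and the example rates are all assumptions made here for illustration only.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Textbook M/M/1 open-queue metrics (illustrative, not the paper's model).

    arrival_rate: mean requests arriving per cycle (lambda)
    service_rate: mean requests the server completes per cycle (mu)
    Returns (utilization, mean requests in system, mean cycles in system).
    """
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("rates must be non-negative, service_rate positive")
    utilization = arrival_rate / service_rate  # rho = lambda / mu
    if utilization >= 1:
        raise ValueError("queue is unstable when arrival_rate >= service_rate")
    mean_in_system = utilization / (1 - utilization)   # L = rho / (1 - rho)
    mean_time = 1 / (service_rate - arrival_rate)      # W = 1 / (mu - lambda)
    return utilization, mean_in_system, mean_time


if __name__ == "__main__":
    # Hypothetical rates: one copy request every 50 cycles on average,
    # each taking 20 cycles on average to serve.
    rho, L, W = mm1_metrics(arrival_rate=0.02, service_rate=0.05)
    print(f"utilization={rho:.2f}, in system={L:.3f}, cycles in system={W:.1f}")
```

With these hypothetical rates the server is busy 40 percent of the time, and the values satisfy Little's law (mean requests in system equals arrival rate times mean time in system), which is a quick consistency check for any open-queue model of this kind.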


Index Terms:
Hardware accelerator, cache, multicore, TCP/IP, open-queuing theory.
Filipa Duarte, Stephan Wong, "Cache-Based Memory Copy Hardware Accelerator for Multicore Systems," IEEE Transactions on Computers, vol. 59, no. 11, pp. 1494-1507, Nov. 2010, doi:10.1109/TC.2010.41