Issue No.05 - May (2012 vol.61)
pp: 622-635
Junhee Yoo , Syst.-LSI Div., Samsung Electron. Co. Ltd., Yongin, South Korea
ABSTRACT
Memory-intensive operations and their memory access latency are often the performance bottleneck in parallel applications. In this paper, we investigate the concept of the active memory operation, an active data processing operation performed on the memory side. Using active memory operations, we can replace multiple memory-access transactions over the on-chip network, and the related computations on the processor side, with a smaller number of high-level transactions and computations on the memory side. To realize the concept, we have designed a special-purpose processor, called the active memory processor, which is tightly coupled with the memory and executes the active memory operations. In our case studies, we applied the concept to five real-world applications (parallelized JPEG, FFT, text indexing for data mining, histogram, and eikonal equation solver) running on a 36-tile architecture with 64 cores and four memory tiles, and found that the proposed approach can improve performance by 20.5 to 259.3 percent.
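As a rough illustration of the transaction-count argument in the abstract, the following toy model compares a processor-side histogram update with a single memory-side request. It is a sketch under stated assumptions, not the paper's processor or instruction set: the `MemoryTile` class, its `active_histogram` method, and the one-transaction-per-request cost model are all hypothetical, and the model ignores payload size and latency.

```python
# Toy model: count on-chip network transactions for a histogram update.
# All names are hypothetical; this does not reproduce the paper's design.

class MemoryTile:
    def __init__(self, size):
        self.data = [0] * size
        self.transactions = 0  # requests that cross the on-chip network

    def read(self, addr):
        self.transactions += 1
        return self.data[addr]

    def write(self, addr, value):
        self.transactions += 1
        self.data[addr] = value

    def active_histogram(self, base, samples):
        # Assumed cost: one high-level request. The read-modify-write
        # loop then runs on the memory side, not over the network.
        self.transactions += 1
        for s in samples:
            self.data[base + s] += 1


def histogram_processor_side(mem, base, samples):
    # Baseline: every bin update is a read followed by a write,
    # each crossing the network.
    for s in samples:
        count = mem.read(base + s)
        mem.write(base + s, count + 1)


samples = [0, 2, 2, 3, 1, 2]

baseline = MemoryTile(4)
histogram_processor_side(baseline, 0, samples)

active = MemoryTile(4)
active.active_histogram(0, samples)

print(baseline.transactions, active.transactions)
```

Both tiles end with the same bin counts; only the number of network requests differs, which is the effect the abstract attributes to moving the operation to the memory side.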
INDEX TERMS
network-on-chip, memory architecture, special-purpose processor, active memory processor, network-on-chip-based architecture, memory-intensive operation, memory access latency, active memory operation, active data processing operation, on-chip network, on-chip communication, high-level transactions, system-on-a-chip, memory management, random access memory, registers, generators, prefetching, shared memory system
CITATION
Junhee Yoo, "Active Memory Processor for Network-on-Chip-Based Architecture", IEEE Transactions on Computers, vol.61, no. 5, pp. 622-635, May 2012, doi:10.1109/TC.2011.66