Issue No.03 - March (2008 vol.57)
pp: 375-388
ABSTRACT
In general, the main purpose of using Processor-In-Memory (PIM) modules is to dramatically increase Data-Level Parallelism (DLP) and to avoid the limited issue rate of current systems (even those that include SIMD extensions), which is caused by limited data bandwidth and a limited number of functional units. Our approach is to divide the PIM module into hundreds of smaller PIMs so that each of them can execute motion estimation for a group of macroblocks in parallel. We also design the logic in each PIM to execute in a highly pipelined fashion so that even more parallelism can be exploited. The main contribution of this paper is the presentation of architectural techniques that can be used in the PIM module to overcome the addressing and data-sharing overhead incurred when these smaller PIMs are used. We apply these techniques to motion estimation, which has been reported to take the majority of the execution time of MPEG encoding and has therefore been widely studied. With our paradigm and techniques, the host processor is relieved of the most computationally demanding and data-intensive portions of the workload, which should therefore yield a significant performance gain.
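The partitioning idea in the abstract can be illustrated in software. The following is a minimal sketch, not the paper's architecture: it assumes full-search block matching with a sum-of-absolute-differences (SAD) cost, and it uses a thread pool as a stand-in for the smaller PIMs. The names `MB`, `SEARCH`, `estimate_group`, and `motion_estimate` are hypothetical, and the 4x4 macroblock size is chosen only to keep the example small (MPEG macroblocks are 16x16).

```python
from concurrent.futures import ThreadPoolExecutor

MB = 4       # macroblock edge in pixels (assumption; MPEG uses 16)
SEARCH = 2   # search range in pixels around each macroblock (assumption)

def sad(cur, ref, bx, by, dx, dy):
    """Sum of absolute differences between a macroblock of the current
    frame and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(MB):
        for x in range(MB):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total

def best_vector(cur, ref, bx, by):
    """Full search: return the (dx, dy) with the lowest SAD in the window."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-SEARCH, SEARCH + 1):
        for dx in range(-SEARCH, SEARCH + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= bx + dx <= w - MB and 0 <= by + dy <= h - MB):
                continue
            cost = sad(cur, ref, bx, by, dx, dy)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]

def estimate_group(cur, ref, blocks):
    """Work done by one unit: motion vectors for its group of macroblocks."""
    return {(bx, by): best_vector(cur, ref, bx, by) for bx, by in blocks}

def motion_estimate(cur, ref, n_units):
    """Split the macroblocks into n_units groups and process them in parallel."""
    h, w = len(cur), len(cur[0])
    blocks = [(bx, by)
              for by in range(0, h - MB + 1, MB)
              for bx in range(0, w - MB + 1, MB)]
    groups = [blocks[i::n_units] for i in range(n_units)]
    vectors = {}
    with ThreadPoolExecutor(max_workers=n_units) as pool:
        for part in pool.map(lambda g: estimate_group(cur, ref, g), groups):
            vectors.update(part)
    return vectors
```

Each group is independent once it has its macroblocks and the surrounding reference pixels, so the result does not depend on the number of units. In this sketch every worker reads the shared frames directly; distributing that reference data to many small units is the addressing and data-sharing overhead the abstract refers to.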
INDEX TERMS
Special-Purpose and Application-Based Systems, Real-time and embedded systems
CITATION
Jung-Yup Kang, Sandeep K. Gupta, Jean-Luc Gaudiot, "An Efficient Data-Distribution Mechanism in a Processor-In-Memory (PIM) Architecture Applied to Motion Estimation", IEEE Transactions on Computers, vol.57, no. 3, pp. 375-388, March 2008, doi:10.1109/TC.2007.70818