Issue No.02 - February (2012 vol.61)
pp: 237-250
Felipe Cabarcas, Barcelona Supercomputing Center, Barcelona and Universidad de Antioquia, Medellin
Marc Gonzalez, Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, Barcelona
Alex Ramirez, Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, Barcelona
Xavier Martorell, Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, Barcelona
Eduard Ayguade, Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, Barcelona
ABSTRACT
Multimedia extensions based on Single-Instruction Multiple-Data (SIMD) units are widespread and have long been used in processors and accelerators (e.g., the Cell SPEs). To keep power consumption low and the design simple, SIMD units usually impose significant memory alignment constraints. These constraints increase the complexity of the code generated by the compiler because, in the general case, the compiler cannot be sure that the data are properly aligned. To cope with this, the ISA provides either unaligned memory load and store instructions or a special set of instructions that perform realignment in software. In this paper, we propose a hardware realignment unit that takes advantage of the DMA transfers needed in accelerators with local memories. While the data are being transferred, they are realigned on the fly by our realignment unit and stored at the desired alignment in the accelerator memory. This mechanism can help programmers better organize data in the accelerator memory so that the accelerator can, where possible, access the data with no special instructions. Finally, the data are also properly realigned when they are written back to main memory. Our experiments with nine applications show that our approach does not penalize the bandwidth of the DMA transfers.
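The contrast drawn in the abstract can be illustrated with a small functional model. This is only a sketch under stated assumptions, not the hardware design proposed in the paper: the 16-byte vector width and the names `aligned_load`, `software_realigned_load`, and `dma_get_realigned` are hypothetical and chosen for illustration. The first path mimics software realignment (two aligned loads combined with a shift); the second mimics a DMA transfer that places the data at the desired alignment in local memory, so that a plain aligned load is enough afterwards.

```python
VECTOR_BYTES = 16  # assumed SIMD vector width (128 bits); not taken from the paper


def aligned_load(mem, addr):
    """Model an alignment-constrained vector load: unaligned addresses are rejected."""
    if addr % VECTOR_BYTES != 0:
        raise ValueError("unaligned vector load")
    return mem[addr:addr + VECTOR_BYTES]


def software_realigned_load(mem, addr):
    """Software realignment: fetch the two enclosing aligned vectors and shift."""
    base = addr - addr % VECTOR_BYTES
    shift = addr - base
    low = aligned_load(mem, base)
    if shift == 0:
        return low
    high = aligned_load(mem, base + VECTOR_BYTES)
    # Concatenate both vectors and extract the window that starts at addr.
    return (low + high)[shift:shift + VECTOR_BYTES]


def dma_get_realigned(main_mem, src_addr, local_mem, dst_addr, size):
    """Functional model of a realigning DMA 'get': data land at dst_addr in local
    memory whatever the alignment of src_addr (hypothetical interface)."""
    local_mem[dst_addr:dst_addr + size] = main_mem[src_addr:src_addr + size]


main_memory = bytes(range(64))
local_store = bytearray(64)

# Path 1: the accelerator realigns in software on every access.
via_software = software_realigned_load(main_memory, 5)

# Path 2: the transfer realigns once; the accelerator then uses aligned loads only.
dma_get_realigned(main_memory, 5, local_store, 16, 32)
via_dma = aligned_load(local_store, 16)
```

In this model both paths yield the same 16 bytes, but the second moves the realignment cost out of the accelerator's instruction stream and into the transfer itself, which is the idea the abstract describes.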
INDEX TERMS
DMA, multicores, alignment, SIMD units.
CITATION
Felipe Cabarcas, Marc Gonzalez, Alex Ramirez, Xavier Martorell, Eduard Ayguade, "DMA++: On the Fly Data Realignment for On-Chip Memories", IEEE Transactions on Computers, vol.61, no. 2, pp. 237-250, February 2012, doi:10.1109/TC.2010.255