An Efficient Algorithm for Out-of-Core Matrix Transposition
April 2002 (vol. 51 no. 4)
pp. 420-438

Efficient transposition of out-of-core matrices has been widely studied. These efforts have focused on reducing the number of I/O operations. However, on state-of-the-art architectures, memory-to-memory data transfer time and index computation time are also significant components of the overall time. In this paper, we propose an algorithm that accounts for both the index computation time and the I/O time. It reduces the total execution time by reducing the number of I/O operations and eliminating the index computation. Two techniques are employed: writing the data onto disk in predefined patterns and balancing the number of disk read and write operations. The index computation, an expensive operation involving two divisions and a multiplication, is eliminated by partitioning the memory into read and write buffers. The expensive in-processor permutation is replaced by data collection from the read buffer to the write buffer. Even though this partitioning may increase the number of I/O operations in some cases, it yields an overall reduction in execution time because the expensive index computation is eliminated. Our algorithm is analyzed using the well-known Linear Model and the Parallel Disk Model. Experimental results on Sun Enterprise, SGI R12000, and Pentium III show that our algorithm reduces the overall execution time by up to 50 percent compared with the best known algorithms in the literature.
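
The contrast between per-element index computation and buffer-to-buffer data collection can be illustrated with a small in-memory sketch. This is not the paper's out-of-core algorithm, whose disk layout and buffer sizing the abstract does not specify; it only assumes a row-major matrix held in a flat list. The function names `transpose_by_index` and `transpose_by_collection` are hypothetical.

```python
def transpose_by_index(flat, rows, cols):
    """Permute a row-major rows x cols matrix using per-element index arithmetic."""
    out = [None] * (rows * cols)
    for k, value in enumerate(flat):
        i = k // cols              # division 1: source row
        j = k % cols               # division 2 (remainder): source column
        out[j * rows + i] = value  # multiplication: destination offset
    return out


def transpose_by_collection(read_buf, rows, cols):
    """Collect elements from a read buffer into a write buffer.

    Only pointer increments are used: the source pointer advances by a
    fixed stride and the destination pointer advances sequentially, so
    no division or multiplication is needed per element.
    """
    write_buf = [None] * (rows * cols)
    dst = 0
    for j in range(cols):
        src = j                    # start at the top of column j
        for _ in range(rows):
            write_buf[dst] = read_buf[src]
            src += cols            # step down one row in the source
            dst += 1               # write sequentially
    return write_buf


matrix = [0, 1, 2,
          3, 4, 5]                 # 2 x 3, row-major
by_index = transpose_by_index(matrix, 2, 3)
by_collection = transpose_by_collection(matrix, 2, 3)
```

Both functions produce the same 3 x 2 result; the second spends extra memory on a separate write buffer in exchange for an inner loop free of divisions, which is the trade-off the abstract describes.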


Index Terms:
matrix transpose, data transfer time, index computation time, I/O time, out-of-core, execution time
Citation:
J. Suh, V.K. Prasanna, "An Efficient Algorithm for Out-of-Core Matrix Transposition," IEEE Transactions on Computers, vol. 51, no. 4, pp. 420-438, April 2002, doi:10.1109/12.995452