Parallel and Distributed Processing Symposium, International (2007)
Long Beach, CA, USA
Mar. 26, 2007 to Mar. 30, 2007
Jack Dongarra, Innovative Computing Laboratory, Department of Computer Science, University of Tennessee, Knoxville, USA. email@example.com
Jean-Francois Pineau, LIP, CNRS-ENS Lyon-INRIA-UCBL, Université de Lyon, École normale supérieure de Lyon, France. firstname.lastname@example.org
Yves Robert, LIP, CNRS-ENS Lyon-INRIA-UCBL, Université de Lyon, École normale supérieure de Lyon, France. email@example.com
Zhiao Shi, Innovative Computing Laboratory, Department of Computer Science, University of Tennessee, Knoxville, USA. firstname.lastname@example.org
Frederic Vivien, LIP, CNRS-ENS Lyon-INRIA-UCBL, Université de Lyon, École normale supérieure de Lyon, France. email@example.com
This paper is aimed at designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix product is well understood for homogeneous 2D arrays of processors (e.g., Cannon's algorithm and the ScaLAPACK outer-product algorithm), two key hypotheses render our work original and innovative:
- Centralized data. We assume that all matrix files originate from, and must be returned to, the master. The master distributes both data and computations to the workers (whereas in ScaLAPACK, input and output matrices are initially distributed among the participating resources). Typically, our approach is useful for speeding up MATLAB or SCILAB clients running on a server (which acts as the master and the initial repository of files).
- Limited memory. Because we investigate the parallelization of large problems, we cannot assume that full matrix panels can be stored in the worker memories and re-used for subsequent updates (as in ScaLAPACK). The amount of memory available in each worker is expressed as a given number of buffers, where a buffer can store a square block of matrix elements. These square blocks are chosen so as to harness the power of Level 3 BLAS routines; they are of size 80 or 100 on most platforms.
We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of MPI experiments conducted on a platform at the University of Tennessee.
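The memory model described in the abstract can be illustrated with a minimal sketch. It is not the paper's resource-selection or communication-ordering algorithm; it only shows, under simplifying assumptions, how a product can be computed when a single worker holds just three buffers, each storing one q x q block. All names (`get_block`, `block_update`, `master_worker_product`) are hypothetical, message passing is simulated by copying blocks, and the matrix order is assumed to be a multiple of q.

```python
def get_block(M, bi, bj, q):
    """Copy the q x q block at block-row bi, block-column bj of M.

    Stands in for the master sending one block to a worker.
    """
    return [row[bj * q:(bj + 1) * q] for row in M[bi * q:(bi + 1) * q]]


def block_update(Cb, Ab, Bb):
    """Worker-side update Cb += Ab * Bb on three q x q buffers."""
    q = len(Cb)
    for i in range(q):
        for k in range(q):
            a = Ab[i][k]
            for j in range(q):
                Cb[i][j] += a * Bb[k][j]


def master_worker_product(A, B, q):
    """Compute C = A * B block by block with a three-buffer worker.

    Returns C and the number of input blocks sent by the master.
    Assumes A and B are square with an order divisible by q.
    """
    n = len(A)
    assert n % q == 0, "matrix order must be a multiple of the block size"
    nb = n // q
    C = [[0] * n for _ in range(n)]
    blocks_sent = 0
    for bi in range(nb):
        for bj in range(nb):
            # Buffer 1: the C block stays on the worker until it is complete.
            Cb = [[0] * q for _ in range(q)]
            for bk in range(nb):
                # Buffers 2 and 3: one A block and one B block per update.
                Ab = get_block(A, bi, bk, q)
                Bb = get_block(B, bk, bj, q)
                blocks_sent += 2
                block_update(Cb, Ab, Bb)
            # The finished C block is returned to the master.
            for i in range(q):
                C[bi * q + i][bj * q:(bj + 1) * q] = Cb[i]
    return C, blocks_sent


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C, sent = master_worker_product(A, B, q=1)
    print(C, sent)
```

In this sketch no input block is ever reused, so the master sends 2 * (n/q)^3 input blocks in total. Giving a worker more buffers would let it keep blocks for several updates, which is the kind of trade-off the limited-memory hypothesis makes explicit.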
Y. Robert, F. Vivien, J. Dongarra, Z. Shi and J. Pineau, "Revisiting Matrix Product on Master-Worker Platforms," 2007 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Long Beach, CA, USA, 2007, p. 276.