BUS models: A single disk or a pool of disks (e.g., a RAID system) is connected via a single I/O node to the system.
CLU models: Computational processors are logically or physically grouped (e.g., clustered) and each cluster contains one I/O node.
Synchronous I/O models (SIO): Processors perform I/O bursts synchronously (e.g., checkpoints and simulation data).
Asynchronous I/O models (AIO): Processors perform I/O bursts asynchronously (e.g., out-of-core computations).
5.1.1 Parameter Estimation
To calculate the parameters of the model (listed in Table 1), we need to measure some performance features that characterize the IBM SP-2. We use three simple low-level kernel applications, POLY1, COMMS1, and COMMS3, from the Parkbench benchmark suite [47].
The POLY1 kernel executes a polynomial evaluation and can be used to calculate the CPU floating-point instruction rate. The CPU rate measured for the IBM SP-2 is MFlop/s. The analysis of the BTIO algorithm shows that each time step requires approximately 840 MFlop for a class A problem size (830 MFlop in the parallel fraction, 10 MFlop in the serial fraction). The parameters can be computed as the ratio between the number of MFlop required by a CPU burst and the CPU performance rate . It follows that the serial and parallel CPU times are s and s, respectively.
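The ratio described above can be sketched in a few lines of Python. The MFlop counts (830 parallel, 10 serial) come from the text. The CPU rate used below is a hypothetical placeholder, not the value measured with POLY1, and the function and constant names are illustrative only.

```python
def cpu_burst_time(mflop: float, cpu_rate_mflops: float) -> float:
    """Return the CPU burst time in seconds: work (MFlop) / rate (MFlop/s)."""
    if cpu_rate_mflops <= 0:
        raise ValueError("CPU rate must be positive")
    return mflop / cpu_rate_mflops


# Work per time step for a class A problem, as stated in the text.
PARALLEL_MFLOP = 830.0
SERIAL_MFLOP = 10.0

# ASSUMPTION: placeholder rate for illustration only; substitute the
# rate actually measured with POLY1 on the target machine.
ASSUMED_CPU_RATE_MFLOPS = 100.0

serial_time = cpu_burst_time(SERIAL_MFLOP, ASSUMED_CPU_RATE_MFLOPS)
parallel_time = cpu_burst_time(PARALLEL_MFLOP, ASSUMED_CPU_RATE_MFLOPS)
print(f"serial: {serial_time:.2f} s, parallel: {parallel_time:.2f} s")
```

With a different measured rate, only `ASSUMED_CPU_RATE_MFLOPS` needs to change; the serial and parallel times scale inversely with it.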
The communication kernel COMMS1 measures the basic communication properties of a parallel computer by ping-ponging a message of a given length between two processors. The values obtained by COMMS1 are the startup time required to send a message and the communication transfer rate without contention. The measured values for the IBM SP-2 are and MByte/s [48]. The parameter is the reciprocal of the communication transfer rate s/MByte. The parameter can be calculated as the product of the startup time and the average number of messages sent by one processor during a communication burst. The analysis of the BTIO algorithm shows that the number of messages transmitted at each communication burst by one processor is proportional to , while the average message size is proportional to . The scale function is equal to . The number of messages transmitted by each processor and their average message size are reported in Table 2. Since the BTIO algorithm uses asynchronous communications, we choose .
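As a rough illustration of the two derivations above (a reciprocal of the transfer rate, and a product of startup time and message count), the following Python sketch may help. All numeric inputs are hypothetical placeholders rather than the COMMS1 measurements or the Table 2 entries, and the function name is an assumption made for this example.

```python
def comm_parameters(startup_time_s: float,
                    transfer_rate_mbyte_s: float,
                    messages_per_burst: float) -> tuple[float, float]:
    """Derive two communication parameters from COMMS1-style measurements.

    Returns (startup_overhead_s, time_per_mbyte_s):
      - startup_overhead_s: startup time multiplied by the average number
        of messages one processor sends during a communication burst
      - time_per_mbyte_s: reciprocal of the contention-free transfer rate
    """
    if transfer_rate_mbyte_s <= 0:
        raise ValueError("transfer rate must be positive")
    startup_overhead_s = startup_time_s * messages_per_burst
    time_per_mbyte_s = 1.0 / transfer_rate_mbyte_s
    return startup_overhead_s, time_per_mbyte_s


# ASSUMPTION: placeholder inputs for illustration only.
overhead, per_mbyte = comm_parameters(
    startup_time_s=50e-6,        # hypothetical startup time
    transfer_rate_mbyte_s=40.0,  # hypothetical transfer rate
    messages_per_burst=12,       # hypothetical message count
)
print(f"startup overhead: {overhead:.6f} s, transfer cost: {per_mbyte:.4f} s/MByte")
```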
The communication contention level is the most difficult parameter to estimate, because its value depends on both the interconnection network and the communication pattern. However, it is possible to compute an upper bound for by means of the COMMS3 kernel. In the COMMS3 kernel, each processor sends a message to all the other processors, then waits to receive all the messages directed at it. The value obtained by COMMS3 is the total saturation bandwidth and corresponds to the maximum throughput of the communication submodel [49]. The saturation bandwidth measured for the IBM SP-2 is MByte/s [48], yielding .
The BTIO implementation described in [43] was run on an IBM SP-2 with I/O nodes, each node capable of 10 MByte/s throughput during write operations. The amount of data written by the kernel every five time steps ( ) is 10 MByte, yielding s. During I/O bursts, BTIO uses a small number of large write operations; we therefore neglect the I/O startup time . Since the MPI-IO implementation described in [43] used synchronized I/O, we adopt the SIO model.
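A minimal sketch of the I/O burst time calculation follows. It assumes that the aggregate write throughput scales linearly with the number of I/O nodes, which the text does not state explicitly, and it neglects the I/O startup time as the text does. The 10 MByte data volume and the 10 MByte/s per-node throughput come from the text; the number of I/O nodes below is a hypothetical placeholder.

```python
def io_burst_time(data_mbyte: float,
                  io_nodes: int,
                  node_throughput_mbyte_s: float) -> float:
    """Return the I/O burst time in seconds, neglecting startup time.

    ASSUMPTION: aggregate throughput = io_nodes * per-node throughput.
    """
    if io_nodes <= 0 or node_throughput_mbyte_s <= 0:
        raise ValueError("io_nodes and throughput must be positive")
    return data_mbyte / (io_nodes * node_throughput_mbyte_s)


DATA_PER_BURST_MBYTE = 10.0     # written every five time steps (from the text)
NODE_THROUGHPUT_MBYTE_S = 10.0  # per-node write throughput (from the text)
ASSUMED_IO_NODES = 4            # ASSUMPTION: placeholder node count

burst_time = io_burst_time(DATA_PER_BURST_MBYTE, ASSUMED_IO_NODES,
                           NODE_THROUGHPUT_MBYTE_S)
print(f"I/O burst time: {burst_time:.2f} s")
```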
The results of the model are presented in Table 3. The table compares the experimental execution time with the execution time obtained from the SIO model. Note that the model overestimates the execution time with 9 processors. This happens because the assumption of exponentially distributed CPU time leads to pessimistic results.
P. Cremonesi is with Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy. E-mail: email@example.com.
C. Gennaro is with Istituto di Elaborazione della Informazione, CNR, Pisa, Italy. E-mail: firstname.lastname@example.org.
Manuscript received 5 Jan. 2000; revised 17 Aug. 2001; accepted 13 Feb. 2002.
For information on obtaining reprints of this article, please send e-mail to: email@example.com, and reference IEEECS Log Number 111195.
1. EDITOR'S NOTE: This paper originally appeared in TPDS, Vol. 13, No. 7. Unfortunately, due to extraordinary circumstances, errors were introduced into the figure captions of the paper. We are reprinting the paper in its entirety here.
Paolo Cremonesi received the MSc degree in aerospace engineering in 1992 and the PhD degree in computer science in 1996, both from the Politecnico di Milano, Milan, Italy. He is currently an assistant professor of computer science with the Politecnico di Milano. His research interests include high-performance computing modeling and other topics related to the performance evaluation of computer systems and networks.
Claudio Gennaro received the MSc degree in electronic engineering from the University of Pisa in 1994 and the PhD degree in computer science from the Politecnico di Milano in 1999. He is now a researcher with the National Research Council (CNR), Pisa, Italy. His current main research interests are performance evaluation of computer systems and parallel applications, similarity retrieval, and storage structures for multimedia information retrieval and multimedia document modeling.