Issue No. 12 - December 2002 (vol. 13), pp. 1320-1332
Published by the IEEE Computer Society
ABSTRACT
Abstract—This paper introduces queuing network models for the performance analysis of SPMD applications executed on general-purpose parallel architectures such as MIMD machines and clusters of workstations. The models are based on the pattern of computation, communication, and I/O operations of typical parallel applications. Analysis of the models leads to the definition of speedup surfaces, which capture the relative influence of processor and I/O parallelism and show the effects of different hardware and software components on performance. Since the parameters of the models correspond to measurable program and hardware characteristics, the models can be used to anticipate the performance behavior of a parallel application as a function of the target architecture (i.e., number of processors, number of disks, I/O topology, etc.).
1. Introduction
The objective of this paper is to study the performance of general-purpose parallel architectures when executing SPMD (Single Program Multiple Data) applications. A family of queuing network models is introduced that describes the behavior of SPMD applications under the assumption that the applications have a "regular" cyclic structure. The models require the knowledge of some hardware characteristics of the target architecture, which can be obtained by running simple benchmark programs available in the literature. Furthermore, the knowledge of a few characteristics of the application being executed, such as the amount of data transferred during the communication and I/O phases, is needed. Nevertheless, the proposed approach is capable of modeling a wide range of parallel applications (from I/O bound to communication intensive applications). 1
The parallel systems under consideration have a generic architecture (distributed or shared memory) with a finite number of homogeneous processors and a finite number of I/O devices (e.g., disks). The systems are assumed to be monoprogrammed and executing one task per processor. The performance analysis is limited to SPMD applications. Communication can be achieved via explicit message passing or by referring to shared variables, and can be synchronous or asynchronous. I/O operations can be reads and/or writes.
In a typical SPMD program, all the processes alternate computation, communication, and I/O in a cyclic fashion. Studies on the characterization of scientific applications [ 1], [ 2], [ 3], [ 4] show that most data-parallel applications exhibit a behavior that is rather regular and cyclic over time. The properties of many scientific parallel programs result in an execution behavior that can be naturally partitioned into disjoint intervals, each of which consists of one computation burst followed by one I/O burst (see Fig. 1).


Fig. 1. An example of a program phase decomposed into cycles and bursts.




Examples of scientific parallel applications with regular and cyclic behavior are: H3expresso, a parallel solver for the full 3D Einstein equations used to construct dynamic black hole spacetimes [ 5]; PRISM, a parallel implementation of a 3D numerical simulation of the Navier-Stokes equations [ 6]; ESCAT, a parallel implementation of the Schwinger Multichannel method for calculating low-energy electron-molecule collisions [ 6]. More examples can be found in [ 7], [ 8].
Extending the model presented in [ 9] and [ 10], we consider an SPMD program to be composed of one or more phases. Each phase is in turn composed of a number of statistically identical CPU, communication, and, possibly, I/O bursts that are executed in a cyclic fashion. Different phases correspond to different stages of a multistage application (e.g., preprocessing, core computation, post-processing). The cyclic structure of each phase usually derives from the presence in the application of outer iteration loops (e.g., the time evolution of some physical model, or the convergence of some approximate algorithm). The paper shows that, for each phase, it is possible to define a queuing network model that describes the performance of the coupled application/architecture system.
The paper is organized as follows: Section 2 describes other works related to the performance modeling of parallel applications. Section 3 formally defines the class of parallel applications and systems to which our analysis can be applied. The section also describes the computation, communication, and I/O queuing network submodels. Section 4 integrates the submodels into a family of coupled application/architecture models. Section 5 discusses the validation of the models by comparing the predicted performance results with those of real parallel applications. The applications considered are BTIO and QCRD. Finally, Section 6 summarizes the results and the plan for future research.
2. Related Work
The first effort to model the performance of parallel programs is due to Amdahl and the well-known homonymous law that relates the number of processors to the bound on the speedup that may be expected from parallel processing [ 11]. Several extensions to Amdahl's law have been proposed. Gustafson in [ 12] and [ 13] and Gustafson et al. in [ 14] introduce the concept of scaled speedup. Flatt and Kennedy [ 15] investigate the impact of synchronization and communication overhead on the performance of parallel processors. Eager et al. [ 16] study the relationship between speedup and efficiency. Wu and Li [ 17] propose a formal definition of scalability and discuss the scalability of cluster systems. Rosti et al. [ 9] extend Amdahl's law to a three-dimensional space that includes processors and disks. All the above approaches have the advantage of expressing the speedup analytically in closed form. However, these models are limited because they neglect the effects of important architectural characteristics.
Other approaches use task graph models to represent the intertask precedence relations within a program, the task execution times, and the amount of data transferred among tasks. In the case of program control structures that can be represented by "series-parallel" task graphs, such as the fork-join structure, methodologies have been derived to predict the best speedup. When the task execution times are deterministic, Coffman and Denning [ 18] use critical path analysis to find the program completion time. If the task execution times are stochastic and the number of processors is infinite, the probability distribution of execution time can be determined by a straightforward but in general very costly computation [ 19], [ 20]. General acyclic random graph models are presented in [ 21], [ 22], [ 23]. Gelenbe [ 24] generalizes the task graph model to take into account the effects of communications.
For more realistic cases, where the number of processors is smaller than the number of tasks in a program, queuing network techniques can be used, where processors are modeled as service centers and parallel tasks are modeled as requests circulating in the system. All these approaches describe multiprogrammed and multitasked parallel systems executing a sequence of programs of similar task structure. The models are based on fork-join open queuing networks, where a stream of serial-parallel applications arrives at the system. Lui et al. [ 25] compute performance bounds for fork-join parallel programs. Baccelli and Liu [ 26] propose a class of models for homogeneous parallel systems executing applications with a general task structure. Balsamo et al. [ 27] describe some approximate models for the analysis of heterogeneous parallel systems. Apon and Dowdy [ 28] introduce a queuing network circulating-processor model that differs from previous models in that the processors (the requests) circulate among the processes (the resources). Qin and Baer [ 29] adopt a similar technique to model the performance of cluster-based architectures where each cluster is a shared-memory multiprocessor. All the above works focus mainly on finding analytical solutions to queuing network models with resource contention and synchronization barriers. However, the assumptions behind the models are better suited for multiprocessor systems sharing a central memory than for distributed memory systems, and they do not take communication and I/O contention into consideration. Moreover, only [ 29] addresses the modeling of monoprogrammed systems.
Very little has been done to model the performance of I/O operations in parallel applications. Most of the work focuses on the analysis of RAID (Redundant Arrays of Inexpensive Disks) systems [ 30] under different workload situations. Catania et al. [ 31] apply simulation techniques to study RAID architectures with dynamic declustering. They also provide some analytical lower bounds. Kotz [ 32] describes different implementations of collective I/O operations in MIMD architectures (simple parallel file system, two-phase I/O, and disk-directed I/O). Simulation is used to compare the performance of the different implementations.
3. Architecture and Application Models
Throughout the discussion, we consider a monoprogrammed and monotasked parallel system with a generic structure such as the one illustrated in Fig. 2. We assume the parallel system to be composed of $p$ homogeneous computational processors and $d$ homogeneous disks. Processors and disks are connected to each other via a generic communication network.


Fig. 2. General purpose parallel machine.




Improving the application model presented in [ 9], we consider an SPMD program as a sequence of $M$ phases. Each phase $m$ is composed of a sequence of $N(m)$ statistically identical cycles. Each cycle consists of a number of computation bursts followed by one I/O burst. The computation burst is, in turn, composed of one burst of CPU activity followed by one burst of communication, as shown in Fig. 1. If the SPMD application is well designed (i.e., load-balanced), the computation, communication, and I/O loads have the same order of magnitude for all the processors. Within each phase, we therefore assume that CPU, communication, and I/O bursts are statistically identical, i.e., that their durations are sampled from exponential distributions with average values $S^{\rm{{CPU}}}$, $S^{\rm{{COM}}}$, and $S^{\rm{{I/O}}}$, respectively. This assumption allows us to model most of the system by means of single-class queuing networks.
3.1 Communication Subsystem
Among different processor interconnection networks, two extreme situations can be identified: a fully interconnected system and a single-bus system. In the former case, no link contention arises in any communication. Messages exchanged between pairs of processors experience a simple delay regardless of the network usage. In the latter case, when two or more messages are exchanged simultaneously, there is contention for the use of the shared bus. Other interconnection networks, e.g., meshes or trees, represent trade-offs between the two cases. The impact of contention on communications also depends on the parallel application. For instance, a pipeline implemented on a parallel machine with a mesh network can be realized in such a way that communications do not suffer from contention.
Different types of interconnection networks lead to different ways of modeling communications. A shared bus can be modeled as a queuing server, since it is capable of handling one message at a time, while a fully interconnected network can be modeled as a delay center since messages never queue for a link. More realistic network topologies (e.g., 2D meshes, hypercubes, toruses) are more complex to model (they require the exact knowledge of the network routing mechanism).
We define the communication contention level $w$, a real number $0 \leq w \leq 1$, which allows us to switch from the bus interconnection network architecture ($w=1$) to the fully connected one ($w=0$). Intermediate values represent different network architectures such as meshes and trees. The value of $w$ depends on the communication hardware as well as on the algorithm implemented. However, it is difficult to predict the correct value of $w$, except for the boundary cases $w=0$ and $w=1$. When the value of $w$ is unknown, it can be estimated by fitting model results to experimental performance measurements, as shown in Section 5. Alternatively, the two boundary values $w=0$ and $w=1$ can be used to obtain an optimistic and a pessimistic model.
3.2 I/O Subsystem
Real parallel applications exploit I/O for different purposes [ 33]: checkpoints to reduce the cost of system failures, saving of simulation data for subsequent visualization or analysis, out-of-core computation when program data structures are larger than available memory, and data retrieval when the application involves the analysis of large amounts of data.
I/O parallelism consists of using several disks, with one or more controllers, and distributing data across the disks. In RAID systems, files are interleaved across an array of disks sharing a common bus and controlled by a single controller. A more efficient and flexible parallel disk architecture consists of a set of independent disks, each with a separate controller and connected to a distinct processor of the parallel machine. The individual devices in a parallel independent disk system could themselves be arrays of disks.
Usually, processors with disks are not loaded with computational tasks, but are dedicated to the handling of I/O activities. Such activities may include post-processing or visualization of output data. The I/O nodes often provide the user with the abstraction of a single, shared file system. For example, the Panda parallel file system [ 34] is designed for an environment where processors are full-time compute nodes or full-time I/O nodes. Several commercial machines exhibit this type of high-level architecture, including the Thinking Machines CM-5, the Intel Paragon, the IBM SP-2, and the Cray T3E [ 35], [ 36], [ 37].
According to the architecture of the I/O subsystem, we consider two families of models ( Fig. 3):

    BUS models: A single disk or a pool of disks (e.g., a RAID system) is connected via a single I/O node to the system.

    CLU models: Computational processors are logically or physically grouped (e.g., clustered) and each cluster contains one I/O node.

3.3 Synchronization Model
Synchronization occurs during communication and I/O activities. The performance cost of synchronization operations (e.g., synchronous communications, barriers, synchronous I/O, etc.) is a function of the number of processors involved. Within each phase, we assume that the number of processors participating in synchronization events (communications or I/O) does not change.


Fig. 3. I/O configurations considered: (a) BUS (b) CLU.




In order to capture the effects of synchronization during communications, we define the synchronization level $c$ as the number of processors ( $1 \leq c \leq p$) involved in a synchronous communication. According to this definition, processors can be thought of as grouped into $p/c$ synchronization groups. For instance, if $p=8$ and $c=2$, we have $p/c=4$ groups of processors, each group composed of $c=2$ processors. Processors belonging to the same group communicate with each other by means of synchronous send/receive operations. When $c=1$, all the processors perform asynchronous communications.
Studies on the characterization of parallel scientific file accesses [ 38] show that most I/O activities are fully synchronous or fully asynchronous. According to the type of I/O, we therefore construct two families of models:

    Synchronous I/O models (SIO): Processors perform I/O bursts synchronously (e.g., checkpoints and simulation data).

    Asynchronous I/O Models (AIO): Processors perform I/O bursts asynchronously (e.g., out-of-core computations).

3.4 Speedup Models
Throughout the rest of the paper, we study the performance of parallel programs composed of one phase (i.e., $M = 1$). The results can be extended to multiphase applications. We also assume that the duration of the phase (i.e., the number $N$ of cycles) is long enough to allow the system to reach a steady-state condition.
In this section, we develop a performance model for a program executed on a system composed of one processor and one disk. We use this model as a reference model. Let $p^{\rm{{I/O}}}$ be the probability of an I/O burst at the end of a computation burst ( $n^{\rm{{I/O}}} = 1/p^{\rm{{I/O}}}$ is the average number of computation bursts between two successive I/O bursts). Let $S^{\rm{{CPU}}}$ and $S^{\rm{{I/O}}}$ be the average service times of a CPU burst and an I/O burst, respectively. Fig. 4 depicts the closed queuing network corresponding to the reference model. A single job, representing the program in execution, circulates in the queuing network. The average response time of the circulating job is


$$T_{1}= S^{{\rm{CPU}}} n^{\rm{{I/O}}} + S^{\rm{{I/O}}}.$$

(1)


Note that $T_{1}$ is the average execution time of one cycle of the program. Throughout the paper, we use $T_{1}$ as a reference time in order to evaluate the speedup of different parallel programs and architectures.


Fig. 4. Queuing network model of a program executed on a single processor and disk.




Consider now a program executed on a parallel system with $p$ processors and $d$ disks. We can decompose the average execution time $T(p,d)$ into three terms


$$T(p,d) = N \left[ T^{\rm{{CPU}}}(p,d) + T^{\rm{{COM}}}(p,d) + T^{\rm{{I/O}}}(p,d) \right],$$

(2)


where $N$ is the number of cycles in the program (i.e., in the phase), and $T^{\rm{{CPU}}}(p,d)$, $T^{\rm{{COM}}}(p,d)$, and $T^{\rm{{I/O}}}(p,d)$ represent the average times of the CPU, communication, and I/O bursts, respectively.
The speedup $s$ can be expressed as


$$s(p,d) = {\frac{ T^{\rm{{CPU}}}(1,1) + T^{\rm{{I/O}}}(1,1) }{ T^{\rm{{CPU}}}(p,d) + T^{\rm{{COM}}}(p,d) + T^{\rm{{I/O}}}(p,d)}}.$$

(3)
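As a concrete illustration, the following Python sketch evaluates the reference cycle time of (1) and the speedup of (3). The function and argument names are ours; the burst times are assumed to come from the submodels derived below.

```python
def cycle_time_reference(s_cpu, s_io, n_io):
    """Average cycle time T_1 on one processor and one disk, per (1)."""
    return s_cpu * n_io + s_io

def speedup(t_cpu_11, t_io_11, t_cpu_pd, t_com_pd, t_io_pd):
    """Speedup s(p, d) of (3): burst times on one processor and one disk
    divided by the burst times modeled for p processors and d disks."""
    return (t_cpu_11 + t_io_11) / (t_cpu_pd + t_com_pd + t_io_pd)
```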


Fig. 5 shows the generic queuing networks used for modeling the parallel program. The blocks represent "submodels" corresponding to computation and I/O bursts. In AIO applications, the jobs representing the computation bursts perform I/O independently. In SIO applications, a pair of fork/join stations is added to the computation burst subsystem. When the jobs representing the computation burst activities are collected by the join station, they are replaced by a single job that corresponds to the I/O burst activity.


Fig. 5. Structure of the queuing network used for modeling parallel programs.




3.5 CPU Submodel
The relationship between computation load and number of processors is well approximated by Amdahl's law. The CPU load $S^{\rm{{CPU}}}$ can be divided into a parallel and a serial component, $S^{\rm{{CPU}}}_{\rm{par}}$ and $S^{\rm{{CPU}}}_{\rm{ser}}$, respectively. According to Amdahl's law, the CPU load on $p$ processors is $S^{\rm{{CPU}}}_{\rm{par}} / p + S^{\rm{{CPU}}}_{\rm{ser}}$ (the parallel component scales perfectly with the number of processors, while the serial component is not affected).
Fig. 6 shows the queuing network that integrates the CPU and communication submodels into a computation burst model. A job entering the subnetwork represents a group of $c$ processors communicating among themselves by means of synchronous communication operations. Each job proceeds to a set of delays that model the processor CPU time, then to the two service centers representing the communication network (the communication submodel). On average, a job performs $n^{\rm{{I/O}}}$ cycles before leaving the computation burst subnetwork. The maximum number of jobs circulating in the computation submodel is $p/c$.


Fig. 6. Subnetwork that simulates the computational burst.




The upper part of Fig. 6 refers to the CPU submodel. The processors are represented by $p/c$ delays and as many pairs of fork/join stations. A job arriving at the fork station is split into $c$ jobs, each of them representing a single processor computation. Each job is collected by a delay station (with service time $S^{\rm{{CPU}}}_{\rm{par}} / p + S^{\rm{{CPU}}}_{\rm{ser}}$) that represents pure CPU calculation. Finally, the jobs are recollected by the join station and, when all the jobs have been received, they are replaced with a single job representing the group of synchronized processors.
The time elapsed between the arrival of the job at the fork station and the departure from the join station is a random variable defined as $X = \max \{ X_1, \ldots, X_c \}$, where $X_1, \ldots, X_c$ are random variables exponentially distributed with the same mean $S^{\rm{{CPU}}}_{\rm{par}} / p + S^{\rm{{CPU}}}_{\rm{ser}}$, representing the $c$ processor delays. $X$ is then a hypoexponentially distributed random variable with mean given by [ 39]


$$E \left[X\right] = \left( {\frac{S^{\rm{{CPU}}}_{\rm{par}}}{p}} + S^{\rm{{CPU}}}_{\rm{ser}} \right) \sum_{i=1}^{c}{\frac{1}{i}}.$$

(4)


Therefore, the average response time of the fork/join pair can be written as


$$T^{\rm{{CPU}}} = h(c) \left( {\frac{S^{\rm{{CPU}}}_{\rm{par}}}{p}} + S^{\rm{{CPU}}}_{\rm{ser}} \right),$$

(5)


where $h(c)$ is the synchronization cost function


$$h(c) = \sum_{i=1}^{c}{\frac{1}{i}}.$$

(6)


The assumption of exponentially distributed times in the fork-join model can lead to pessimistic results when the variability of the work per processor of the real application is very low and the number of processors is small [ 40]. Different, less pessimistic distributions can be used by changing the synchronization cost function $h(c)$. An example is the uniform distribution with average values $S^{\rm{{CPU}}}_{\rm{par}}$ and $S^{\rm{{CPU}}}_{\rm{ser}}$; in this case, the cost function becomes $h(c) = 2 c / (c+1)$. However, if the distribution is different from exponential, the queuing network solution techniques presented in Section 4 cannot be used.
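The CPU burst cost above reduces to a simple sum; a minimal Python sketch of (5) and (6), with the uniform-distribution alternative mentioned in the text, could look as follows (function names are ours).

```python
def h_exponential(c):
    """Synchronization cost function (6): mean of the maximum of c i.i.d.
    exponential CPU bursts, expressed as a harmonic sum."""
    return sum(1.0 / i for i in range(1, c + 1))

def h_uniform(c):
    """Less pessimistic cost function for uniformly distributed bursts."""
    return 2.0 * c / (c + 1)

def t_cpu(p, c, s_cpu_par, s_cpu_ser, h=h_exponential):
    """Average CPU burst time T^CPU of (5) under Amdahl scaling of the load."""
    return h(c) * (s_cpu_par / p + s_cpu_ser)
```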
3.6 Communication Submodel
We introduce a communication scale function $g(p)$, which measures how the average amount of data transmitted by each processor scales during a communication burst. By definition, $g(1) = 0$. The choice of the scaling function $g(p)$ depends on the algorithm implemented but not on the target architecture. Through the scaling function $g(p)$, we are able to capture a broad class of parallel applications.
SPMD applications typically work with r-dimensional arrays that are distributed in a block fashion among processors. The amount of interprocessor communication depends on the hyperperimeter of the distributed data structure, while the computational load depends on the hypervolume. For instance, in low-level image processing, the working data structure is a two-dimensional matrix of pixels, partitioned into square windows among the processors. The size of the messages exchanged between pairs of processors is proportional to the length of the internal border between adjacent windows ( $1/ \sqrt{p}$), while the size of the overall communications is proportional to $\sqrt{p}$. We can generalize the idea for a generic number $r$ of dimensions in the application data space, obtaining


$$g(p) = {\frac{1}{p^{\frac{r-1}{r}}}}.$$

(7)


If the communication network has no contention, the communication service time $T^{\rm{{COM}}}$ can be well approximated by a linear relation


$$T^{\rm{{COM}}} = S^{\rm{{COM}}}_{0} + g(p) S^{\rm{{COM}}}_{\rm{R}},$$

(8)


where $S^{\rm{{COM}}}_{0}$ is the communication startup time and $S^{\rm{{COM}}}_{\rm{R}}$ is the reciprocal of the communication transfer rate.
The lower part of Fig. 6 shows the communication submodel, containing one delay station and one queuing station. The service time of the queuing station, $S^{\rm{{COM}}}_{{\rm{queue}}}$, takes into account the fraction of the transfer time affected by network contention


$$S^{\rm{{COM}}}_{{\rm{queue}}} = w g(p) S^{\rm{{COM}}}_{\rm{R}}.$$

(9)


The service time $S^{\rm{{COM}}}_{{\rm{delay}}}$ of the delay station is the sum of two components


$$S^{\rm{{COM}}}_{{\rm{delay}}} = S^{\rm{{COM}}}_{0} + (1-w) g(p) S^{\rm{{COM}}}_{\rm{R}}.$$

(10)


The first component is the startup time. The second component is the fraction of the transfer time that is not affected by the network contention.
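The split of the communication burst between the two stations can be written compactly; the sketch below implements (7), (9), and (10), treating $g(1)=0$ explicitly as stated above (function names are ours).

```python
def g_block(p, r):
    """Communication scale function (7) for an r-dimensional block-distributed
    array; g(1) = 0 by definition (no interprocessor communication)."""
    return 0.0 if p == 1 else p ** (-(r - 1) / r)

def comm_service_times(g_p, w, s_com_0, s_com_r):
    """Service times of the communication submodel: the queuing station (9)
    carries the contended fraction w of the transfer time; the delay
    station (10) carries the startup plus the uncontended fraction."""
    transfer = g_p * s_com_r
    return w * transfer, s_com_0 + (1.0 - w) * transfer
```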
3.7 I/O Submodel
It is difficult to provide a general discussion of parallel I/O because different parallel computers have radically different I/O architectures and hence parallel I/O mechanisms.
In SIO models, the I/O load depends on the number of disks but not on the number of processors. Amdahl's law can be applied to parallel file systems as a function of the number $d$ of disks


$$T^{\rm{{I/O}}} = S^{\rm{{I/O}}}_{0} + {\frac{ S^{\rm{{I/O}}}_{\rm{R}}}{ d }},$$

(11)


where $S^{\rm{{I/O}}}_{0}$ and $S^{\rm{{I/O}}}_{\rm{R}}$ are the serial and parallel parts of the I/O service time with respect to the number $d$ of disks.
In AIO models, I/O operations are divided into $p/c$ chunks. In BUS-AIO models, each chunk of I/O data is striped among the $d$ disks. The I/O service time of one chunk is given by


$$T^{\rm{{I/O}}} = S^{\rm{{I/O}}}_{0} + {\frac{ S^{\rm{{I/O}}}_{\rm{R}} / d }{ p/c }}.$$

(12)


In CLU-AIO models, processors are logically grouped into clusters and each cluster shares one I/O node. I/O clustering occurs mainly because of application design issues but can also be motivated by the parallel architecture configuration. Clustering occurs among the $p/c$ synchronization groups. Therefore, $p/c$ must be an integer multiple of $d$ (i.e., $p/c = k d$, where $k$ is the number of synchronization groups sharing one disk). The I/O service time for one chunk of I/O is given by


$$T^{\rm{{I/O}}} = S^{\rm{{I/O}}}_{0} + {\frac{ S^{\rm{{I/O}}}_{\rm{R}} }{ p/c }}.$$

(13)


We can summarize the I/O service time for different I/O architectures


$$T^{\rm{{I/O}}} = \left\{ \matrix{S^{\rm{{I/O}}}_{0} + S^{\rm{{I/O}}}_{\rm{R}} / d \hfill & {{\rm{SIO}}} \hfill\cr S^{\rm{{I/O}}}_{0} + {\frac{S^{\rm{{I/O}}}_{\rm{R}} / d }{p/c}} \hfill &{\rm{BUS-AIO}}\hfill \cr S^{\rm{{I/O}}}_{0} + {\frac{S^{\rm{{I/O}}}_{\rm{R}} }{p/c}} \hfill & {{\rm{CLU-AIO}}}.}\right.$$

(14)
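The three cases of (14) can be collected in a single helper; the following sketch mirrors the summary above (the string labels and function name are ours).

```python
def t_io(model, s_io_0, s_io_r, d, p, c):
    """I/O burst (or chunk) service time of (14) for the three configurations."""
    if model == "SIO":        # one synchronous burst striped over d disks
        return s_io_0 + s_io_r / d
    if model == "BUS-AIO":    # p/c asynchronous chunks, each striped over d disks
        return s_io_0 + (s_io_r / d) / (p / c)
    if model == "CLU-AIO":    # p/c chunks, each served by its own cluster disk
        return s_io_0 + s_io_r / (p / c)
    raise ValueError("unknown I/O model: " + model)
```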


4. Model Analysis
In this section, we describe how to integrate the submodels described in the previous sections and we develop and illustrate the algorithms required to evaluate the performance results.
4.1 SIO Model
The SIO model describes parallel algorithms performing synchronous I/O operations (see Fig. 7). In order to estimate the speedup of the SIO model, we first evaluate the time $T^{\rm{{CPU}}} + T^{\rm{{COM}}}$ spent by the program during the computation burst by computing the response time of the computation burst subnetwork.


Fig. 7. SIO queuing network model.




Let $z$ be the sum of all the delays belonging to the computation subnetwork


$$z = h(c) \left({\frac{S^{\rm{{CPU}}}_{\rm{par}}}{p}} + S^{\rm{{CPU}}}_{\rm{ser}}\right) + S^{\rm{{COM}}}_{0} + (1-w) g(p) S^{\rm{{COM}}}_{\rm{R}}.$$

(15)


The approximate response time for the computation fork-join subnetwork is


$$T^{\rm{{CPU}}} + T^{\rm{{COM}}} = n^{\rm{{I/O}}} \sum_{i=1}^{p/c}{\frac{z + R_1(i,z,S^{\rm{{COM}}}_{{\rm{queue}}})}{i}},$$

(16)


where $R_1(i,z,x)$ is the response time of a closed queuing network with population $i$ composed of a delay $z$ and a queuing station with service time $x$. The response time $R_1(i,z,x)$ is given by


$$R_1(i,z,x) = x {\frac{\sum_{(e_1,e_2) \in L_{i-1}^2}{\frac{e_1 + 1}{e_2!}}x^{e_1} z^{e_2}} {\sum_{(e_1,e_2) \in L_{i-1}^2}{\frac{1}{e_2!}} x^{e_1}z^{e_2}}},$$

(17)


where $L_i^2$ is the set of pairs $(e_1,e_2)$ of nonnegative integers such that $e_1 + e_2 = i$. A proof of (16) and (17) can be found in [ 41] and [ 42], respectively.
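Equations (16) and (17) are straightforward to evaluate numerically; a sketch of the computation, under the exponential assumptions of Section 3, is given below (function names are ours).

```python
from math import factorial

def r1(i, z, x):
    """Response time R_1(i, z, x) of (17): closed network with population i,
    one delay with service time z and one queuing station with service time x."""
    num = den = 0.0
    for e1 in range(i):          # (e1, e2) ranges over L_{i-1}^2: e1 + e2 = i - 1
        e2 = (i - 1) - e1
        term = (x ** e1) * (z ** e2) / factorial(e2)
        num += (e1 + 1) * term
        den += term
    return x * num / den

def t_comp_burst_sio(p, c, n_io, z, s_com_queue):
    """Approximate response time T^CPU + T^COM of the computation fork-join
    subnetwork, per (16)."""
    return n_io * sum((z + r1(i, z, s_com_queue)) / i
                      for i in range(1, p // c + 1))
```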
The response time for the I/O burst is given by (14)


$$T^{\rm{{I/O}}} = S^{\rm{{I/O}}}_{0} + S^{\rm{{I/O}}}_{\rm{R}} / d.$$

(18)


Upon substituting (16) and (18) into (3), we can evaluate the speedup $s(p,d)$.
4.2 BUS-AIO Model
The BUS-AIO model represents centralized I/O architectures and parallel applications characterized by asynchronous I/O. Since in this case we do not have any I/O synchronization, we remove the fork/join pair from the model ( Fig. 8). The BUS-AIO queuing network can be solved by means of the exact MVA algorithm. The queuing network is composed of a delay station and two queuing stations. The service time $z$ of the delay station is given by (15). The service time $S^{\rm{{COM}}}_{{\rm{queue}}}$ of the first queuing station is given by (9). The second queuing station represents the I/O burst and its service time is given by (14)


$$T^{\rm{{I/O}}} = S^{\rm{{I/O}}}_{0} + {\frac{S^{\rm{{I/O}}}_{\rm{R}} / d }{p/c}}.$$

(19)


By using MVA, we have


$$T^{\rm{{CPU}}} + T^{\rm{{COM}}} = z n^{\rm{{I/O}}} + R_2\left(p,z n^{\rm{{I/O}}}, S^{\rm{{COM}}}_{{\rm{queue}}} n^{\rm{{I/O}}}, T^{\rm{{I/O}}}\right)$$

(20)


and


$$T^{\rm{{I/O}}} = R_2\left(p,z n^{\rm{{I/O}}},T^{\rm{{I/O}}},S^{\rm{{COM}}}_{{\rm{queue}}}n^{\rm{{I/O}}}\right),$$

(21)


where $R_2(i,z,x,y)$ is the response time of a closed queuing network with population $i$ composed of a delay $z$ and two queuing stations with service times $x$ and $y$


$$R_2(i,z,x,y) = x {\frac{ \sum_{(e_1,e_2,e_3) \in L_{p/c-1}^3}{\frac{e_1 + 1}{e_3!}} x^{e_1} y^{e_2} z^{e_3}}{\sum_{(e_1,e_2,e_3) \in L_{p/c-1}^3}{\frac{1}{e_3!}} x^{e_1}y^{e_2} z^{e_3}}}$$

(22)


and $L_p^3$ is the set of triples $(e_1,e_2,e_3)$ of nonnegative integers such that $e_1 + e_2 + e_3 = p$. From (20) and (21), we can evaluate the speedup of the BUS-AIO model.
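A sketch of (22) follows. The printed formula sums over $L_{p/c-1}^3$; since the BUS-AIO network population is $p/c$, we assume this bound equals $i-1$ for a network with population $i$, mirroring the convention used for $R_1$ (our reading, not stated explicitly in the text).

```python
from math import factorial

def r2(i, z, x, y):
    """Response time R_2(i, z, x, y) of (22): closed network with population i,
    one delay z and two queuing stations with service times x and y."""
    num = den = 0.0
    n = i - 1                    # assumed to match L_{p/c-1}^3 when i = p/c
    for e1 in range(n + 1):
        for e2 in range(n + 1 - e1):
            e3 = n - e1 - e2
            term = (x ** e1) * (y ** e2) * (z ** e3) / factorial(e3)
            num += (e1 + 1) * term
            den += term
    return x * num / den
```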


Fig. 8. BUS-AIO queuing network model.




4.3 CLU-AIO Model
The model for the CLU-AIO architecture is a multiclass closed queuing network. The queuing network contains $d$ classes, each class representing the portion of the parallel application executed on one of the $d$ clusters of processors that share a common disk (see Fig. 9). As mentioned, we consider only the case $p/c = kd$. Because of the load-balance hypothesis stated in Section 3, the $d$ components of the population vector $\overrightarrow{n}$ are identical


$$\overrightarrow{n} = \left( {\frac{p}{cd}}, {\frac{p}{cd}}, \ldots, {\frac{p}{cd}} \right).$$

The I/O subsystem is composed of $d$ queuing stations operating in parallel, each station representing a disk. The service time of each disk station is given by (14)


$$T^{\rm{{I/O}}} = S^{\rm{{I/O}}}_{0} + {\frac{ S^{\rm{{I/O}}}_{\rm{R}}}{p/c}}.$$

The service times of the queuing network are represented by a $(d + 2) \times d$ matrix $G$


$$G = \left( \matrix{ T^{\rm{{I/O}}} & 0 & \ldots & 0 \cr 0 & T^{\rm{{I/O}}} &\ldots & 0 \cr \ldots & \ldots & \ldots & \ldots \cr 0 & 0 & \ldots & T^{\rm{{I/O}}} \cr S^{\rm{{COM}}}_{{\rm{queue}}} n^{\rm{{I/O}}} & S^{\rm{{COM}}}_{{\rm{queue}}} n^{\rm{{I/O}}} & \ldots & S^{\rm{{COM}}}_{{\rm{queue}}} n^{\rm{{I/O}}} \cr z n^{\rm{{I/O}}} & z n^{\rm{{I/O}}} & \ldots & z n^{\rm{{I/O}}}}\right),$$

where $S^{\rm{{COM}}}_{{\rm{queue}}}$ and $z$ are defined in (9) and (15), respectively. The CLU-AIO queuing network can be solved with the MVA algorithm.
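For concreteness, the sketch below assembles the population vector and the service time matrix $G$ that an MVA solver for the CLU-AIO network would take as input (the helper name and the use of NumPy are ours; the MVA solver itself is not shown).

```python
import numpy as np

def clu_aio_inputs(p, c, d, t_io, s_com_queue, z, n_io):
    """Population vector and (d + 2) x d service time matrix G for the
    CLU-AIO multiclass network (one class per cluster); requires p/c = k*d."""
    groups = p // c
    assert groups % d == 0, "p/c must be an integer multiple of d"
    population = np.full(d, groups // d)      # p/(c*d) jobs in each class
    G = np.zeros((d + 2, d))
    G[:d, :] = np.diag([t_io] * d)            # one disk station per cluster
    G[d, :] = s_com_queue * n_io              # shared communication queuing station
    G[d + 1, :] = z * n_io                    # delay: CPU plus uncontended transfer
    return population, G
```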


Fig. 9. CLU-AIO queuing network model.




5. Analysis of Real Applications
We now apply our model to analyze the behavior of real parallel applications, demonstrating that the methodology is a powerful abstraction to evaluate application scalability as a function of its resource requirements.
In Section 5.1, we apply the model to the performance analysis of BTIO, a parallel I/O intensive application from the NAS Parallel Benchmark (NPB) suite. The target architecture is the IBM SP-2. We investigate the structure of the algorithm and the architecture of the parallel machine in order to obtain the model parameters. A comparison between the results of the analytic model and the measurements obtained from the execution of the benchmark is presented.
In Section 5.2, we infer the inner characteristics of a parallel application by fitting the analytic model to observed speedup surfaces. The application considered (QCRD) is selected among those of the Scalable I/O Initiative.
5.1 The BTIO Application
In order to evaluate the usefulness and accuracy of the model, we compare the model results with the performance measurements of BTIO, a kernel benchmark from the NPB suite, as reported by Fineberg et al. in [ 43]. The NPB suite [ 44] consists of a set of seven kernels (BT, LU, FT, IS, EP, MG, and SP) extracted from typical computational fluid dynamics codes. Working together, the IBM T.J. Watson Research Center and the Parallel Systems Group at the NAS facility (NASA Ames Research Center) have drafted MPI-IO, a proposal to address the portable parallel I/O problem [ 45]. BTIO is the NAS kernel benchmark for MPI-IO. The BTIO kernel extends the original BT kernel by using MPI-IO to write data to a file at regular time intervals [ 46].
The BTIO (Block Tridiagonal) pseudoapplication simulates high-speed compressible airflow by solving a system of partial differential equations using the ADI method. BTIO simulates airflow in time steps: Air is modeled on a 3D grid of (velocity, temperature, pressure) values. For each time step, airflow is approximated by traversing the grid in the X, Y, and then Z dimension. For each traversal, all lines can be processed in parallel. The parallel implementation adopts a multipartition scheme that requires the number of processors $p$ to be a perfect square. The kernel executes 200 time steps and writes the solution matrix every five time steps.
The BTIO kernel comes in three possible sizes (classes A, B, and C), each class with a different grid size. The analysis described in the following subsection refers to a problem size of class A ( $64 \times 64 \times 64$ grid points).

5.1.1 Parameter Estimation
In order to calculate the parameters of the model (listed in Table 1), we need to measure some performance features that characterize the IBM SP-2. We use three simple low-level kernel applications, POLY1, COMMS1, and COMMS3, from the Parkbench benchmark suite [ 47].

Table 1. Model Parameters Summary


The POLY1 kernel executes a polynomial evaluation and can be used to calculate the CPU floating-point instruction rate. The CPU rate measured for the IBM SP-2 is $r^{\rm{{CPU}}} = 120$ MFlop/s. The analysis of the BTIO algorithm shows that each time step requires approximately 840 MFlop for a class A problem size (830 MFlop in the parallel fraction, 10 MFlop in the serial fraction). The $S^{\rm{{CPU}}}$ parameters can be computed as the ratio between the number of MFlop required by a CPU burst and the CPU performance rate $r^{\rm{{CPU}}}$. It follows that the serial and parallel CPU times are $S^{\rm{{CPU}}}_{\rm{ser}} = 0.08$ s and $S^{\rm{{CPU}}}_{\rm{par}} = 6.9$ s, respectively.

The communication kernel COMMS1 measures the basic communication properties of a parallel computer by ping-ponging a message of given length between two processors. The values obtained by COMMS1 are the startup time $t_{0}$ required to send a message and the communication transfer rate $r_{\infty}$ without contention. The measured values for the IBM SP-2 are $t_{0} = 150\ \mu{\rm s}$ and $r_{\infty} = 27$ MByte/s [ 48]. The $S^{\rm{{COM}}}_{\rm{R}}$ parameter is the reciprocal of the communication transfer rate: $S^{\rm{{COM}}}_{\rm{R}} = 1/r_{\infty} = 0.037$ s/MByte. The $S^{\rm{{COM}}}_{0}$ parameter can be calculated as the product of the startup time $t_{0}$ and the average number of messages sent by one processor during a communication burst. The analysis of the BTIO algorithm shows that the number of messages transmitted at each communication burst by one processor is proportional to $p^{1/2}$, while the average message size is proportional to $p^{-2/3}$. The scale function $g(p)$ is therefore equal to $p^{-1/6}$. The number of messages transmitted by each processor and their average size are reported in Table 2. Since the BTIO algorithm uses asynchronous communications, we choose $c = 1$.

Table 2. BTIO Characteristics (Values Refer to One Processor)


The communication contention level $w$ is the most difficult parameter to estimate, because its value depends on both the interconnection network and the communication pattern. However, it is possible to compute an upper bound for $w$ by means of the COMMS3 kernel. In the COMMS3 kernel, each processor sends a message to all the other processors, then waits to receive all the messages directed at it. The value obtained by COMMS3 is the total saturation bandwidth $r_{\rm{sat}}$, which corresponds to the maximum throughput of the communication submodel, $1 / ( w S^{\rm{{COM}}}_{\rm{R}} )$ [ 49]. The saturation bandwidth measured for the IBM SP-2 is $r_{\rm{sat}} = 120$ MByte/s [ 48], yielding $w \simeq r_{\infty}/r_{\rm {sat}}= 0.23$.

The BTIO implementation described in [ 43] was run on an IBM SP-2 with $d = 3$ I/O nodes, each node capable of 10 MByte/s throughput during write operations. The amount of data written by the kernel every five time steps ( $p^{\rm{{I/O}}} = 1/5$) is 10 MByte, yielding $S^{\rm{{I/O}}}_{\rm{R}} = 1$ s. During I/O bursts, BTIO uses a small number of large write operations; therefore, we neglect the I/O startup time ( $S^{\rm{{I/O}}}_{0} = 0$). Since the MPI-IO implementation described in [ 43] used synchronized I/O, we adopt the SIO model.
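Collecting the benchmark figures above, the SIO model parameters for BTIO class A on the SP-2 follow from a few lines of arithmetic; the variable names below are ours, and the values simply restate the measurements quoted in this subsection.

```python
# BTIO class A on the IBM SP-2: model parameters derived from the
# POLY1/COMMS1/COMMS3 measurements quoted above.
r_cpu = 120.0                    # MFlop/s (POLY1)
s_cpu_ser = 10.0 / r_cpu         # ~0.08 s: serial MFlop per time step / CPU rate
s_cpu_par = 830.0 / r_cpu        # ~6.9 s: parallel MFlop per time step / CPU rate

t_0, r_inf, r_sat = 150e-6, 27.0, 120.0   # s, MByte/s, MByte/s (COMMS1, COMMS3)
s_com_r = 1.0 / r_inf            # ~0.037 s/MByte
w = r_inf / r_sat                # ~0.23: upper bound on the contention level
# S^COM_0 is t_0 times the per-burst message count of Table 2 (not reproduced here).

d = 3                            # I/O nodes, 10 MByte/s each during writes
p_io = 1.0 / 5.0                 # one I/O burst every five time steps
n_io = 1.0 / p_io
s_io_r = 10.0 / 10.0             # 10 MByte written at 10 MByte/s -> 1 s
s_io_0 = 0.0                     # few large writes: startup time neglected
```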

The results of the model are presented in Table 3. The table shows a comparison between the experimental execution time and the execution time obtained from the SIO model. Note that the model overestimates the execution time with nine processors. This happens because the assumption of exponentially distributed CPU times leads to pessimistic results.

Table 3. BTIO Execution Time, Five Time Steps


5.2 Speedup Surfaces of Applications with Intensive I/O
In this section, we illustrate how the modeling technique can be used to study the inner characteristics of real parallel applications. The main goal of the section is not to predict the performance of a parallel application, but rather to analyze its scalability in order to find its major bottlenecks.
The application considered is selected among those of the Scalable I/O Initiative and is described in [ 9]. The Scalable I/O Initiative [ 6] is an effort to collect a suite of I/O intensive challenge scientific applications in order to design and evaluate policies for the management of parallel file systems. The experimental platform used is the 512-processor Intel Paragon XP/S with 64 4-GB Seagate disks attached to an I/O processor, at the Caltech Center for Advanced Computing Research. Performance measures are collected using Pablo [ 50], a performance analysis environment that provides trace data for the I/O and CPU requests of the parallel applications.
The application considered is QCRD (Quantum Chemical Reaction Dynamics), which solves the Schroedinger equation for the differential and integral cross sections of the scattering of an atom by a diatomic molecule. QCRD implements the method of symmetrical hyperspherical coordinates and local hyperspherical surface functions with a typical SPMD structure. All processors execute the same code on different portions of the data set, each of equal size, so as to keep the load balanced. The execution is divided into five consecutive stages that proceed in a pipeline fashion. The analysis focuses on stage 2 because it achieves reasonable speedups. During stage 2, each processor independently computes a subset of the integrals needed to evaluate the two-dimensional quadratures involving the primitive basis functions. Both read and write operations are performed by all processors.
Since in QCRD stage 2 the processors use asynchronous communications and perform I/O operations independently, we use the BUS-AIO model with $c = 1$. The analysis of the data partition scheme adopted in QCRD stage 2 leads to defining the communication scale function as $g(p) = 1/p$.
Given a set of 42 observed speedups $\bar{s}(p,d)$, we estimate the values of the remaining seven model parameters by means of a least-squares fit (the number of observations must be greater than or equal to the number of unknown parameters). The parameters are shown in Table 4. All the parameters but $w$ are normalized with respect to the application execution time on one processor and one disk, $T_1$.
The analysis of Table 4 shows that the application should be considered CPU intensive rather than I/O intensive, because $S^{\rm{{I/O}}}_{\rm{R}}$ and $S^{\rm{{I/O}}}_{0}$ are negligible with respect to $S^{{\rm{CPU}}} n^{\rm{{I/O}}}$. Moreover, the main performance bottleneck lies in the communication phase. In fact, $S^{\rm{{COM}}}_{\rm{R}} n^{\rm{{I/O}}}$ and $w$ are significant, and their overhead effects increase with the number of processors.

Table 4. QCRD Stage 2: Model Parameters


Fig. 10 shows the measured speedup (dashed line) versus the speedup obtained from the model (solid line). We define the average error as


$${\frac{1}{N_P}} \sqrt{\sum_{p,d}{\frac{\left[s(p,d)-\bar{s}(p,d)\right]^2}{\bar{s}(p,d)^2}}},$$

where $N_P$ is the number of experimental points in the speedup surface. We observe that the fitted model is a good match for the observed data (the average error being 0.2 percent).
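The fitting procedure itself is a standard nonlinear least-squares problem; a sketch using SciPy is shown below, where `speedup_model(p, d, theta)` stands for the BUS-AIO speedup evaluated for a parameter vector `theta` (the function names and initial guess are ours).

```python
import numpy as np
from scipy.optimize import least_squares

def fit_parameters(observed, speedup_model):
    """Fit the seven unknown model parameters to observed speedups.
    `observed` is a list of (p, d, measured_speedup) triples;
    `speedup_model(p, d, theta)` evaluates the model speedup (assumed given)."""
    def residuals(theta):
        return [(speedup_model(p, d, theta) - s_obs) / s_obs
                for p, d, s_obs in observed]
    theta0 = np.ones(7)                              # normalized starting point
    fit = least_squares(residuals, theta0, bounds=(0.0, np.inf))
    # Average error as defined above: (1/N_P) * sqrt(sum of squared relative errors).
    avg_error = np.sqrt(np.sum(fit.fun ** 2)) / len(observed)
    return fit.x, avg_error
```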


Fig. 10. QCRD stage 2: Experimental speedup surface (dashed line) versus analytical speedup surface (solid line).




6. Conclusion
In this paper, we have described a family of queuing network models for the performance analysis of parallel programs executed on different types of systems. The purpose of the models is to estimate the speedup of a parallel application through a few high-level parameters characterizing the application and the target architecture.
The underlying idea is that the performance of a parallel application strictly depends on the coupling between the application and the architecture characteristics.
From these models, we obtain the qualitative and quantitative behavior of programs that alternate computation and I/O in a cyclic fashion. The models allow us to study the impact of both communication and I/O contention, showing the dependence of the speedup on the number of processors and the number of disks in the parallel machine. Various aspects of communication and I/O have been analyzed, and different hardware architectures have been taken into consideration.
Moreover, we have shown that we can infer the programming characteristics of an application by fitting the model to observed speedups. This can help programmers analyze the scaling properties of an application with respect to the number of processors, number of disks, CPU performance, disk performance, and problem size. The fitting of experimental speedup surfaces produced very small errors; thus, it would be appropriate to use the estimated parameters for allocation and scheduling.
Future work is planned to extend the computation subsystem model in order to take into account the effects of severe load imbalance (by using multiclass queuing networks) and the effects of processor multiprogramming (by substituting queuing centers for the CPU delay centers).

    P. Cremonesi is with Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy. E-mail: cremones@elet.polimi.it.

    C. Gennaro is with Istituto di Elaborazione della Informazione, CNR, Pisa, Italy. E-mail: gennaro@iei.pi.cnr.it.

Manuscript received 5 Jan. 2000; revised 17 Aug. 2001; accepted 13 Feb. 2002.

For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number 111195.

1. EDITOR'S NOTE: This paper originally appeared in TPDS, Vol. 13, No. 7. Unfortunately, due to extraordinary circumstances, errors were introduced into the figure captions of the paper. We are reprinting the paper in its entirety here.

REFERENCES



Paolo Cremonesi received the MSc degree in aerospace engineering in 1992 and the PhD degree in computer science in 1996, both from the Politecnico di Milano, Milan, Italy. He is currently an assistant professor of computer science with the Politecnico di Milano. His research interests include high-performance computing modeling and other topics related to the performance evaluation of computer systems and networks.



Claudio Gennaro received the MSc degree in electronic engineering from University of Pisa in 1994 and the PhD degree in computer science from Politecnico di Milano in 1999. He is now a researcher with the National Research Council (CNR), Pisa, Italy. His current main research interests are performance evaluation of computer systems and parallel applications, similarity retrieval, and storage structures for multimedia information retrieval and multimedia document modeling.