
SEPTEMBER/OCTOBER 2005 (Vol. 7, No. 5) pp. 1423
15219615/05/$31.00 © 2005 IEEE
Published by the IEEE Computer Society
Published by the IEEE Computer Society
CrossSite Computations on the TeraGrid
Article Contents  
Grand Challenges  
Using Code  
Computation Algorithms  
Simulation Results  
Performance Results  
Conclusion  
References  
Download Citation  
Download Content  
PDFs Require Adobe Acrobat  
The TeraGrid's collective computing resources can help researchers perform verylargescale simulations in computational fluid dynamics (CFD) applications, but doing so requires tightly coupled communications among different sites. The authors examine ascaleddown turbulent flow problem, investigating the feasibility and scalability of crosssite simulation paradigms, targeting grand challenges such as blood flow in the entire human arterial tree.
Launched by the US National Science Foundation in August 2001, the TeraGrid ( www.teragrid.org) integrates the most powerful open resources in the US, providing 50 Tflops in processing power and 1.5 Pbytes of online storage connected via a 40Gbitspersecond network. Although the TeraGrid offers potentially unlimited scalability, the key question facing computational scientists is how to effectively adapt their applications to such a complex and heterogeneous network. Indeed, parallel scientific computing is currently at a crossroads. The emergence of the messagepassing interface (MPI) as well as domain decomposition algorithms and their corresponding freeware (such as ParMETIS ^{1} ) has made parallel computing available to the wider scientific community, allowing firstprinciples simulations of turbulence at very fine scales, ^{2} blood flow in the human heart, ^{3} and global climate change. ^{4} Unfortunately, simulations designed to capture detailed physicochemical, mechanical, or biological processes have also demonstrated widely varying characteristics. ^{5} ^{,} ^{6} Some applications are computation intensive, requiring extremely powerful computing systems, whereas others are data intensive, ^{7} necessitating the creation or mining of multiterabyte data archives.
Efficiently and effectively harnessing grid computing's power requires applications that can exploit ensembles of supercomputers, which in turn requires the ability to match application requirements and characteristics with grid resources. The challenges in developing gridenabled applications lie primarily in the high degree of system heterogeneity and the grid environment's dynamic behavior. For example, a grid can have a highly heterogeneous and unbalanced communication network whose bandwidth and latency characteristics vary widely over time and space. Computers in grid environments can also have radically different operating systems and utilities.
Grid technology—primarily in the form of Globusfamily services ( www.globus.org)—has largely overcome the difficulties in managing such a heterogeneous environment. With these services' uniform mechanisms for user authentication, accounting, resource access, and data transfer, users and applications can discover and utilize disparate resources in coordinated ways. In particular, the emergence of scientificapplicationoriented grid middleware, such as MPICHG2, ^{8} has significantly spared computational scientists from lowlevel details about communication handling, network topology, resource allocation, and management. However, in spite of these advancements, devising efficient algorithms for computational fluid dynamics (CFD) applications that can exploit the TeraGrid's scalability remains an enormously challenging problem. In this article, we'll present computations performed on the TeraGrid machines across a continent and show that grid computing can pave the way for the solution of future grandchallenge problems in biological and physical sciences.
Grand Challenges
Our work is motivated by two grandchallenge problems in biological and physical sciences: the simulation of blood flow in the entire human arterial tree, and the direct numerical simulation (DNS) of bluffbody turbulent wake flows. Both problems are significant from both fundamental and application viewpoints, and their resolution will have profound scientific and societal impacts.
The human arterial tree simulation problem originates from the widely accepted causal relationship between blood flow and the formation of arterial diseases such as atherosclerotic plaques. ^{13} ^{–} ^{15} These disease conditions seem to preferentially develop in separated and recirculating flow regions such as arterial branches and bifurcations. Bloodflow interaction in the human arterial system can occur on widely different scales; it can also occur on similar scales in different regions of the vascular system. At the largest scale, the human arterial system is coupled via the wavelike nature of the pulse information traveling from the heart through the arteries. Surgical interventions, such as bypass grafts, can block the system and alter the wave reflections, which in turn can modify the flow waveforms at seemingly remote locations. Subsequently, the modification of a local waveform can lead to the onset of undesirable stresses on the arterial walls, possibly starting another pathological event.
The challenge of modeling these interactions lies in the demand for supercomputing to model the threedimensional (3D) unsteady fluid dynamics within the arterial branches. What makes this type of application amenable to TeraGrid computing is that we can reasonably model the waveform coupling between the sites of interest with a reduced set of 1D equations that capture the crosssectional area and sectional velocity properties. ^{9} We can thus simulate the entire arterial tree by using a hybrid approach based on a reduced set of 1D equations for the overall system and detailed 3D NavierStokes equations for arterial branches and bifurcations.
To capture the flow dynamics in an artery bifurcation reasonably well, the grid resolution typically requires a mesh of 70,000 to 200,000 highorder finite elements (spectral elements with a polynomial order of 10 to 12 on each element). ^{10} The human arterial tree model in Figure 1 a contains the largest 55 arteries in the human body with 27 artery bifurcations. The inclusion of all 27 bifurcations in the simulation with the grid resolution we just descried requires a total memory of 3 to 7 Tbytes, which is beyond any single supercomputer's current capacity. The TeraGrid is the only possible way to accommodate such a simulation.
The second problem, DNS of the "drag crisis" in turbulent bluffbody flows, is a fundamental grandchallenge problem in fluid dynamics. (The drag crisis is the sudden drop of drag force near Reynolds number Re = 300,000; see Figure 1 b) The need to resolve all the energetic scales in the DNS down to the Kolmogorov scale dictates that the number of grid points should be on the order of Re ^{9/4}, or roughly a trillion grid points at dragcrisis conditions. Concentration of turbulence in the bluffbody wake and nonuniform meshing effectively reduces the required number of grid points to a few billion. The appropriate mesh now consists of approximately 512 to 768 Fourier modes along the cylinder axis and 50,000 to 80,000 spectral elements in the nonhomogeneous planes, with a spectral polynomial order of 6 to 10 on each element. A monolithic simulation with such resolutions requires more than 4 Tbytes of memory, exceeding any open supercomputer's current capacity. Just as in the human arterial tree simulation, the TeraGrid is the only viable option.
These extremely large biological and physical simulations share a common characteristic: the solution process requires tightly coupled communications among different TeraGrid sites. This is in sharp contrast to other grid application scenarios in which a monolithic application runs on one grid site while the data it produces is moved to another site for visualization or postprocessing (for example, the TeraGyroid project, www.realitygrid.org/TeraGyroid.html). A big question here is the scalability of an application involving multiple TeraGrid sites and the slowdown factor of crosssite runs compared to singlesite runs under otherwise identical conditions.
Using Code
To investigate these issues and the feasibility of crosssite runs on the TeraGrid, we'll look at a scaleddown version of the dragcrisis problem—specifically, we'll use the simulation of turbulent flow past a circular cylinder at lower Reynolds numbers ( Re = 3,900 and 10,000) as a prototype. We'll employ Fourier spectral expansions in the homogeneous direction and a spectral element discretization in the nonhomogeneous planes to efficiently handle a multiconnected computational domain. ^{10}
To conduct such a DNS, we use a highorder CFD code called Nektar ( www.nektar.info/2nd_edition/) in our computations. It employs a spectral/hp element method to discretize in space and a semiimplicit scheme to discretize in time. ^{10} The mesh consists of structured or unstructured grids (or a combination of both) similar to those used in standard finite element and finite volume methods. We employ Jacobi polynomial expansions to represent flow variables; these expansions provide multiresolution, a way to hierarchically refine numerical solutions by increasing the order of the expansion ( prefinement) within each element without needing to regenerate the mesh, thus avoiding a significant overhead cost.
We can use MPICHG2 for the crosssite communication: it's a Globusbased MPI library that extends MPICH to use the Globus Toolkit's services. ^{8} During the computation, MPICHG2 selects the most efficient communication method possible between two processes, using vendorsupplied MPI whenever available. MPICHG2 uses information in the Globus Resource Specification Language (RSL) script to create multilevel clustering of the processes based on the underlying network topology; it stores this information as attributes in the MPI communicators for applications to retrieve and exploit.
Computation Algorithms
Two crosssite parallel algorithms based on different data distribution strategies can both minimize the number of crosssite communications and overlap crosssite communications with insite computations and communications. Both algorithms are designed based on a twolevel parallelization strategy. ^{11 } Let's consider the turbulent flow past a circular cylinder. The following incompressible NavierStokes equations govern the flow:
where x _{i} ( i = 1, 2, 3) are the interchangeable coordinates used with x, y, z; u _{i} ( i = 1, 2, 3) are the three velocity components; and p is pressure. We normalize all length scales with the cylinder diameter D, all velocity components with the freestream velocity U _{0}, and the pressure with (where ρ is the fluid density). The Reynolds number Re is expressed by
where v is the kinematic viscosity of the fluid. We refer to the term
as the nonlinear term in the following discussions. Because the flow is homogeneous along the cylinder axis (assumed to be the zdirection), we perform the following Fourier expansion for the three velocity components and the pressure (letting f denote one of these variables):
Fourier ModalBased Algorithm
The Fourier modalbased algorithm distributes different groups of Fourier modes onto different TeraGrid sites. It's based on the observation that a physical variable's different Fourier modes (three velocity components and the pressure) are decoupled except when evaluating the fast Fourier transform (FFT) as we compute the nonlinear terms in the NavierStokes equations. At each site, we compute a subset of the Fourier modes for all physical variables. As a result, solutions of any physical variable on different sites are largely independent. Coupling among different sites (hence crosssite communication) only occurs in the transposition of distributed matrices—an alltoall type of communication—when evaluating the FFT during nonlinear term calculation.
The application takes special care to minimize the crosssite latency impact and improve crosssite bandwidth utilization. It uses the network topology information in MPICHG2's initial communicator to enforce the data distribution strategy and ensure that, in the twolevel parallelization, computations within nonhomogeneous planes involve processors from the same site only. ^{11} A special implementation avoids unnecessary Transmission Control Protocol (TCP) polling for MPI_ANY_SOURCE exchanges on communicators involving insite communications only. ^{8} We also agglomerate the data of different physical variables such that we perform a single crosssite matrix transposition instead of several separate transpositions for different variables. Therefore, we only need two crosssite communications (one forward and one backward transform) when computing nonlinear terms. Compared to the usual approach of performing the FFTs of different physical variables separately, data agglomeration minimizes the number of crosssite communications and increases each message's size, thus reducing the latency effect. The larger message size also improves crosssite bandwidth utilization.
Physical VariableBased Algorithm
The physical variablebased algorithm computes physical variables on different TeraGrid sites and exploits the coupling characteristics among them in the NavierStokes equations. Computations of various velocity components are independent except for their interdependence in the nonlinear term. A mutual dependence exists between the velocity and the pressure, for example: computation of pressure depends on both the velocity divergence and the nonlinear terms and velocity gradients on the boundaries; computation of velocity depends on the pressure gradient.
For simplicity, let's assume we're using three TeraGrid sites; in the physical variablebased algorithm, we compute all of a velocity component's Fourier modes along with a third of the pressure Fourier modes on a different site. The computation will involve three crosssite communications, with the first occurring prior to the nonlinear solve. (Here, the nonlinear solve—and the pressure and velocity solves in subsequent discussions—refers to a threestep timeintegration scheme. ^{10} ) Each site must communicate its own velocity component to—and receive other velocity components from—the other sites for nonlinear term calculation. With Nektar's twolevel parallelization, ^{11} a processor at one site communicates only with the other two sites' corresponding processors, and thus crosssite communication involves three processors. Different processors at the same site participate in parallel independent communications.
The second crosssite communication, a SUM reduction for velocity divergence and pressure boundary conditions, occurs prior to the pressure solve. Again, a processor participates in the reduction only with corresponding processors at the other two sites (different processors at the same site participate in parallel independent reductions). The third crosssite communication occurs prior to the velocity solve: it distributes the pressure gradient data to the other sites and receives the pressure gradient component that it computes from the other sites.
Simulation Results
We've obtained a large amount of simulation results, but this article's emphasis is on the computing aspect (algorithm and performance), so we've chosen to include only one comparison with experiment here. Specifically, let's examine some simulation results for turbulent flows past bluff bodies obtained on the TeraGrid clusters. We can compare the statistical characteristics in the turbulent wake of a circular cylinder at Reynolds number Re = 10,000 between our 3D DNS and the particleimagevelocimetry (PIV) experiments. ^{12} This Reynolds number is the highest that the DNS has achieved for this flow so far.
Figure 2 shows a comparison of the normalized streamwise rootmeansquare (RMS) velocity fluctuation u′ normalized by the freestream velocity U _{0} between the experiment ( Figure 2 a) and the simulation ( Figure 2 b). We plotted experimental and DNS results on identical contour levels, with a minimum RMS value u′/ U _{0} = 0.1 and an incremental value of 0.025 between contour lines. The distribution patterns show strong fluctuations in the separating shear layers and two maxima associated with the vortex formation. The downstream locations of the RMS maxima are essentially the same for both experiment and simulation (at x = 1.14 for the experiment and at x = 1.13 for the simulation), and the respective peak values are also the same: u′ _{max}/ U _{0} = 0.5.
At Reynolds numbers in the subcritical range (above 1,000 and below the "drag crisis" Reynolds number), the separating shear layers behind the cylinder become unstable and smallscale vortices develop in the shear layers (socalled shearlayer vortices). Our simulation data shows that the frequency of shearlayer vortices follows a scaling of Re ^{0.67} with respect to the Reynolds number, in agreement with experimental observations. ^{17} Furthermore, the values of shearlayer frequencies from our computation agree with the experimental values; for example, at Re = 10,000, our simulation predicts a value (normalized by Strouhal frequency) of 11.83 whereas it's 11.25 from experiment. ^{17}
Performance Results
To evaluate the efficiency of computation algorithms, we conducted singlesite runs on three TeraGrid sites—the US National Center for Supercomputing Applications (NCSA), the San Diego Supercomputer Center (SDSC), and the Pittsburgh Supercomputing Center (PSC). We also ran crosssite runs between the SDSC and the NCSA, and between the two TeraGrid machines at the PSC. The NCSA and SDSC TeraGrid machines have Intel IA64 processors (Itanium2, 1.5 GHz) whereas those at the PSC have Compaq Alpha processors (Alpha EV68, 1 GHz).
SingleSite Performance
Figure 3 a demonstrates Nektar's scalability, showing parallel speedup with respect to the number of processors on the PSC TeraGrid cluster for a fixed problem size with 300 million degrees of freedom. The parallel efficiency exceeds 95 percent on 1,024 processors and 85 percent on 1,536 processors. The test problem here is the turbulent cylinder flow at Reynolds number Re = 10,000 based on the freestream velocity and cylinder diameter. We used a spectral element mesh with 9,272 triangular elements in the nonhomogeneous planes; the number of Fourier planes in the spanwise direction is 256 in this test.
Scalability for a fixed workload per processor is another important measure. In this set of tests, as the problem size increases, the number of processors increases proportionally such that the workload on each processor remains unchanged. The test problem is still the turbulent cylinder flow at Re = 10,000; in the nonhomogeneous plane, we use the same grid resolution, but vary the problem size by varying the number of Fourier planes in the spanwise direction. In this test, the number of Fourier planes increases from 8 to 128, and the number of processors increases proportionally from 32 to 512 to keep the workload per processor constant. For the case of 512 processors (128 Fourier planes), we use 64 processors for computations in the homogeneous direction and eight processors for computations in the twolevel parallelization's nonhomogeneous planes. Figure 3 b shows the wallclock time per step (in seconds) as a function of the number of processors from the test. The ideal result would be a constant wallclock time per step for any number of processors (flat curve), but we see (only) a slight increase in wallclock time as the number of processors increases from 32 to 512, indicating good scalability.
We used native (vendor) MPI libraries in all these tests, but because crosssite communications are based on the MPICHG2 library, we also want to test the performance differences between MPICHG2 and the native MPI implementation in a singlesite environment. MPICHG2 hides lowlevel operational details from the application, including communication channel selection (vendor or TCP), data conversion, resource allocation, and computation management. MPICHG2 has taken measures such as eliminating memory copies and unnecessary message polling to minimize the overhead cost. ^{8}
Figure 4 shows a performance comparison of Nektar compiled with MPICHG2 and native MPI on the SDSC's TeraGrid cluster (we expect the results would be valid for all TeraGrid machines, particularly the ones that have identical compute and switch hardware). The test problem is the turbulent cylinder flow at Reynolds number Re = 3,900. We plot the wallclock time per step as a function of the total number of processors for a fixed problem size, and we use a spectral element mesh with 902 triangular elements in the nonhomogeneous planes and 128 planes in the spanwise direction; the polynomial order is 8 on all elements. MPICHG2 demonstrates a performance virtually identical to the native MPI, indicating a negligible overhead cost.
Figure 5 compares the performances of the Fourier modalbased and physical variablebased algorithms discussed earlier—specifically, it compares the wallclock time per step and the speedup factor with respect to the number of processors for a fixed problem size with 24 Fourier planes in the spanwise direction. Due to the configuration of the physical variablebased algorithm, the number of processors tested ranges from 3 to 96, and the parallel speedup is computed based on the wallclock time on three processors. The Fourier modalbased algorithm consistently performs better, most likely due to the data agglomeration and reduced number of crosssite communications.
CrossSite Performance
To assess the efficiency on crosssite platforms, we use the Fourier modalbased crosssite algorithm to conduct a series of crosssite runs with Nektar and MPICHG2 on the SDSC and NCSA TeraGrid machines.
Our test problem was the turbulent flow past a cylinder at Reynolds number Re = 3,900. As we did earlier, we use spectral element mesh with 902 triangular elements in the nonhomogeneous planes (with a spectral element order of 8 on all elements); the number of Fourier planes in the spanwise direction varies from 16 to 128. We first investigate the scaling for a fixed problem size with 128 Fourier planes in the spanwise direction. Figure 6 shows the wallclock time per step (in seconds) as a function of the total number of processors for crosssite runs between the NCSA and SDSC TeraGrid machines, together with results for singlesite runs on the NCSA machine under identical configurations. The total number of processors varies from 16 to 256. For crosssite runs, the processors are split between the NCSA and SDSC TeraGrid machines—in a 256CPU crosssite run, for example, 128 processors are from both the NCSA and the SDSC. We use MPICHG2 in both singlesite and crosssite runs; we performed at least three independent runs for each case. In singlesite runs, the wallclock time essentially shows a linear relationship with respect to the number of processors, indicative of a nearlinear speedup.
In crosssite runs, the wallclock time decreases significantly with the increasing number of processors. In fact, the wallclock time–CPU curve shows a dramatic decrease, nearly an order of magnitude, as the number of processors increases from 32 to 64. To check if software errors cause this performance jump, we took special care to ensure that we obtained identical, correct computation results in all test cases, including those on 32 and 64 processors, and that we conducted several independent runs for each case (at least three for smaller CPU counts, at least five for larger ones). We conducted the tests at a special reserved time for both machines, with exclusive access to about a third of the TeraGrid machine at the NCSA and the whole TeraGrid machine at the SDSC. The performance results are repeatable, with only slight variation in exact values ( Figure 6 shows the mean values). We're convinced that these aren't spurious data points. The exact reason for this performance jump isn't totally clear at this point. It can result from several factors such as the communication characteristics of the network connecting the NCSA and SDSC machines. As expected, a crosssite run is slower than the corresponding singlesite run on the same total number of processors. The slowdown ratio, however, decreases dramatically as the number of processors increases. Beyond 32 processors, the slowdown ratio of the crosssite runs ranges from 1.5 to 2.0.
Now let's look at the scaling for a fixed workload per processor. The problem size is varied by changing the number of Fourier planes in the spanwise direction. In this set of tests, we started with eight Fourier planes in the spanwise direction, and doubled the number for each test until we reached 128. Correspondingly, we increased the total number of processors proportionally, from eight to 128, such that the workload on each processor remained unchanged. Figure 7 plots the wallclock time per step (in seconds) as a function of the total number of processors for crosssite runs between the NCSA and SDSC TeraGrid machines, as well as results for singlesite runs on the NCSA machine under identical configurations. In crosssite runs, again, half the processors are from the NCSA and the other half are from the SDSC; we used MPICHG2 in both cross and singlesite runs. In singlesite runs, the wallclock time increases very slightly as the number of processors increases from eight to 128, indicating excellent scalability. In crosssite runs, we observe a larger increase in wallclock time as the number of processors increases from eight to 32. Again, we see a dramatic decrease in wallclock time as the number of processors increases from 32 to 64. Compared to singlesite runs on the same number of processors, the slowdown ratio of crosssite runs decreases significantly beyond 32 processors.
We also examined the influence of processor configuration on crosssite run performance. Table 1 lists the wallclock time per step on a total of 256 processors in crosssite runs between the NCSA and SDSC machines with a fixed problem size for turbulent cylinder flow at Re = 10,000. We tested several different configurations with a different number of processors from each site, but the wallclock timing for different configurations is essentially the same; we didn't observe any significant influence of processor configuration on performance.
Table 1. The performance of crosssite runs for flow past a cylinder at Reynolds number Re = 10,000.
Conclusion
Compared to singlesite runs, the slowdown ratio of crosssite runs decreases dramatically as the number of processors increases. We performed the crosssite runs with a Fouriermodal based algorithm, which is characterized by a stressful alltoall type of crosssite communication and in a sense represents the worstcase scenario. For applications characterized by less stressful communication patterns, we can achieve even better performance for crosssite runs. The human arterial tree simulation shows a much less demanding communication characteristic; we're currently developing and testing this application in crosssite computations on the TeraGrid as well as on machines between the US and UK. Several techniques can further boost crosssite performance to possibly even match singlesite performance—for example, multithreading to truly overlap crosssite communications with insite computations, and User Datagram Protocol (UDP)based messaging to improve crosssite communication's bandwidth utilization. Developers are currently incorporating these approaches into the nextgeneration implementation of MPICHG2. Grid computing enabled by Globus/MPICHG2 and other grid services and middleware (see NaReGi, www.naregi.org; DEISA, www.deisa.org; and PACXMPI16) hold the key to the solution of grandchallenge problems in biological and physical sciences.
The US National Science Foundation, the Office of Naval Research, and the Defense Advanced Research Projects Agency (DARPA) supported this work. Computer time for the TeraGrid was provided through the US National Center for Supercomputing Applications (NCSA), the San Diego Supercomputer Center (SDSC), and the Pittsburgh Supercomputing Center (PSC). We thank John Towns (NCSA), David O'Neal (PSC), Donald Frederick (SDSC), Frank Wells (NCSA), Dongju Choi (SDSC), and the TeraGrid support team for their assistance in setting up the TeraGrid for the benchmark tests.
References
Suchuan Dong is a research assistant professor in the Division of Applied Mathematics at Brown University. His research interests include highperformance computing, flow structure interaction, biofluids and turbulence simulations, and bubbly flows. Dong has a PhD in mechanical engineering from the State University of New York at Buffalo. He is a member of the American Society of Mechanical Engineers, the American Institute of Aeronautics and Astronautics, and the American Physical Society. Contact him at sdong@dam.brown.edu.
George Em Karniadakis is a professor of applied mathematics at Brown University. His research interests include spectral methods in unstructured grids, parallel simulations of turbulence in complex geometries, and microfluid simulations. Karniadakis has a PhD in mechanical engineering from MIT. He is a fellow of the American Physical Society and the American Society of Mechanical Engineers, and a member of the American Institute of Aeronautics and Astronautics and SIAM. Contact him at gk@dam.brown.edu.
Nicholas T. Karonis is an associate professor at Northern Illinois University and a resident associate guest of Argonne National Laboratory's Mathematics and Computer Science Division. His current research interest is messagepassing systems in computational grids. Karonis has a BS in finance and a BS in computer science from Northern Illinois University, an MS in computer science from Northern Illinois University, and a PhD in computer science from Syracuse University. Contact him at karonis@niu.edu.
 x  