A Parallel Cosmological Hydrodynamics Code

Paul W. Bode
Massachusetts Institute of Technology
Dept. of Physics, Rm. 6-216, Cambridge, MA 02139
bode@alcor.mit.edu
http://arcturus.mit.edu/~bode/

Guohong Xu
University of California at Santa Cruz
Board of Studies in Astronomy and Astrophysics, Santa Cruz, CA 95064
xu@bigdog.ucolick.org
http://www.astro.princeton.edu/~xu/

Renyue Cen
Princeton Univ.
Peyton Hall, Princeton, NJ 08544
cen@astro.princeton.edu
http://www.astro.princeton.edu/~cen/

Abstract:
Understanding the formation by gravitational collapse of galaxies and the large-scale structure of the universe is a nonlinear, multi-scale, multi-component problem. This complex process involves dynamics of the gaseous baryons as well as of the gravitationally dominant dark matter. We discuss an implementation of a parallel, distributed memory, cosmological hydrodynamics code using MPI message passing; it is based on the Total-Variation-Diminishing [10] serial code of Ryu et al. [16]. This parallel code follows the motion of gas and dark matter simultaneously, combining a mesh-based Eulerian hydrodynamics code and a Particle-Mesh N-body code. A new, flexible matrix transpose algorithm is used to interchange distributed and local dimensions of the mesh. Timing results from runs on an IBM SP2 supercomputer are given.

Keywords:
parallel applications, parallel computers, algorithms, MPI, large-scale structure, X-ray clusters, cosmology

1 Introduction

In the most widely favored models of cosmology the structures seen in the universe today formed because of self-gravity: initially very small perturbations in matter and radiation density were gravitationally amplified into galaxies, clusters of galaxies, and large-scale structure. By following how an initial set of primordial fluctuations grows in time one can give testable predictions for the distribution of matter today. Given the complex non-linear dynamics of gravitational collapse, simulations are a frequently used tool for making these predictions, particularly since the underlying physics is well understood and the problem is numerically well-defined [1]. See Figure 1a and Movie 1 for an example of such a simulation.

Of particular interest are the number and properties of clusters of galaxies. A large cluster can contain thousands of galaxies, as well as hot gas distributed throughout the cluster which is visible in the X-ray (cf. [17]). It has become clear that this luminous matter alone cannot produce the observed dynamics in clusters, so there must be another, non-luminous component. This ``missing mass'' is some form of collisionless dark matter--- see [21] for a review of possible candidates. Around a third of a typical cluster mass is in the X-ray emitting gas, with about 5% in stars and galaxies and the rest in dark matter (e.g. [4]). Cluster numbers and masses (and hence X-ray luminosities) depend sensitively on the assumed cosmological model; thus predictions concerning cluster formation can provide a discriminant between models ([15], [22]).

Another issue arises on smaller scales. Spectra from distant quasars show absorption lines from intervening clouds of gas; the origin of these ``Lyman-alpha'' absorption systems is still unclear (see [3] or [18] for a review of the observations). A number of numerical calculations have recently been carried out which are providing insights into the formation and structure of these clouds ([2], [11], [14], [23]). These models will provide another means of constraining cosmological theories.

A variety of hydrodynamical methods have been devised for studying cosmological structure formation (cf. [20]). A recently developed algorithm is the TVD code of [16]. This code follows both collisionless dark matter (which interacts via gravity only) and baryons (the gas, affected by both gravity and gas pressure). The gas is represented by a fixed grid; hydrodynamical equations are solved with the Total Variation Diminishing (TVD) method [10]. Dark matter is represented by particles which move with respect to the grid, and the force of gravity is found using the Particle-Mesh (PM) method (cf. [12]).

In this paper we describe a parallel, distributed memory version of the TVD code which uses Fortran 77 with MPI [9] message passing, and give timing results from running the code on the IBM SP2. In Section 2 we describe how the gas is dealt with; since the code uses a fixed grid this part is relatively straightforward. Section 3 deals with making the PM solver parallel, which is harder to do in an efficient way; our implementation follows that of Ferrell and Bertschinger [5]. Timing and scaling results are given in Section 4.

2 Hydrodynamics

Structure formation simulations follow the collapse of regions which are initially only slightly denser than average and their growth through accretion of surrounding material. Thus the hydrodynamic code must be able to follow both extremely supersonic flows, as gas falls into gravitational potential wells, and the strong shocks that form when this flow collides with previously accreted gas (for an example see Movie 2 and Figure 1b).

The TVD scheme solves a hyperbolic system of conservation equations (conservation of mass, momentum, and energy or pseudo-entropy). It is an explicit, second-order Eulerian finite difference scheme; it is shock-capturing, so one need not introduce artificial viscosity terms into the equations of motion. Rather than going into the details of the algorithm, we will refer the reader to [16] and [10] for a thorough description.

2.1 Data Structure

The TVD equations are solved on a fixed grid; the length of the grid in each direction is given by n_x, n_y, and n_z. The relevant gas variables are the density, the three components of the velocity, and the temperature (or equivalently the pressure). In the serial code these are contained in a single array covering the whole mesh, with two extra buffer planes on each face; periodic boundary conditions are assumed, so the buffer planes on one face hold copies of the planes just inside the opposite face. The extra buffers are useful because in the 1-D case, solving the equations for a given point requires data from the two points on either side of it. The 3-D version of the TVD code is obtained from the 1-D version by a Strang [19] type dimensional splitting.

The most straightforward way of splitting up this array for a distributed memory computer is a block distribution in the z-direction. Each processor will have n_z/p z-planes: process 0 has planes 1 through n_z/p, process 1 the next n_z/p planes, and so on. In this case all computations done in the x and y directions are completely local and (since the same equations are being solved for every point) load balanced. It might seem desirable to keep the boundary planes in the z-direction as well, but this would use too much memory when performing the large simulations we require. For example, in a large run on 128 nodes each node may hold only a handful of z-planes; adding four boundary planes would nearly double the memory requirement. Thus each node declares an array of size 5 x n_x x n_y x (n_z/p), with no boundary planes in the z-direction.
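
For concreteness, here is a minimal sketch of the per-node array and the plane ownership implied by this block distribution; the array name q, the ownership printout, and the 512^3 mesh on 128 nodes are illustrative assumptions rather than the actual code.

c     Illustrative sketch of the per-node gas array and z-plane
c     ownership implied by the block distribution (all names and the
c     512^3 / 128-node sizes are examples, not the actual code).
      program layout
      integer nx, ny, nz, p, nzloc, irank, kfirst, klast
      parameter (nx=512, ny=512, nz=512, p=128, nzloc=nz/p)
c     five gas variables (density, three velocities, temperature) per
c     cell; no buffer planes are kept in the distributed z-direction
      real q(5, nx, ny, nzloc)
c     node irank owns global z-planes kfirst through klast
      do 10 irank = 0, p-1
         kfirst = irank*nzloc + 1
         klast  = (irank+1)*nzloc
         write(*,*) 'node', irank, ' owns z-planes', kfirst, klast
 10   continue
      end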

2.2 Matrix Transposition

When finding a quantity (e.g. the Courant condition) for a point (i,j,k) requires the point (i,j,k+1), the z-plane needed from the neighboring node can be obtained when needed using mpi_sendrecv_replace. When solving the TVD equations, this approach will not work because the array is being updated as the equations are solved. Also, when doing the update along the z-direction, either two planes would be required from both the upper and lower neighbors, or else messages would have to be sent for every pair (i,j). Furthermore, a great deal of computation is required for each grid point, including the boundary points; this could nearly double the computation time. Therefore what is done is to transpose the x and z axes of the gas array, making the gas solution along the z-direction local to one processor. Once the gas variables are updated, the transpose is repeated.
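
As a concrete example of the first case, the sketch below fetches a single boundary z-plane with one call to mpi_sendrecv_replace; the routine and variable names (getpln, plane, up, down) are illustrative, not those of the actual code.

c     Illustrative sketch: exchange one boundary z-plane of gas data
c     between neighboring nodes with mpi_sendrecv_replace.  On entry
c     the buffer holds this node's first local plane; on return it
c     holds the first plane of the upper neighbor.
      subroutine getpln(plane, nx, ny, comm)
      include 'mpif.h'
      integer nx, ny, comm
      real plane(5, nx, ny)
      integer up, down, myid, np, ierr
      integer status(MPI_STATUS_SIZE)
      call mpi_comm_rank(comm, myid, ierr)
      call mpi_comm_size(comm, np, ierr)
c     periodic boundary conditions: neighbor ranks wrap around
      up   = mod(myid+1, np)
      down = mod(myid-1+np, np)
c     send the plane to the lower neighbor and receive the upper
c     neighbor's plane into the same buffer
      call mpi_sendrecv_replace(plane, 5*nx*ny, MPI_REAL,
     &     down, 0, up, 0, comm, status, ierr)
      return
      end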

A communication model for the transpose was designed to handle the large traffic imposed while minimizing the message buffer size. We require: (a) all the messages sent out by the source node be received by the destination node immediately, (b) all the processors are sending/receiving messages without interruption, and (c) no blocked messages.

Foster [6] describes the hyper-cube matrix transpose algorithm. This algorithm uses only log2(p) messages per node to accomplish the transpose, but it demands that p = 2^d, where d is an integer--- an overly restrictive requirement. We have developed what we will call the exchange-matrix algorithm, which does require p-1 messages per node but transfers the same total amount of data as the hyper-cube algorithm without requiring p to be a power of 2. The idea is to design a symmetric p-by-p matrix E, with integer elements E(i,j), that has the following properties:

  1. E(i,i) = 0,
  2. E(i,j) = E(j,i),
  3. E(i,j) ≠ E(i,k), if j ≠ k,
  4. 1 ≤ E(i,j) ≤ p-1 for i ≠ j.
Once such a matrix is constructed, E(i,j) = t means that at the t-th iteration, nodes i and j should exchange messages. That is, at iteration t, node i will find the j such that E(i,j) = t; it then interchanges data with node j. After p-1 iterations, each pair of nodes will have exchanged data once and the transpose is complete.

For odd values of p no matrix will satisfy the above conditions. We have not attempted to rigorously prove that such a matrix can be found for any even value of p, but our program has constructed one during runs of many different sizes. The message buffer required in each node is only a small fraction of the size of the whole gas array. Examples of such a matrix are given below for p = 4, 6, and 8.
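
The sketch below shows one way such a matrix can be constructed for even p, using the round-robin (``circle'') pairing familiar from tournament scheduling; this construction is offered only as an illustration and is not necessarily the one used in the actual code.

c     One possible construction of an exchange matrix e for even p,
c     using the round-robin (``circle'') pairing: nodes 0..p-2 sit on
c     a circle and node p-1 is held fixed.
      program exmat
      integer p
      parameter (p=6)
      integer e(0:p-1, 0:p-1)
      integer i, j
      do 10 i = 0, p-1
         e(i,i) = 0
 10   continue
      do 30 i = 0, p-2
         do 20 j = 0, p-2
c           nodes i and j on the circle meet at iteration i+j mod p-1
            if (i .ne. j) e(i,j) = mod(i+j, p-1) + 1
 20      continue
c        at iteration mod(2i,p-1)+1 node i has no partner on the
c        circle, so it exchanges with the fixed node p-1
         e(i,p-1) = mod(2*i, p-1) + 1
         e(p-1,i) = e(i,p-1)
 30   continue
c     row i lists the iteration at which node i exchanges data with
c     node j; the zero on the diagonal means no message is needed
      do 40 i = 0, p-1
         write(*,'(16i3)') (e(i,j), j=0,p-1)
 40   continue
      end

For p = 4, for instance, this construction pairs nodes 0 and 3 at iteration 1, nodes 0 and 1 at iteration 2, and nodes 0 and 2 at iteration 3, so every pair of nodes meets exactly once.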

3 Gravity

The second component which must be included is the collisionless dark matter, which is represented as a number of point masses free to move with respect to the mesh (cf. [12]; a web-based overview of particle methods is given by [8]). The particle-mesh gravity solver can be divided into three main steps:

  1. find the density on a fixed grid
  2. compute the gravitational potential φ by solving Poisson's equation ∇²φ = 4πGρ on the grid (here ρ is the total density and G is the gravitational constant).
  3. find the acceleration on each particle given the potential on the grid (a = −∇φ).

For a large number of grid points step 2) can most efficiently be done by taking the Fast Fourier Transform of the density and computing the potential in Fourier space, where Poisson's equation becomes the algebraic relation φ(k) = −4πG ρ(k)/k²; an inverse transform then gives the potential in physical space. This does assume periodic boundary conditions, which is appropriate for our simulations.
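
A minimal sketch of this step is given below, assuming the density has already been transformed into a complex array; the routine and argument names (poisk, rhok, boxlen, gconst) are hypothetical, the continuum Green's function −4πG/k² is used for clarity, and a production PM code may instead use the corresponding finite-difference Green's function (cf. [12]).

c     Illustrative sketch of step 2 in Fourier space: multiply the
c     transformed density by the continuum Green's function
c     -4*pi*G/k**2 to obtain the transform of the potential.
      subroutine poisk(rhok, nx, ny, nz, boxlen, gconst)
      integer nx, ny, nz
      real boxlen, gconst
      complex rhok(nx, ny, nz)
      integer i, j, k, ii, jj, kk
      real pi, tpibl, wx, wy, wz, w2
      pi = 4.0*atan(1.0)
      tpibl = 2.0*pi/boxlen
      do 30 k = 1, nz
         kk = k - 1
         if (kk .gt. nz/2) kk = kk - nz
         wz = tpibl*kk
         do 20 j = 1, ny
            jj = j - 1
            if (jj .gt. ny/2) jj = jj - ny
            wy = tpibl*jj
            do 10 i = 1, nx
               ii = i - 1
               if (ii .gt. nx/2) ii = ii - nx
               wx = tpibl*ii
               w2 = wx*wx + wy*wy + wz*wz
               if (w2 .gt. 0.0) then
                  rhok(i,j,k) = -4.0*pi*gconst*rhok(i,j,k)/w2
               else
c                 zero mode: the mean density does not contribute
                  rhok(i,j,k) = (0.0, 0.0)
               end if
 10         continue
 20      continue
 30   continue
      return
      end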

3.1 Data Structure

The particles are equally divided among the nodes. For N total particles, each node defines an array of length N/p containing the particle data (an integer key, position, and velocity). A separate array is needed to hold the total density and then the potential (the FFT can be done in place). This array is allocated in the same manner as the gas mesh, only now we do use a boundary layer in the z-direction.

3.2 Parallel Algorithm

Making this algorithm parallel is more challenging than in the case of the gas, since the force of gravity is long range whereas gas pressure is local. In the case of gravitational clustering, there will be large density contrasts: some grid points will contain tens of thousands of particles, while many will be completely empty (see Figure 1a). This means all three steps outlined above will in general require significant communication. We followed the algorithm outlined by [5]; this is a load-balanced parallel version of PM, which they implemented on the CM-5 using CM-Fortran. The following steps are used:

3.3 Sorting

A particle with a nearest grid point (NGP) of (i,j,k) is given an integer key, key = i + (j-1)*n_x + (k-1)*n_x*n_y, and particles are sorted by this key. The choice of key is arbitrary; for this key assignment the particles are arranged by z-planes in the same manner as the grid. It may be possible to choose a different ordering which is more likely to put a particle on the same node as its NGP, which would reduce communication. A bucket sort is used, which requires extra space; however, since the sorting is done before the density assignment, we can use the density array as temporary storage to facilitate the sort. The sort contains the following steps (a sketch of the key computation follows the list):

  1. First each node does a local sort, creating segments of particles with the same key. The length of each segment is the number of particles on that node which belong to a given grid point.
  2. Each segment sends this number to the appropriate grid point. In a worst-case scenario each grid point would get an update from every node, but since in practice the particles are nearly sorted to begin with this does not happen. (In other words, for a typical time step the distance a particle moves is less than the grid spacing. However, the algorithm does not require this to be the case--- it is the Courant condition of the gas which usually limits the time step.)
  3. Each mesh cell now contains the number of particles in that cell. From this it is straightforward to find a running sum such that each grid point contains the number of particles with keys less than or equal to its key. Finally the keys are divided up between nodes so that each node has the same number of particles. (On occasion a segment will be divided between two nodes, which complicates matters only slightly.)
  4. Each node goes through its list of particles and sends particles to the appropriate nodes based on the keys.
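
The sketch below illustrates the key computation and the local counting of step 1 for particles whose NGP lies on the local node's z-planes; the routine and array names (locsrt, key, ncount, koff) are assumptions, and the handling of off-node keys (step 2) and of periodic wrapping is omitted.

c     Illustrative sketch of the key computation and local counting
c     used in step 1; positions are assumed to be in grid units.
      subroutine locsrt(np, x, y, z, key, nx, ny, nzloc, koff, ncount)
      integer np, nx, ny, nzloc, koff
      real x(np), y(np), z(np)
      integer key(np), ncount(nx*ny*nzloc)
      integer n, i, j, k, m
      do 10 m = 1, nx*ny*nzloc
         ncount(m) = 0
 10   continue
      do 20 n = 1, np
c        nearest grid point of particle n
         i = nint(x(n))
         j = nint(y(n))
         k = nint(z(n))
c        particles are ordered by z-planes, as the grid is
         key(n) = i + (j-1)*nx + (k-1)*nx*ny
c        koff is the key of the node's first grid point, minus one
         m = key(n) - koff
         if (m .ge. 1 .and. m .le. nx*ny*nzloc) then
            ncount(m) = ncount(m) + 1
         end if
 20   continue
      return
      end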

Steps 2) and 4) can be performed using non-blocking sends and receives. To further hide the latency of messages, a node can alternate between message passing and performing updates where all the data is local. In pseudo-code the basic algorithm for step 2) is (a runnable MPI sketch of the same pattern follows the pseudo-code):

post non-blocking receive
while ( grid points left to update || particle data needed elsewhere )
  if ( expecting a message && message has arrived )
            update grid point using data from message
            if ( expecting a message )
               post non-blocking receive
            end if
  end if
  if ( particle data needed elsewhere && send buffer available )
            put number of particles at grid point into send buffer
            post non-blocking send to target node
  end if
  if ( local grid point left to update with local particle segment)
            update grid point using local segment
  end if
end while
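
The loop above is pseudo-code; a minimal, runnable MPI illustration of the same send/test/compute pattern is given below. It exchanges one small update between every pair of nodes rather than real grid data, and all of its names (mdloop, sbuf, rbuf, and so on) are illustrative rather than taken from the actual code.

c     Minimal illustration of the message-driven loop: every node
c     posts non-blocking sends of one small update to every other
c     node, tests for incoming updates, and leaves room for local
c     work in between.  Assumes the number of nodes is at most maxp.
      program mdloop
      include 'mpif.h'
      integer maxp
      parameter (maxp=1024)
      integer ierr, myid, np, ndest, nleft, rreq, idest
      integer sreq(maxp), stats(MPI_STATUS_SIZE,maxp)
      integer status(MPI_STATUS_SIZE)
      real sbuf(maxp), rbuf, work
      logical flag
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, myid, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, np, ierr)
      nleft = np - 1
      ndest = 0
      work  = 0.0
c     post a non-blocking receive before entering the loop
      if (nleft .gt. 0) call mpi_irecv(rbuf, 1, MPI_REAL,
     &     MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, rreq, ierr)
 100  continue
      if (nleft .gt. 0 .or. ndest .lt. np-1) then
         if (nleft .gt. 0) then
            call mpi_test(rreq, flag, status, ierr)
            if (flag) then
c              an update has arrived: apply it, then repost
               work  = work + rbuf
               nleft = nleft - 1
               if (nleft .gt. 0) call mpi_irecv(rbuf, 1, MPI_REAL,
     &              MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, rreq, ierr)
            end if
         end if
         if (ndest .lt. np-1) then
c           post the next outgoing update without blocking
            ndest = ndest + 1
            idest = mod(myid+ndest, np)
            sbuf(ndest) = real(myid)
            call mpi_isend(sbuf(ndest), 1, MPI_REAL, idest,
     &           1, MPI_COMM_WORLD, sreq(ndest), ierr)
         end if
c        ... completely local grid-point updates would go here ...
         go to 100
      end if
      if (np .gt. 1) call mpi_waitall(np-1, sreq, stats, ierr)
      write(*,*) 'node', myid, ' sum of received ranks =', work
      call mpi_finalize(ierr)
      end

The point of this structure is that mpi_test never blocks, so a node can keep performing purely local updates while waiting for messages to arrive.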

If all of these non-blocking messages are sent in the ``eager'' mode it is quite possible that the receiving processor will run out of buffer space; thus it is necessary to run MPI with flow control enabled, putting a limit on the number of non-acknowledged messages sent [7].

It is not known beforehand just how many messages each node will send or receive; how this problem is dealt with is described in the next section.

3.4 Density assignment

First the gas density is placed in the density array. Since this array is distributed in the same manner as the gas array, this step is local. The dark matter density assignment is quite similar to step 2) of the sort, only using particle mass instead of just number. Since the particles are thoroughly sorted, each grid point will receive only one update. (If a segment lies along a node boundary then there may be two updates.)

One improvement which reduces the number of messages sent is to calculate density for an entire row of grid points (all i for fixed j and k) at a time; this can be done for the sort as well. Then the update for the entire row is sent as a single message. It will almost always happen that some nodes receive two updates for a single row, because the particles belonging to that row are divided between two processors. In consequence, a node cannot know beforehand precisely how many messages it will receive. Thus in order to let each node know when to stop, each processor sends a ``finished'' signal to all other nodes when it no longer has any messages to send. Each node continues collecting messages until it has received one of these signals from every other node, at which point it knows that no further messages will arrive. A second improvement is to use more than one receive buffer and post several receives at once; this is easy to implement with MPI routines.

If the particle density is softened then a particle contributes density to more than one cell; for example, in the cloud-in-cell approach [12] the density of a given particle is spread over a cube of 8 grid points. We account for this by including the extra cells in the message that is sent. That is, the send buffer is an array containing four rows: one of the rows contains the actual NGPs, and the other rows have non-zero density only because of the softening. When a message is received, 4 rows are updated instead of just one.
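
For reference, the sketch below shows the standard cloud-in-cell assignment of a single particle to the 8 surrounding cells; the routine and variable names are illustrative, positions are assumed to be in grid units, and the periodic wrapping and the row-based message packing described above are omitted.

c     Illustrative cloud-in-cell assignment of one particle of mass
c     pm at position (x,y,z), in grid units, to a density array d.
      subroutine cic1(d, nx, ny, nz, x, y, z, pm)
      integer nx, ny, nz
      real d(nx, ny, nz), x, y, z, pm
      integer i, j, k, i1, j1, k1
      real dx, dy, dz, tx, ty, tz
      i = int(x)
      j = int(y)
      k = int(z)
      dx = x - i
      dy = y - j
      dz = z - k
      tx = 1.0 - dx
      ty = 1.0 - dy
      tz = 1.0 - dz
      i1 = i + 1
      j1 = j + 1
      k1 = k + 1
c     the particle's mass is shared among the 8 surrounding cells
      d(i ,j ,k ) = d(i ,j ,k ) + pm*tx*ty*tz
      d(i1,j ,k ) = d(i1,j ,k ) + pm*dx*ty*tz
      d(i ,j1,k ) = d(i ,j1,k ) + pm*tx*dy*tz
      d(i1,j1,k ) = d(i1,j1,k ) + pm*dx*dy*tz
      d(i ,j ,k1) = d(i ,j ,k1) + pm*tx*ty*dz
      d(i1,j ,k1) = d(i1,j ,k1) + pm*dx*ty*dz
      d(i ,j1,k1) = d(i ,j1,k1) + pm*tx*dy*dz
      d(i1,j1,k1) = d(i1,j1,k1) + pm*dx*dy*dz
      return
      end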

3.5 Potential Calculation and Return

On the SP2, the 1-D ESSL library FFT routine SRCFT is used [13]. After performing the FFT on the two local dimensions, the array is transposed using the exchange-matrix algorithm (Section 2.2), so that the distributed z-dimension becomes local. The Poisson operator can then be applied with the array still in its transposed state. To return to the physical domain the z-direction inverse FFT is performed, followed by another matrix transpose which again makes the z-direction distributed. The remaining inverse FFTs are then done locally.

During the sorting step, a table was set up telling which node had particles with given keys. Thus each node can calculate the number of messages it will receive for particle updates, and which particles the messages are for. During this step each node must do three things:

  1. update particles for which both the NGP and the particle are local.
  2. send potential from local grid points to other nodes.
  3. get grid points for particles whose NGP is not local and update these particles.

Once again non-blocking communication can be used, and latency can to some degree be hidden by having each node alternate between message passing and completely local updates. The basic algorithm is:

post non-blocking receive
while ( particles left to update || grid points needed elsewhere )
  if ( expecting a message && message has arrived )
            update particles using grid points from message
            if ( expecting a message )
               post non-blocking receive
            end if
  end if
  if ( grid points needed elsewhere && send buffer available )
            pack grid points into buffer
            post non-blocking send to target node
  end if
  if ( segment left to update with local NGP )
            update particles using local grid points 
  end if
end while

As with the sort and density assignment, the number of messages is reduced by sending an entire row's worth of data at one time. Some rows will be sent to two different nodes, but this can easily be taken into account.

4 Timing and Scaling

A number of test and production runs were carried out on the IBM SP2 at the Cornell Theory Center. Figure 2 shows the wallclock time consumed per step as a function of the number of processors used. To compare runs with differently sized grids the times are divided by a factor proportional to the number of mesh points: the run times for the intermediate-size mesh (shown in red) are divided by 8 and those for the largest mesh (shown in blue) are divided by 64. The same number of particles is used in each case.

The purely computational portion of the hydrodynamical code (shown as triangles) speeds up in a nearly linear fashion with the number of processors p; this is as expected, since the number of grid points for which a node must do computation is proportional to 1/p and the same amount of work is required for each grid point. This part of the code consumes roughly 80% of the total time, so the total time also speeds up in a nearly linear fashion. On a single SP2 node the speed is 30 Mflop/sec; by tuning the TVD computation routine to the IBM RS/6000 processor a significant speedup could be achieved.

The time required for message passing in solving the TVD equations is roughly 10% of the total time, as is shown by the squares in Figure 2. The message passing consists mainly of transposing the gas array. Keeping the size of the mesh constant and increasing p, the speedup of this part is less than linear; but if the mesh size is increased along with p then the scaling is closer to linear. The PM part of the code also takes roughly 10% of the total time, as shown by the crosses in Figure 2. This portion scales better than the gas message passing when the mesh size is kept constant; when the mesh size is also increased the two scale similarly. Most of the communication involves sending a set of grid points (for instance a block in the matrix transpose, and a row in the density assignment). When the mesh size is increased the number of mesh points handled per message increases, thereby improving the timing on a per mesh point basis. On the other hand, increasing p increases the total number of messages sent.

We are currently improving the PM scaling. At present the sorting, density computation, and FFT steps of the PM algorithm consume roughly equal amounts of time. Communication dominates the time required by the density assignment, as was the case for [5]. On 64 nodes it takes 8 seconds to sort data for 17 million particles. The final step in the sort (sending particles) is accomplished with a call to mpi_alltoallv; it may be possible to use similar collective communication in the previous sorting step (finding number density) and in the density assignment. Thus all the data would be sent at once instead of one message per row, which could lead to substantial improvement in the PM timing--- both the total time used and the scaling. However, the amount of buffer space needed for these messages could be prohibitive.
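
As a sketch of what such a collective exchange might look like for the particle send, the routine below first exchanges the per-destination counts with mpi_alltoall and then moves the particle data in one mpi_alltoallv call; the routine name, argument names, and the packing of nw reals per particle are assumptions rather than the actual code.

c     Illustrative sketch of step 4 of the sort using collective
c     communication.  The particle data in sbuf is assumed to be
c     grouped by destination node, nw reals per particle, and the
c     number of nodes np is assumed not to exceed maxp.
      subroutine pexch(sbuf, scnt, rbuf, rcnt, nw, np, comm)
      include 'mpif.h'
      integer maxp
      parameter (maxp=1024)
      integer np, nw, comm
      integer scnt(np), rcnt(np)
      real sbuf(*), rbuf(*)
      integer sdsp(maxp), rdsp(maxp), scw(maxp), rcw(maxp)
      integer i, ierr
c     tell every node how many particles it will receive from us
      call mpi_alltoall(scnt, 1, MPI_INTEGER,
     &     rcnt, 1, MPI_INTEGER, comm, ierr)
c     counts and displacements in units of reals (nw per particle)
      sdsp(1) = 0
      rdsp(1) = 0
      do 10 i = 1, np
         scw(i) = nw*scnt(i)
         rcw(i) = nw*rcnt(i)
         if (i .gt. 1) then
            sdsp(i) = sdsp(i-1) + scw(i-1)
            rdsp(i) = rdsp(i-1) + rcw(i-1)
         end if
 10   continue
c     move all the particle data in a single collective call
      call mpi_alltoallv(sbuf, scw, sdsp, MPI_REAL,
     &     rbuf, rcw, rdsp, MPI_REAL, comm, ierr)
      return
      end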

In production runs the need for I/O slows down the overall timing by about 10%. The amount of data written is significant; to save the entire state of the simulation requires 5 reals for each gas mesh point (density, velocity, and temperature) plus 6 reals for each particle (position and velocity); this is over 3 Gbyte for the simulation shown in Figure 1.

5 Conclusion

The TVD cosmological hydrodynamic code of [16] has been successfully made parallel, with the PM algorithm following that of [5]. This distributed-memory implementation uses Fortran 77 and MPI, and so should be easily portable to other multiprocessor machines such as the Cray T3D/E. The main communication cost is in a matrix transpose (of the gas grid) and a global sort (of the particles). This code will allow hydrodynamical cosmological structure formation simulations larger than any so far attempted; these can be carried out on existing supercomputers such as the SP2.

Acknowledgements

This work was supported by NSF grant AST-9318185 as part of the Grand Challenge Cosmology Consortium (GC3). Supercomputing time was provided by the Cornell Theory Center.

References

[1] Bertschinger, E. 1994, Physica D, 77, 354, http://xxx.lanl.gov/abs/astro-ph/9311069.

[2] Cen, R., J. Miralda-Escude, J.P. Ostriker & M. Rauch. 1994, Ap. J. Let., 437, L9, http://xxx.lanl.gov/abs/astro-ph/9409017.

[3] Cristiani, R. 1996, In Dark Matter in the Universe: International School of Physics ``Enrico Fermi'' Course CXXXII. eds. S. Bonometto, J.R. Primack, A. Provenzale, Elsevier (New York), http://xxx.lanl.gov/abs/astro-ph/9512086.

[4] David, L.P., C. Jones & W. Forman. 1995, Ap. J., 445, 578.

[5] Ferrell, R. & E. Bertschinger. 1994, Int. Jour. of Mod. Phys. C, 5, 933, http://xyz.lanl.gov/abs/comp-gas/9310002.

[6] Foster, I. 1995, Designing and Building Parallel Programs. Addison-Wesley (Reading, MA), http://www.mcs.anl.gov/dbpp/.

[7] Franke, H., E. Wu, M. Riviere, P. Pattnaik & M. Snir. 1995, Proceedings of ICDCS '95 (Vancouver), http://www.tc.cornell.edu/UserDoc/Software/PTools/mpi/.

[8] Graps, A. 1996, N-Body / Particle Simulation Methods. Web page URL http://www.amara.com/papers/nbody.html.

[9] Gropp, W., E. Lusk, & A. Skjellum. 1994, Using MPI. MIT Press (Cambridge MA), http://www.mcs.anl.gov/mpi/usingmpi/index.html.

[10] Harten, A. 1983, J. Comp. Phys., 49, 357.

[11] Hernquist, L., N. Katz, D.H. Weinberg & J. Miralda-Escude. 1996, Ap. J. Let., 457, L51, http://www.aas.org/ApJ/v457n2/5617/sc0.html.

[12] Hockney, R.W. & J.W. Eastwood. 1988, Computer Simulations Using Particles. Adam Hilger (Bristol).

[13] IBM Corp., 1994. Engineering and Scientific Subroutine Library Guide and Reference. Version 2 Release 2, IBM order number SC23-0526-01. http://www.austin.ibm.com/software/Apps/essl.html.

[14] Katz, N., D.H. Weinberg, L. Hernquist & J. Miralda-Escude. 1996, Ap. J. Let., 457, 57, http://www.aas.org/ApJ/v457n2/5617/sc0.html.

[15] Ostriker, J.P. & R. Cen. 1996 Ap. J., 464, 27, http://xxx.lanl.gov/abs/astro-ph/9601021.

[16] Ryu, D., J.P. Ostriker, H. Kang & R. Cen. 1993, Ap. J., 414, 1.

[17] Sarazin, C. 1986, Rev. Mod. Phys., 58, 1.

[18] Sargent, W.L.W. 1988, in QSO Absorption Lines: Probing the Universe. Ed. J.C. Blades, D.A. Turnshek, C.A. Norman, Cambridge University Press (Cambridge, New York).

[19] Strang, G. 1968, SIAM J. Num. Anal., 5, 506.

[20] Tormen, G. 1996, in Dark Matter in Cosmology, Quantum Measurements, Experimental Gravitation: Proceedings of the XXXIst Rencontres de Moriond. Editions Frontieres (Gif-sur-Yvette, France), http://xxx.lanl.gov/abs/astro-ph/9604081.

[21] Trimble, V. 1987, Ann. Rev. Astron. Astrophys., 25, 425.

[22] Tsai, J.C. & D.A. Buote. 1996, MNRAS, in press, http://xxx.lanl.gov/abs/astro-ph/9510057.

[23] Zhang, Y., P. Anninos & M.L. Norman. 1996, Ap. J. Let., 453, L57, http://www.aas.org/ApJ/v453n2/5473/sc0.html.

* Paul Bode is currently a Postdoctoral Associate in the MIT Physics Department, doing computational astrophysics as a member of the Grand Challenge Cosmology Consortium.
Education:
Ph.D. in Astrophysics, Indiana University, 1994
M.A. in Astrophysics, Indiana University, 1992
B.A. in Physics, Carleton College, Northfield, MN, 1984
Home Page:
http://arcturus.mit.edu/~bode/
Email:
bode@alcor.mit.edu
Guohong Xu is a Postdoctoral Associate with the Board of Studies in Astronomy and Astrophysics, University of California at Santa Cruz.
Education:
Ph.D. in Astrophysics, Princeton University, 1995
Home Page:
http://www.astro.princeton.edu/~xu/
Email:
xu@bigdog.ucolick.org
Renyue Cen is currently a Research Astronomer at Princeton University Observatory.
Education and Experience:
July, 1992 -- June 1994: Research Staff Member, Princeton Univ. Obs.
October, 1990 -- June 1992: Postdoctoral Research Fellow, Princeton Univ. Obs.
Ph.D. in Astrophysics, Princeton University, 1991
B.S. in Physics, Fudan University, Shanghai, China, 1986
Home Page:
http://www.astro.princeton.edu/~cen/
Email:
cen@astro.princeton.edu