Paul W. Bode
bode@alcor.mit.edu
http://arcturus.mit.edu/~bode/
Guohong Xu
xu@bigdog.ucolick.org
http://www.astro.princeton.edu/~xu/
Renyue Cen
cen@astro.princeton.edu
http://www.astro.princeton.edu/~cen/
In the most widely favored models of cosmology the structures seen in the universe today formed because of self-gravity: initially very small perturbations in matter and radiation density were gravitationally amplified into galaxies, clusters of galaxies, and large-scale structure. By following how an initial set of primordial fluctuations grows in time one can give testable predictions for the distribution of matter today. Given the complex non-linear dynamics of gravitational collapse, simulations are a frequently used tool for making these predictions, particularly since the underlying physics is well understood and the problem is numerically well-defined [1]. See Figure 1a and Movie 1 for an example of such a simulation.
Of particular interest are the number and properties of clusters of galaxies. A large cluster can contain thousands of galaxies, as well as hot gas distributed throughout the cluster which is visible in X-rays (cf. [17]). It has become clear that this luminous matter alone cannot produce the observed dynamics in clusters, so there must be another, non-luminous component. This ``missing mass'' is some form of collisionless dark matter; see [21] for a review of possible candidates. Around a third of a typical cluster mass is in the X-ray emitting gas, with about 5% in stars and galaxies and the rest in dark matter (e.g. [4]). Cluster numbers and masses (and hence X-ray luminosities) depend sensitively on the assumed cosmological model; thus predictions concerning cluster formation can provide a discriminant between models ([15], [22]).
Another issue arises on smaller scales. Spectra from distant quasars show absorption lines from intervening clouds of gas; the origin of these ``Lyman-alpha'' absorption systems is still unclear (see [3] or [18] for a review of the observations). A number of numerical calculations have recently been carried out which are providing insights into the formation and structure of these clouds ([2], [11], [14], [23]). These models will provide another means of constraining cosmological theories.
A variety of hydrodynamical methods have been devised for studying cosmological structure formation (cf. [20]). A recently developed algorithm is the TVD code of [16]. This code follows both collisionless dark matter (which interacts via gravity only) and baryons (the gas, affected by both gravity and gas pressure). The gas is represented on a fixed grid; the hydrodynamical equations are solved with the Total Variation Diminishing (TVD) method [10]. Dark matter is represented by particles which move with respect to the grid, and the force of gravity is found using the Particle-Mesh (PM) method (cf. [12]).
In this paper we describe a parallel, distributed memory version of the TVD code which uses Fortran 77 with MPI [9] message passing, and give timing results from running the code on the IBM SP2. In Section 2 we describe how the gas is dealt with; since the code uses a fixed grid this part is relatively straightforward. Section 3 deals with making the PM solver parallel, which is harder to do in an efficient way; our implementation follows that of Ferrell and Bertschinger [5]. Timing and scaling results are given in Section 4.
Structure formation simulations follow the collapse of regions which are initially only slightly denser than average and their growth through accretion of surrounding material. Thus the hydrodynamic code must be able to follow extremely supersonic flows as gas falls into gravitational potential wells, as well as the strong shocks that form when this infalling gas collides with previously accreted material (for an example see Movie 2 and Figure 1b).
The TVD scheme solves a hyperbolic system of conservation equations (conservation of mass, momentum, and energy or pseudo-entropy). It is an explicit, second-order Eulerian finite difference scheme; it is shock-capturing, so one need not introduce artificial viscosity terms into the equations of motion. Rather than going into the details of the algorithm, we will refer the reader to [16] and [10] for a thorough description.
The TVD equations are solved on a fixed grid; the number of grid points in each direction is $n_x$, $n_y$, and $n_z$. The relevant gas variables are the density $\rho$, the three components of the velocity $v_x$, $v_y$, $v_z$, and the temperature $T$ (or equivalently the pressure). In the serial code these are contained in a single array with two buffer planes on each side of every dimension; periodic boundary conditions are assumed, so these buffer planes hold copies of the planes from the opposite side of the grid. The extra buffers are useful because, in the 1-D case, solving the equations for a given point $i$ requires data from all the points between $i-2$ and $i+2$.
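As an illustration, a minimal sketch of this 1-D layout with two periodic buffer zones on each side (the array name and size are hypothetical, not the code's actual declarations):

      program buffer_demo
c minimal sketch: 1-D gas array with two periodic buffer zones per side
c (hypothetical names; the production code stores 3-D fields)
      integer nx, m, i
      parameter (nx = 64)
      real u(5, -1:nx+2)
c u(1,i): density, u(2,i)-u(4,i): velocity components, u(5,i): temperature
      do i = 1, nx
         do m = 1, 5
            u(m,i) = 0.0
         enddo
      enddo
c copy periodic images into the buffer zones
      do m = 1, 5
         u(m,-1)   = u(m,nx-1)
         u(m, 0)   = u(m,nx)
         u(m,nx+1) = u(m,1)
         u(m,nx+2) = u(m,2)
      enddo
      end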
The 3-D version of the TVD code is obtained from the 1-D version by a Strang [19] type dimensional splitting. The most straightforward way of splitting up this array for a distributed memory computer is a block distribution in the $z$-direction.
Each processor will have $n_z/P$ $z$-planes, where $P$ is the number of processors; process 0 holds the first $n_z/P$ planes, process 1 the next $n_z/P$, and so on. In this case all computations done along the $x$ and $y$ directions are completely local and (since the same equations are being solved for every point) load balanced.
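A minimal sketch of how this block decomposition might be set up, assuming $n_z$ is divisible by the number of processors (the variable names and mesh size are hypothetical):

      program decomp_demo
c assign a contiguous block of z-planes to each MPI process
      include 'mpif.h'
      integer nz, rank, nprocs, ierr, nzp, kfirst, klast
      parameter (nz = 256)
      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      call mpi_comm_size(mpi_comm_world, nprocs, ierr)
      nzp    = nz/nprocs
      kfirst = rank*nzp + 1
      klast  = kfirst + nzp - 1
      write(*,*) 'rank', rank, ' holds planes', kfirst, ' to', klast
      call mpi_finalize(ierr)
      end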
It might seem desirable to also keep boundary planes in the $z$-direction, but this would use too much memory when performing the large simulations we require. For example, for a large mesh distributed over 128 nodes each node holds only a few $z$-planes; adding four boundary planes would nearly double the memory requirement.
Thus each node declares an array with no buffer planes in the $z$-direction. When finding a quantity (e.g. the Courant condition) for a point requires a point lying on a neighboring node, the plane containing that point can be obtained when needed using mpi_sendrecv_replace.
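For example, a boundary plane can be shifted around the ring of nodes with a single call; in the sketch below (hypothetical names, with the plane already packed into a contiguous buffer) each node sends its plane to the upper neighbor and receives the lower neighbor's plane in its place:

      subroutine get_lower_plane(plane, n, rank, nprocs)
c exchange a boundary plane with the neighboring nodes: send our plane
c to the upper neighbor and replace it with the lower neighbor's plane
      include 'mpif.h'
      integer n, rank, nprocs, up, down, ierr
      integer status(mpi_status_size)
      real plane(n)
      up   = mod(rank+1, nprocs)
      down = mod(rank-1+nprocs, nprocs)
      call mpi_sendrecv_replace(plane, n, mpi_real, up, 0,
     &     down, 0, mpi_comm_world, status, ierr)
      return
      end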
When solving the TVD equations, this approach will not work because the array is being updated as the equations are solved. Also, when doing the update along the $z$-direction either two planes would be required from both the upper and lower neighbors, or else messages would have to be sent for every $(x,y)$ pair. Furthermore, a great deal of computation is required for each grid point, including the boundary points; this could nearly double the computation time. Therefore what is done is to transpose the $x$ and $z$ axes of the array, making the gas solution along the $z$-direction local to one processor. Once the gas variables are updated, the transpose is repeated.
A communication model for the transpose was designed to handle the large traffic imposed while minimizing the message buffer size. We require that (a) all messages sent out by a source node are received by the destination node immediately, (b) all processors send and receive messages without interruption, and (c) no messages block.
Foster [6] describes the hyper-cube matrix transpose algorithm. This algorithm uses only $\log_2 P$ messages per node to accomplish the transpose, but it demands that $P = 2^d$, where $d$ is an integer; this is an overly restrictive requirement. We have developed what we will call the exchange-matrix algorithm, which does require $P-1$ messages per node but transfers the same total amount of data as the hyper-cube algorithm without requiring $P$ to be a power of 2. The idea is to design a symmetric $P \times P$ matrix $E$, where $E(i,j)$ gives the step at which nodes $i$ and $j$ exchange their blocks, that has the following properties: every off-diagonal entry is an integer between 1 and $P-1$, and each of these values appears exactly once in every row and column, so that at each step every node is paired with exactly one partner. For odd values of $P$ no matrix will satisfy the above conditions. We have not attempted to rigorously prove that such a matrix can be found for any even value of $P$, but our program has constructed one during runs of many different sizes. The message buffer required in each node is only the size of a single block, a small fraction of the whole array. Examples of such a matrix can be constructed for $P$ = 4, 6, and 8; a sketch of one possible construction is given below.
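A minimal sketch of one possible construction, the standard round-robin (circle method) pairing; the variable names are hypothetical and this is not necessarily the construction used in the production code:

      program exchange_matrix
c build a P x P exchange matrix by round-robin (circle method) pairing:
c e(i,j) = step at which nodes i and j exchange blocks (P must be even)
      integer pmax
      parameter (pmax = 16)
      integer e(0:pmax-1, 0:pmax-1)
      integer p, r, k, a, b
      p = 6
      do r = 0, p-2
c        node p-1 is fixed and paired with node r at step r+1
         e(p-1, r) = r + 1
         e(r, p-1) = r + 1
         do k = 1, p/2 - 1
            a = mod(r + k, p - 1)
            b = mod(r - k + (p - 1), p - 1)
            e(a, b) = r + 1
            e(b, a) = r + 1
         enddo
      enddo
c print the matrix (diagonal entries are unused; print 0 there)
      do a = 0, p-1
         e(a, a) = 0
         write(*,'(16i3)') (e(a, b), b = 0, p-1)
      enddo
      end

In the transpose itself, node $i$ then exchanges blocks at step $s$ with the node $j$ for which $E(i,j) = s$, for example with a single mpi_sendrecv per step.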
The second component which must be included is the collisionless dark matter, which is represented as a number of point masses free to move with respect to the mesh (cf. [12]; a web-based overview of particle methods is given by [8]). The particle-mesh gravity solver can be divided into three main steps: 1) assign the mass density to the grid; 2) solve Poisson's equation on the grid to obtain the gravitational potential; 3) interpolate the gravitational force back to the particle positions and update the particle velocities and positions.
For a large number of grid points step 2) can most efficiently be done by taking the Fast Fourier Transform of the density and computing the potential in Fourier space; an inverse transform then gives the potential in physical space. This does assume periodic boundary conditions, which is appropriate for our simulations.
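In Fourier space the Poisson equation becomes algebraic. Schematically (a sketch that ignores the comoving factors and normalization conventions used in the actual code),

   \nabla^2 \phi = 4 \pi G \rho
   \quad \Longrightarrow \quad
   \hat{\phi}(\mathbf{k}) = - \frac{4 \pi G \, \hat{\rho}(\mathbf{k})}{k^2} ,
   \qquad \mathbf{k} \neq 0 ,

so each Fourier mode of the density is simply multiplied by a Green's function proportional to $-1/k^2$ (with the $\mathbf{k}=0$ mode set to zero) before transforming back.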
The particles are equally divided among nodes. For $N$ total particles, each node will define an array of length $N/P$ containing particle data (an integer key, position $x$, $y$, $z$, and velocity $v_x$, $v_y$, $v_z$). An array is needed to hold the total density and then potential (the FFT can be done in place). This array is allocated in the same manner as the gas mesh, only now we do use a boundary layer in the $z$-direction.
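A minimal sketch of the per-node storage this implies; the names and sizes are placeholders rather than the code's actual declarations:

      program particle_arrays
c per-node particle data (integer key plus position and velocity) and
c the density/potential mesh with a boundary layer of z-planes
      integer nx, ny, nzp, npnode
      parameter (nx = 256, ny = 256, nzp = 4, npnode = 262144)
      integer key(npnode)
      real pos(3, npnode), vel(3, npnode)
      real rho(nx, ny, 0:nzp+1)
      end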
Making this algorithm parallel is more challenging than in the
case of the gas, since the force of gravity is long range
whereas gas pressure is local.
In the case of gravitational clustering, there will be large density
contrasts: some grid points will contain tens of thousands of particles,
while many will be completely empty
(see Figure 1a).
This means all three steps outlined
above will in general require significant communication.
We followed the algorithm outlined by
[5];
this is
a load-balanced parallel version of PM, which they implemented
on the CM-5 using CM-Fortran.
The following steps are used:
A particle with nearest grid point (NGP) $(i,j,k)$ is given an integer key computed from these indices, and particles are sorted by this key. The choice of key is arbitrary; for this key assignment the particles are arranged by $z$-planes in the same manner as the grid. It may be possible to choose a different ordering which is more likely to put a particle on the same node as its NGP, which would reduce communication.
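One key of the kind described, ordered so that the $z$ index varies most slowly, could be computed as follows (an assumed form, shown only for illustration):

      integer function zkey(i, j, k, nx, ny)
c map the nearest-grid-point indices (i,j,k) to a single integer key
c that increases plane by plane in z, then row by row in y
      integer i, j, k, nx, ny
      zkey = (k - 1)*nx*ny + (j - 1)*nx + i
      return
      end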
A bucket sort is used, which requires extra space; however, since the sorting is done before the density assignment, we can use the mesh array (which will later hold the density and potential) as temporary storage to facilitate the sort.
The sort contains the following steps:
Steps 2) and 4) can be performed using non-blocking sends and receives. To further hide the latency of messages, a node can alternate between message passing and performing updates where all the data is local. In pseudo-code the basic algorithm for step 2) is:
post non-blocking receive
while ( grid points left to update || particle data needed elsewhere )
if ( expecting a message && message has arrived )
update grid point using data from message
if ( expecting a message )
post non-blocking receive
end if
end if
if ( particle data needed elsewhere && send buffer available )
put number of particles at grid point into send buffer
post non-blocking send to target node
end if
if ( local grid point left to update with local particle segment)
update grid point using local segment
end if
end while
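In Fortran with MPI the communication skeleton of this loop might look as follows; this is a generic sketch of the non-blocking pattern (hypothetical buffer sizes, tags, and target nodes), not the production routine:

      subroutine step2_sketch(nleft, nsend)
c generic skeleton of the non-blocking pattern above: alternate between
c testing for incoming messages, sending when the send buffer is free,
c and doing purely local updates.  Handling of the ``finished'' signals
c and of the last outstanding receive is omitted.
      include 'mpif.h'
      integer nleft, nsend, ierr, idest
      integer rreq, sreq, status(mpi_status_size)
      logical gotmsg, sfree
      real recvbuf(1024), sendbuf(1024)
      sreq = mpi_request_null
      call mpi_irecv(recvbuf, 1024, mpi_real, mpi_any_source, 1,
     &     mpi_comm_world, rreq, ierr)
 10   if (nleft .gt. 0 .or. nsend .gt. 0) then
         call mpi_test(rreq, gotmsg, status, ierr)
         if (gotmsg) then
c           ... update grid points using the data in recvbuf ...
            call mpi_irecv(recvbuf, 1024, mpi_real, mpi_any_source, 1,
     &           mpi_comm_world, rreq, ierr)
         endif
         call mpi_test(sreq, sfree, status, ierr)
         if (nsend .gt. 0 .and. sfree) then
c           ... pack the next row of particle counts into sendbuf and
c           set idest to the node owning that row (placeholder here) ...
            idest = 0
            call mpi_isend(sendbuf, 1024, mpi_real, idest, 1,
     &           mpi_comm_world, sreq, ierr)
            nsend = nsend - 1
         endif
         if (nleft .gt. 0) then
c           ... update one local grid point from the local segment ...
            nleft = nleft - 1
         endif
         goto 10
      endif
      call mpi_wait(sreq, status, ierr)
      return
      end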
If all of these non-blocking messages are sent in the ``eager'' mode it is quite possible that the receiving processor will run out of buffer space; thus it is necessary to run MPI with flow control enabled, putting a limit on the number of non-acknowledged messages sent [7].
It is not known beforehand just how many messages each node will send or receive; how this problem is dealt with is described in the next section.
First the gas density is placed in the density array. Since this array is distributed in the same manner as the gas mesh, this step is local.
The dark matter
density assignment is quite similar to step 2) of the sort,
only using particle mass instead of just number.
Since the particles are thoroughly sorted, each grid point will
receive only one update. (If a segment lies along a node boundary
then there may be two updates).
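Since every segment of sorted particles shares a single nearest grid point, the purely local part of this update reduces to one accumulation per segment; a minimal sketch (hypothetical names, nearest-grid-point assignment only):

      subroutine deposit_segment(rho, nx, ny, nzp, i, j, k,
     &     nseg, pmass)
c add the mass of a segment of nseg particles, all of which share the
c nearest grid point (i,j,k), to the local density array
      integer nx, ny, nzp, i, j, k, nseg
      real rho(nx, ny, nzp), pmass
      rho(i, j, k) = rho(i, j, k) + nseg*pmass
      return
      end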
One improvement which reduces the number of messages sent is to calculate the density for an entire row of grid points ($i = 1, \ldots, n_x$, for fixed $j$ and $k$) at a time; this can be done for the sort as well. Then the update for the entire row is sent as a single message. This almost always ensures that some nodes will receive two updates for a single row, because the particles belonging to that row are divided between two processors.
In consequence, a node cannot know beforehand precisely how many messages it will receive. Thus, in order to let each node know when to stop, each processor sends a ``finished'' signal to all other nodes when it no longer has any messages to send. Each node continues collecting messages until it has received $P-1$ of these signals, at which point it knows that no further messages will arrive.
A second improvement is to use more than one receive buffer and post
several receives at once; this is easy to implement with MPI routines.
If the particle density is softened then a particle contributes density to more than one cell; for example, in the cloud-in-cell approach [12] the density of a given particle is spread over a cube of 8 grid points. We account for this by including the extra cells in the message that is sent. That is, the send buffer is an array holding four rows; one row contains the actual NGPs and the other rows have non-zero density only because of the softening. When a message is received, 4 rows are updated instead of just one.
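For reference, a minimal sketch of the cloud-in-cell deposit for a single particle (hypothetical names; the mesh spacing is taken as unity and periodic wrapping at the grid edges is omitted):

      subroutine cic_deposit(rho, nx, ny, nz, x, y, z, pmass)
c spread the mass of one particle over the 2x2x2 cube of cells
c surrounding it, with trilinear (cloud-in-cell) weights;
c assumes 1 <= x < nx, 1 <= y < ny, 1 <= z < nz
      integer nx, ny, nz, i, j, k, ii, jj, kk
      real rho(nx, ny, nz), x, y, z, pmass, dx, dy, dz, wx, wy, wz
      i  = int(x)
      j  = int(y)
      k  = int(z)
      dx = x - i
      dy = y - j
      dz = z - k
      do kk = 0, 1
         wz = (1.0 - dz)*(1 - kk) + dz*kk
         do jj = 0, 1
            wy = (1.0 - dy)*(1 - jj) + dy*jj
            do ii = 0, 1
               wx = (1.0 - dx)*(1 - ii) + dx*ii
               rho(i+ii, j+jj, k+kk) = rho(i+ii, j+jj, k+kk)
     &              + pmass*wx*wy*wz
            enddo
         enddo
      enddo
      return
      end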
On the SP2, the 1-D ESSL library FFT routine SRCFT is used [13]. After performing the FFT on the two local dimensions, the array is transposed using the exchange-matrix algorithm (Section 2.2), so that the distributed dimension becomes local. The Poisson operator can then be applied with the array still in its transposed state. To return to the physical domain the $z$-direction inverse FFT is performed, followed by another matrix transpose which again makes the $z$-direction distributed. The remaining inverse FFT's are then done locally.
During the sorting step, a table was set up telling which node had particles with given keys. Thus each node can calculate the number of messages it will receive for particle updates, and which particles the messages are for. During this step each node must do three things: send the grid data that other nodes need for their particles, receive grid data from other nodes and update the corresponding particles, and update those of its own particles whose required grid points are already local.
Once again non-blocking communication can be used, and latency can to some degree be hidden by having each node alternate between message passing and completely local updates. The basic algorithm is:
post non-blocking receive
while ( particles left to update || grid points needed elsewhere )
if ( expecting a message && message has arrived )
update particles using grid points from message
if ( expecting a message )
post non-blocking receive
end if
end if
if ( grid points needed elsewhere && send buffer available )
pack grid points into buffer
post non-blocking send to target node
end if
if ( segment left to update with local NGP )
update particles using local grid points
end if
end while
As with the sort and density assignment, the number of messages is reduced by sending an entire row's worth of data at one time. Some rows will be sent to two different nodes, but this can easily be taken into account.
A number of test and production runs were carried out on the IBM SP2 at the Cornell Theory Center. Figure 2 shows the wallclock time consumed per step as a function of the number of processors used. To compare runs with differently sized grids the times are divided by a factor proportional to the total number of grid points: the run times for a mesh twice as large in each dimension (shown in red) are divided by 8, those for a mesh four times as large (shown in blue) are divided by 64, and so on. The number of particles is the same in each case.
The purely computational portion of the hydrodynamical code (shown as triangles) speeds up in a nearly linear fashion with the number of processors $P$; this is as expected, since the number of grid points for which a node must do computation is proportional to $1/P$ and the same amount of work is required for each grid point. This part of the code consumes roughly 80% of the total time, so the total time also speeds up in a nearly linear fashion. On a single SP2 node the speed is 30 Mflop/sec; by tuning the TVD computation routine to the IBM RS/6000 processor a significant speedup could be achieved.
The time required for message passing in solving the TVD equations is roughly 10% of the computation time, as is shown by the squares in Figure 2. The message passing consists mainly of transposing the gas array. Keeping the size of the mesh constant and increasing $P$, this portion speeds up more slowly than linearly; but if the mesh size is increased along with $P$ then the scaling is closer to linear.
The PM part of the code also takes roughly 10% of the total time, as shown by the crosses in Figure 2. This portion scales better than the gas message passing when the mesh size is kept constant; when the mesh size is also increased the scaling of the two is similar.
Most of the communication involves sending a set of grid points (for instance a block in the matrix transpose, and a row in the density assignment). When the mesh size is increased the number of mesh points handled per message increases, thereby improving the timing on a per mesh point basis. On the other hand, increasing $P$ increases the total number of messages sent.
We are currently improving the PM scaling. In these runs the sorting, density computation, and FFT steps of the PM algorithm consume roughly equal amounts of time. Communication dominates the time required by the density assignment, as was the case for [5]. On 64 nodes it takes 8 seconds to sort data for 17 million particles.
The final step in the sort (sending particles) is
accomplished with a call to
mpi_alltoallv;
it may be possible to use similar collective communication
in the previous sorting step (finding number density)
and in the density assignment.
Thus all the data would be sent at once instead of one message per
row, which could lead to substantial
improvement in the PM timing--- both the total time used and the
scaling. However, the amount of buffer space needed for these
messages could be prohibitive.
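Such a collective redistribution has the following shape; this is a generic sketch with hypothetical buffer and count arrays, not the code's actual call:

      subroutine send_particles(sbuf, scnt, sdsp, rbuf, rcnt, rdsp,
     &     nproc)
c redistribute sorted particle data in one collective call: this node
c sends scnt(j) reals starting at sbuf(sdsp(j)+1) to node j, and
c receives rcnt(j) reals from node j into rbuf(rdsp(j)+1)
      include 'mpif.h'
      integer nproc, ierr
      integer scnt(nproc), sdsp(nproc), rcnt(nproc), rdsp(nproc)
      real sbuf(*), rbuf(*)
      call mpi_alltoallv(sbuf, scnt, sdsp, mpi_real,
     &     rbuf, rcnt, rdsp, mpi_real,
     &     mpi_comm_world, ierr)
      return
      end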
In production runs the need for IO slows down the overall timing by about 10%. The amount of data written is significant; to save the entire state of the simulation requires 5 reals for each gas mesh point (the density, three velocity components, and temperature) plus 6 reals for each particle (position and velocity); this is over 3 Gbyte for the simulation shown in Figure 1.
The TVD cosmological hydrodynamic code of [16] has been successfully made parallel, with the PM algorithm following that of [5]. This distributed-memory implementation uses Fortran 77 and MPI, and so should be easily portable to other multiprocessor machines such as the Cray T3D/E. The main communication cost is in a matrix transpose (of the gas grid) and a global sort (of the particles). This code will allow hydrodynamical cosmological structure formation simulations larger than any so far attempted; these can be carried out on existing supercomputers such as the SP2.
This work was supported by NSF grant AST-9318185 as part of the Grand Challenge Cosmology Consortium (GC3). Supercomputing time was provided by the Cornell Theory Center.
[1] Bertschinger, E. 1994, Physica D, 77, 354, http://xxx.lanl.gov/abs/astro-ph/9311069.
[2] Cen, R., J. Miralda-Escude, J.P. Ostriker & M. Rauch. 1994, Ap. J. Let., 437, L9, http://xxx.lanl.gov/abs/astro-ph/9409017.
[3] Cristiani, R. 1996, In Dark Matter in the Universe: International School of Physics ``Enrico Fermi'' Course CXXXII. eds. S. Bonometto, J.R. Primack, A. Provenzale, Elsevier (New York), http://xxx.lanl.gov/abs/astro-ph/9512086.
[4] David, L.P., C. Jones & W. Forman. 1995, Ap. J., 445, 578.
[5] Ferrell, R. & E. Bertschinger. 1994, Int. Jour. of Mod. Phys. C, 5, 933, http://xyz.lanl.gov/abs/comp-gas/9310002.
[6] Foster, I. 1995, Designing and Building Parallel Programs. Addison-Wesley (Reading, MA), http://www.mcs.anl.gov/dbpp/.
[7] Franke, H., E. Wu, M. Riviere, P. Pattnaik & M. Snir. 1995, Proceedings of ICDCS '95 (Vancouver), http://www.tc.cornell.edu/UserDoc/Software/PTools/mpi/.
[8] Graps, A. 1996, N-Body / Particle Simulation Methods. Web page URL http://www.amara.com/papers/nbody.html.
[9] Gropp, W., E. Lusk, & A. Skjellum. 1994, Using MPI. MIT Press (Cambridge MA), http://www.mcs.anl.gov/mpi/usingmpi/index.html.
[10] Harten, A. 1983, J. Comp. Phys., 49, 357.
[11] Hernquist, L., N. Katz, D.H. Weinberg & J. Miralda-Escude. 1996, Ap. J. Let., 457, L51, http://www.aas.org/ApJ/v457n2/5617/sc0.html.
[12] Hockney, R.W. & J.W. Eastwood. 1988, Computer Simulations Using Particles. Adam Hilger (Bristol).
[13] IBM Corp., 1994. Engineering and Scientific Subroutine Library Guide and Reference. Version 2 Release 2, IBM order number SC23-0526-01. http://www.austin.ibm.com/software/Apps/essl.html.
[14] Katz, N., D.H. Weinberg, L. Hernquist & J. Miralda-Escude. 1996, Ap. J. Let., 457, 57, http://www.aas.org/ApJ/v457n2/5617/sc0.html.
[15] Ostriker, J.P. & R. Cen. 1996 Ap. J., 464, 27, http://xxx.lanl.gov/abs/astro-ph/9601021.
[16] Ryu, D., J.P. Ostriker, H. Kang & R. Cen. 1993, Ap. J., 414, 1.
[17] Sarazin, C. 1986, Rev. Mod. Phys., 58, 1.
[18] Sargent, W.L.W. 1988, in QSO Absorption Lines: Probing the Universe. Ed. J.C. Blades, D.A. Turnshek, C.A. Norman, Cambridge University Press (Cambridge, New York).
[19] Strang, G. 1968, SIAM J. Num. Anal., 5, 506.
[20] Tormen, G. 1996, in Dark Matter in Cosmology, Quantum Measurements, Experimental Gravitation: Proceedings of the XXXIst Rencontres de Moriond. Editions Frontieres (Gif-sur-Yvette, France), http://xxx.lanl.gov/abs/astro-ph/9604081.
[21] Trimble, V. 1987, Ann. Rev. Astron. Astrophys., 25, 425.
[22] Tsai, J.C. & D.A. Buote. 1996, MNRAS, in press, http://xxx.lanl.gov/abs/astro-ph/9510057.
[23] Zhang, Y., P. Anninos & M.L. Norman. 1996, Ap. J. Let., 453, L57, http://www.aas.org/ApJ/v453n2/5473/sc0.html.