Y. Chen
ying@cs.uiuc.edu
http://bunny.cs.uiuc.edu/CDR/people/Chen.html
M. Winslett
winslett@cs.uiuc.edu
http://bunny.cs.uiuc.edu/CDR/people/Winslett.html
S. Kuo
s-kuo@cs.uiuc.edu
http://bunny.cs.uiuc.edu/CDR/people/Kuo.html
Y. Cho
ycho@cs.uiuc.edu
http://bunny.cs.uiuc.edu/CDR/people/Cho.html
M. Subramaniam
masubram@us.oracle.com
http://bunny.cs.uiuc.edu/CDR/people/Subraman.html
K. E. Seamons
seamons@transarc.com
http://bunny.cs.uiuc.edu/CDR/people/Seamons.html
Panda (URL http://bunny.cs.uiuc.edu/CDR/panda/) is an i/o library motivated by the needs of high-performance SPMD scientific applications that must input and output multidimensional arrays on distributed memory parallel platforms or networks of workstations. Panda supports collective i/o operations where all the processors used by an application are closely synchronized and periodically cooperate in conceptually simple i/o operations, such as reading or writing an entire array. Panda provides application developers with a high-level interface for array i/o and offers an efficient commodity-parts-based implementation of the interface across a variety of computer architectures using server-directed i/o, which allows long sequential reads and writes at each i/o node.
Despite numerous implementations of parallel i/o, few detailed performance analyses are available. The behavior of a parallel i/o system is usually difficult to analyze empirically, let alone characterize abstractly. Yet an abstract characterization is essential for `automatic' optimization of i/o request handling; otherwise the parallel i/o system must be tuned by hand for each platform and workload encountered, and will not be able to adapt automatically to changing conditions. Furthermore, the absence of a performance model makes it much more difficult to identify potential performance bottlenecks, necessitating time-consuming empirical experimentation under realistic conditions on often-scarce hardware and software resources, and the limited number of experiments those resources allow may not tell the whole performance story. As another complication, in the absence of standard benchmarks and performance models, cross-library, cross-version, cross-platform, and even cross-installation predictions and comparisons of performance are extremely difficult. Finally, without a performance model for a parallel i/o system, it is hard to learn from past successes and failures when building a new system or a new system version.
In section 2 of this paper, we present a generic performance model for Panda, then customize it for Panda 2.0.5. In section 3, we validate the model by comparing its performance predictions to the real measured performance on an SP2. We then give examples of the use of the model in identifying potential performance bottlenecks, and show how model outputs can help make optimal plans for i/o operations. Section 4 discusses the related work in this field and section 5 concludes the paper.
In this section, we briefly present Panda's architecture, followed by an abstract performance model and then a customized model for Panda 2.0.5. A more detailed description of Panda's architecture can be found in [Seamons95], along with documentation of Panda's excellent performance on the IBM SP2.
Panda consists of clients and servers (one per compute node and i/o node, respectively). Under Panda's server-directed i/o, the application communicates with its local Panda client via a high-level interface for collective i/o of multidimensional arrays. The clients then pass a high-level description of the i/o request to the servers, so the servers understand both the array data distribution on disk and in memory. Then servers sequentially read data from files, scattering the data to clients as it arrives from disk, and employing the reverse strategy for writes.
Panda currently supports applications that distribute arrays across all clients using HPF-style BLOCK- and *-based array schemas (distributions). We call the subarrays resulting from a distribution ``chunks''. The schemas for an array on disk and in memory can be completely different, and Panda converts the data from one schema to another automatically during the i/o request. The application can request i/o of an entire set of arrays in one Panda call.
During i/o, Panda divides very large array chunks into sequences of smaller pieces, called `subchunks'. (A chunk that does not exceed the subchunk size limit can be thought of as a single subchunk.) Panda's unit of reads and writes is the subchunk, not a disk block, so for writes each server gathers one subchunk before issuing a file system level write. Similarly, after reading one subchunk, the server scatters it to the appropriate clients before performing the next file system level read. Thus the size chosen for subchunks determines server buffer requirements, and is a tunable parameter.
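As a concrete illustration of this decomposition, the sketch below (our own, not Panda source: the helper names are invented, the split along the first dimension is just one plausible subchunking strategy, and the 1 MB cap anticipates the subchunk size limit discussed in section 3) computes HPF-style BLOCK chunk bounds for one client and splits a chunk into subchunks.

```python
# Illustrative sketch (not Panda code): BLOCK decomposition of an array over a
# logical mesh, and division of a chunk into size-limited subchunks.

def block_bounds(extent, nblocks, b):
    """Lower/upper index bounds of block b when extent is split into nblocks."""
    size = -(-extent // nblocks)            # ceiling division: HPF BLOCK size
    lo = b * size
    hi = min(extent, lo + size)
    return lo, hi

def chunk_for_client(array_shape, mesh, coord):
    """Per-dimension bounds of the subarray owned under a (BLOCK, ..., BLOCK) schema."""
    return [block_bounds(n, m, c) for n, m, c in zip(array_shape, mesh, coord)]

def subchunks(chunk_bounds, elem_bytes, max_bytes=1 << 20):
    """Split a chunk into pieces of at most max_bytes by slicing its first dimension."""
    (lo0, hi0), rest = chunk_bounds[0], chunk_bounds[1:]
    plane = elem_bytes
    for lo, hi in rest:
        plane *= hi - lo                    # bytes per unit of the first dimension
    rows = max(1, max_bytes // plane)       # first-dimension extent of one subchunk
    for start in range(lo0, hi0, rows):
        yield [(start, min(hi0, start + rows))] + list(rest)

if __name__ == "__main__":
    shape, mesh = (256, 256, 256), (2, 4, 4)   # 128 MB array of 8-byte elements
    coord = (1, 2, 0)                          # one client's position in the mesh
    bounds = chunk_for_client(shape, mesh, coord)
    print("chunk bounds:", bounds)
    print("subchunks:", list(subchunks(bounds, elem_bytes=8)))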
Conceptually, each i/o request has three phases---startup, intermediate, and closing---at each client and server. The clients and servers enter their startup phase at approximately the same time when the application calls the Panda clients and the clients tell the servers about the request by sending a high-level description of the i/o request. Clients then enter the intermediate phase. After digesting the high-level description, each server enters its intermediate phase.
During the intermediate phase, each server understands which chunks are its responsibility, and gathers/scatters its assigned subchunks one at a time, issuing a file system level write after each subchunk is gathered (or a read before each subchunk is scattered). For writes, for each assigned disk subchunk, the server computes which clients own the pieces of the subchunk (the overlapping clients). For each overlapping client, the server calculates the indices of the relevant portion of the array chunk at that client and then requests that subarray from the client. On receiving a request message, the overlapping client extracts the overlap and sends it to the requesting server. Reads are the reverse. When the server's last assigned subchunk is flushed to disk or scattered to the clients, its intermediate phase is over. After it synchronizes with the other servers it enters its closing phase. Synchronization ensures that all data have been written to disk before Panda clients return from a write request, and all data have been scattered to the requesting clients for a read request (the blocking i/o execution model of Panda as discussed in section 2.2). The synchronization is unnecessary if a non-blocking i/o execution model is supported in Panda. The closing phase of a client begins once all the data it wants to read are in its buffer space or all the data it wants to write have been sent to the servers. During the closing phase, servers send a `quit' message to the clients, leave the closing phase, and await the next collective i/o request. Each client returns from the i/o request after it receives a `quit' message, and its closing phase ends at that point.
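The write-side gather can be summarized with a toy, self-contained model (our own simplification, not Panda code): the array is one-dimensional, each client owns one BLOCK of it in memory, and a server gathers a disk subchunk by requesting the overlapping piece from each overlapping client before issuing one sequential write.

```python
# Toy model of server-directed gathering for a write. All names are illustrative.

ARRAY = list(range(24))                  # the global array
N_CLIENTS = 4
BLOCK = len(ARRAY) // N_CLIENTS

# Client i holds elements [i*BLOCK, (i+1)*BLOCK) in memory.
client_memory = {i: ARRAY[i * BLOCK:(i + 1) * BLOCK] for i in range(N_CLIENTS)}

def overlapping_clients(lo, hi):
    """Clients whose in-memory block intersects the disk subchunk [lo, hi)."""
    for i in range(N_CLIENTS):
        c_lo, c_hi = i * BLOCK, (i + 1) * BLOCK
        o_lo, o_hi = max(lo, c_lo), min(hi, c_hi)
        if o_lo < o_hi:
            yield i, (o_lo, o_hi)

def client_extract(i, lo, hi):
    """What client i sends back when a server requests the region [lo, hi)."""
    base = i * BLOCK
    return client_memory[i][lo - base:hi - base]

def server_gather_subchunk(lo, hi):
    """Gather one disk subchunk: send every request first, then receive the overlaps."""
    requests = list(overlapping_clients(lo, hi))     # all requests go out first
    buf = []
    for i, (o_lo, o_hi) in requests:                 # then the overlaps are gathered
        buf.extend(client_extract(i, o_lo, o_hi))
    return buf                                       # ready for one sequential write

# The server's disk chunk [5, 17) split into two subchunks.
for lo, hi in [(5, 13), (13, 17)]:
    print((lo, hi), "->", server_gather_subchunk(lo, hi))
```

Reads follow the reverse pattern: the server performs a sequential read and scatters the pieces back to the overlapping clients.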
All Panda versions currently only support blocking Panda calls, i.e., clients do not return control to the application until all data are written to disk or read into application memory, and this assumption is built into the performance model. Thus the cost of a sequence of application i/o requests is the sum of the cost of each request, and our goal in this paper is to model the cost of a single collective i/o request. Under our model, nodes can be heterogeneous, e.g., with different file system or CPU characteristics. We choose the response time for a single collective i/o request and the throughput per Panda server as the performance metrics of interest. To ease the discussion, the array write operation is used as the running example; reads are very similar.
The performance model has three categories of parameters: for the request configuration (R), Panda internals (P), and the underlying system (S). R characterizes the collective i/o request, including the number of clients and servers, the number of arrays, array ranks and sizes, the client and server meshes, and the schemas for memory and disk. S characterizes the underlying hardware and software (UNIX, MPI) that Panda uses heavily. Strategies used in the three phases described in section 2.1 can be viewed as Panda's internal parameters P. For example, P includes the subchunk size.
The generic Panda model is simple; the complexities arise when customizing the model as in section 2.3. We use i to designate a client and l to designate a server. The elapsed time of a collective i/o request to Panda is measured at the clients:
$$T = \max_i T_i, \qquad T_i = \max_{0 \le l < n_s}\big(T^{start}_l + T^{inter}_l\big) + T^{close}_i \tag{1}$$

Here $T_i$ is the elapsed time at client $i$, $T^{start}_l$ and $T^{inter}_l$ are the times server $l$ spent in the startup and intermediate phases, and $T^{close}_i$ is the time $i$ spent in the closing phase. There are $n_s$ servers, numbered 0 to $n_s - 1$.
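In code, Formula 1 amounts to taking the slowest server's startup plus intermediate time and adding each client's closing cost; a minimal sketch, with function and parameter names of our own and the per-phase times assumed to be known:

```python
# Direct transcription of Formula 1 (our notation): the response time seen by a
# client is the slowest server's startup + intermediate time plus the client's
# own closing time; the collective request time is the slowest client.

def request_time(startup, intermediate, closing):
    """startup[l], intermediate[l]: per-server phase times; closing[i]: per-client."""
    slowest_server = max(s + m for s, m in zip(startup, intermediate))
    per_client = [slowest_server + c for c in closing]   # T_i for each client i
    return max(per_client)                               # T for the whole request

# Example: 2 servers, 3 clients (times in seconds).
print(request_time(startup=[0.010, 0.012],
                   intermediate=[1.8, 2.1],
                   closing=[0.002, 0.003, 0.004]))
```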
In this section, we tailor the generic model to fit Panda 2.0.5, using writes as the running example.
The startup phase of Panda 2.0.5 can be modeled as follows:
$$T^{start}_0 = T_{build}(cm) + T_{send}(cm,\,s) + T_{trans}(s) + T_{recv}(0,\,s) + T_{parse}(0) \tag{2}$$

$$T^{start}_l = T_{build}(cm) + T_{send}(cm,\,s) + T_{trans}(s) + T_{recv}(0,\,s) + l\,T_{send}(0,\,s) + T_{trans}(s) + T_{recv}(l,\,s) + T_{parse}(l), \quad 1 \le l < n_s \tag{3}$$

where, assuming no special broadcast facilities are used in Panda 2.0.5, $T^{start}_l$ can be broken into the time spent at a selected client, the client master (denoted $cm$ in Formula 2), at a selected server, the server master (numbered 0 in Formula 2), on the physical network, and at $l$. At the client master, $T^{start}_l$ includes the time to construct the schema message of length $s$ bytes, $T_{build}(cm)$, and the time to send it to the server master, $T_{send}(cm, s)$. Here, $T_{send}(k, n)$ and $T_{recv}(k, n)$ are the times spent in the send and receive calls provided by the underlying communication system on node $k$ when sending and receiving a message of length $n$ bytes without network contention. $T_{send}$ and $T_{recv}$ may or may not include the time spent in message transmission on the physical network, depending on the communication protocols used by the underlying communication system. If $T_{send}$ or $T_{recv}$ already includes the transmission time, $T_{trans}(n)$ is zero; otherwise, $T_{trans}(n)$ is the time to transmit $n$ bytes: $T_{trans}(n) = n / B_{net}$, where $B_{net}$ is the transmission rate of the physical network. We assume $B_{net}$ does not depend on the particular sender and recipient, as is appropriate in the environments where we currently run Panda.

$T_{trans}(s)$ and $T_{recv}(0, s)$ in Formula 2 are the transmission time for sending the schema message from the client master to the server master, and the time for the server master to receive the schema message assuming no delay from network contention. $T^{start}_l$ also includes the time for $l$ to parse the schema message, $T_{parse}(l)$. In Panda 2.0.5, the server master forwards the schema message to the other servers, so it incurs an extra cost of $T_{send}(0, s)$ per destination server, as reflected in Formula 3. Non-master server $l$ does not receive the schema string until the server master has sent it to servers 1 through $l-1$, assuming that the server master can only send to one recipient at a time and sends the schema string in increasing server number order; this gives the $l\,T_{send}(0, s)$ term in Formula 3. The second $T_{trans}(s)$ in Formula 3 is the transmission time from the server master to $l$, and $T_{recv}(l, s)$ is the time for $l$ to receive the schema message. All formulas in this paper ignore the possibility of network contention, which has only been an issue for Panda when running on an FDDI ring; we have developed a model for FDDI contention that is not presented here.
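Using the same notation, the startup-phase costs of Formulas 2 and 3 can be evaluated with a short helper; the cost functions below are placeholder assumptions (40 MB/s is the SP2 switch bandwidth quoted in section 3), not measured values.

```python
# Sketch of the startup-phase model (Formulas 2 and 3), homogeneous nodes assumed.

def startup_times(n_servers, schema_bytes, t_build, t_parse, t_send, t_recv, t_trans):
    """Return per-server startup times [T_start_0, ..., T_start_{ns-1}]."""
    # Common path: the client master builds and sends the schema to the server master.
    to_master = t_build + t_send(schema_bytes) + t_trans(schema_bytes) + t_recv(schema_bytes)
    times = [to_master + t_parse]                       # Formula 2: server master
    for l in range(1, n_servers):
        # Formula 3: server l waits for l forwarding sends, then receives and parses.
        times.append(to_master
                     + l * t_send(schema_bytes)
                     + t_trans(schema_bytes) + t_recv(schema_bytes)
                     + t_parse)
    return times

# Example with toy cost functions (seconds); B_net = 40 MB/s as on the SP2 switch.
B_NET = 40e6
t_send = lambda n: 50e-6 + n / B_NET      # assumed: latency + transmission inside the call
t_recv = lambda n: 50e-6
t_trans = lambda n: 0.0                   # already counted inside t_send here
print(startup_times(4, schema_bytes=2048, t_build=1e-4, t_parse=2e-4,
                    t_send=t_send, t_recv=t_recv, t_trans=t_trans))
```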
In the closing phase of Panda 2.0.5, the server master informs the client master of the end of the collective i/o request by sending the `quit' message, which the client master propagates to the other clients. Client i returns from the collective i/o request once it receives the `quit' message. Thus i returns from the i/o request:
$$T^{close}_i = T_{send}(0,\,q) + T_{trans}(q) + T_{recv}(cm,\,q) + i\,T_{send}(cm,\,q) + T_{trans}(q) + T_{recv}(i,\,q) \tag{4}$$

where $q$ is the length of the `quit' message. Here, $n_c$ is the number of clients for the i/o request, the client master is node 0, and the client master sends the `quit' message to the other clients in node number order. The first component of Formula 4, $T_{send}(0, q) + T_{trans}(q) + T_{recv}(cm, q)$, is the time required for the server master to send a `quit' message to the client master and for the client master to receive it. The term $i\,T_{send}(cm, q)$ in Formula 4 reflects the time for the client master to propagate the `quit' message, assuming it can only send to one recipient at a time.
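A matching sketch for the closing phase (Formula 4), again with placeholder cost functions of our own:

```python
# Sketch of the closing-phase model: the 'quit' message travels from the server
# master to the client master, which then forwards it to one client at a time.

def closing_time(i, quit_bytes, t_send, t_recv, t_trans):
    to_client_master = t_send(quit_bytes) + t_trans(quit_bytes) + t_recv(quit_bytes)
    forwarding = i * t_send(quit_bytes)              # client master sends to 1..i in order
    delivery = t_trans(quit_bytes) + t_recv(quit_bytes) if i > 0 else 0.0
    return to_client_master + forwarding + delivery

t = lambda n: 50e-6 + n / 40e6                       # toy send/recv cost, 40 MB/s network
print([round(closing_time(i, 64, t, t, lambda n: 0.0), 6) for i in range(4)])
```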
The intermediate phase in Panda 2.0.5 can be modeled using Formula 6:
$$T^{inter}_l = \sum_{k=1}^{N_l} T_{sub}(l,\,k) \tag{6}$$

where $N_l$ is the total number of subchunks at server $l$, and $T_{sub}(l, k)$ is the cost associated with processing a subchunk $k$ at server $l$:

$$T_{sub}(l,\,k) = T_{oc}(l) + m_{l,k}\,\big(T_{bound}(l) + T_{req}(l)\big) + \sum_{i=1}^{m_{l,k}} \big(D_{l,k,i} + T_{ovl}(l,\,i)\big) + T_{fs}(l,\,k) \tag{7}$$

Here, $T_{oc}(l)$ is the time to determine the overlapping clients for a subchunk on $l$. For a particular i/o request, $T_{oc}(l)$ is a constant for all subchunks at $l$, since the computational steps in computing overlapping clients are the same for all subchunks. For each overlapping client of $k$, the server calculates the boundary of overlap on that client and sends the client a request describing the overlap. $T_{bound}(l)$ is the time to calculate the overlap boundary on $l$, and $T_{req}(l)$ is the time for $l$ to send the request message to an overlapping client. $m_{l,k}$ is the number of request messages sent by $l$ for subchunk $k$, which is also the number of overlapping clients for $k$.
In Panda 2.0.5, when a server gathers subchunk k, it
sends requests to
all overlapping clients for k
before receiving any overlap.
This allows the
overlapping clients to work in parallel
as soon as they receive the request messages.
$D_{l,k,i}$ is the delay before $l$ receives the requested overlap from overlapping client $i$.
After l sends a request
to i, l will not be able to
receive the requested overlap until i
receives and processes the request
and sends the requested data.
We define these costs seen at $i$ for handling a request as follows:

$$H_i = T_{recv}(i,\,r) + T_{proc}(i) + T_{send}(i,\,o_i) \tag{8}$$

where the first component of Formula 8, $T_{recv}(i, r)$, is the time to receive the request of $r$ bytes from a server at $i$, $T_{proc}(i)$ is the request processing time at $i$, and $T_{send}(i, o_i)$ is the time to send the requested overlap of $o_i$ bytes to the requesting server. If the overlapping clients are numbered from 1 to $m_{l,k}$, and $l$ always sends and receives messages in increasing node number order, then $D_{l,k,i}$ can be defined as follows:

$$D_{l,k,i} = \max\big(0,\; H_i - W_{l,k,i}\big) \tag{11}$$

In Formula 11, $W_{l,k,i}$ is the time required for server $l$ to send requests to overlapping clients numbered above $i$ and to receive the overlaps from overlapping clients numbered below $i$:

$$W_{l,k,i} = \sum_{j=i+1}^{m_{l,k}} \big(T_{bound}(l) + T_{req}(l)\big) + \sum_{j=1}^{i-1} \big(D_{l,k,j} + T_{ovl}(l,\,j)\big)$$

Intuitively, Formula 11 says that as long as the time to handle the request at $i$, $H_i$, is shorter than $W_{l,k,i}$, then by the time $l$ is ready to receive $i$'s overlap, $i$ has already sent it, and the delay is 0. Otherwise, when $l$ is ready to receive from $i$, it must sit idle waiting for data from $i$ during the delay due to slow processing at $i$; $D_{l,k,i}$ is then the difference between the handling time at $i$ and the time $W_{l,k,i}$ at which $l$ has finished sending requests to overlapping clients numbered above $i$ and has also received the data from overlapping clients numbered below $i$.
Figure 1 illustrates
this scenario. In (a),
l requests overlaps from
clients 1 and 2. Client
1 finishes sending the overlap before l
is ready to receive, thus $D_{l,k,1}$ is zero.
Client 2 also finishes sending its overlap
before l is ready, so the
delay caused by client 2 is also zero.
Thus l sees no delay. In (b),
client 1 is slow to handle l's request.
The resulting delay seen by l
is the difference between the time needed for client 1
to handle the request from l and the sum of the times for
l to compute the overlap boundary on client 2
and send the request to client 2.
Client 2 causes no delay, since by the time
l is ready to receive client 2's overlap,
it has already been sent.
Figure 1: The delay scenarios when gathering a subchunk on server l in Panda 2.0.5. Tsend and Treceive are the times for l to send a request to a client and receive an overlap from a client.
$T_{ovl}(l, i)$ is the time required to receive the overlap from $i$, and $T_{fs}(l, k)$ is the time spent in the file system call to write out subchunk $k$ at $l$.
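Putting the intermediate-phase pieces together, the following sketch evaluates Formulas 7, 8, and 11 for one subchunk; all cost values are invented placeholders, and the two example calls merely contrast many small overlaps with a few large ones.

```python
# Sketch of the per-subchunk cost model (Formulas 7, 8, and 11), our notation.

def subchunk_cost(overlaps, t_oc, t_bound, t_req, t_recv_ovl, t_handle, t_fs):
    """overlaps: the overlapping clients; t_handle(c): client-side handling time (Formula 8)."""
    m = len(overlaps)
    cost = t_oc + m * (t_bound + t_req)          # find overlaps, send all requests first
    recv_total = 0.0
    for idx, client in enumerate(overlaps, start=1):
        # W: what l does between sending this client's request and receiving from it
        w = (m - idx) * (t_bound + t_req) + recv_total
        delay = max(0.0, t_handle(client) - w)   # Formula 11
        recv_total += delay + t_recv_ovl
    return cost + recv_total + t_fs              # Formula 7

handle = lambda c: 150e-6                        # assumed client handling time
small = subchunk_cost(list(range(16)), 50e-6, 10e-6, 60e-6, 80e-6, handle, 300e-6)
large = subchunk_cost(list(range(2)),  50e-6, 10e-6, 60e-6, 300e-6, handle, 300e-6)
print(f"16 small overlaps: {small*1e3:.2f} ms; 2 large overlaps: {large*1e3:.2f} ms")
```

Even in this toy setting, the fixed per-message costs dominate when a subchunk has many small overlaps, which is the effect the blackhole benchmark exposes in section 3.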
In this section, we model the hardware and software underlying Panda on the IBM SP2 at the Numerical Aerodynamic Simulation (NAS) division of NASA Ames Research Center, using the AIX 4.1.1.3 JFS file system and an IBM commercial version of the MPI message passing system.
Panda 2.0.5 uses standard blocking MPI message passing [MPI94], under which MPI_Send does not return until the sender process can safely reuse the message buffer. Often small messages use a `buffering' protocol where MPI_Send returns after the message is buffered by MPI, even though the message may not be sent to the receiver yet. Large messages often use a non-buffering protocol, as described in the `rendezvous' protocol of [Franke94] for MPI-F. We found that our MPI switches from the buffering to the non-buffering protocol when message size reaches 4 KB.
The `communication latency' $T_{send}(k, n)$ (respectively $T_{recv}(k, n)$) is the time spent in the send (receive) call by a process when it sends (receives) a message. Since the NAS SP2 has homogeneous nodes and the computing environment on the SP2 nodes is homogeneous as well, $T_{send}(k, n)$ can be simplified to $T_{send}(n)$ for all nodes $k$, and similarly for $T_{recv}$. We present the model for message sends; receives are similar:

$$T_{send}(n) = \alpha_{send} + \frac{n}{B_{net}} + \left\lceil \frac{n}{p} \right\rceil \beta \tag{12}$$

Here, $n$ is the number of bytes in the message, $\alpha_{send}$ is the startup cost (message latency) for sending a message, $B_{net}$ is the physical network transmission rate, $\beta$ is the per-packet software messaging overhead, and $p$ is the packet size.
The per packet software messaging
overhead includes the computational costs
for processing a packet, such as generating message headers,
doing checksums, and copying messages
from the user buffer to the communication subsystem buffers.
This overhead can be removed from Formula 12 if
the packetization activities can be performed on the network adapter
card. However, the network cards on the NAS SP2 do not perform
any packet processing, and all the processing is done on the host node
processor.
The parameters we need to model message passing cost accurately in Panda are $\alpha_{send}$, $B_{net}$, $\beta$, and $p$.
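The model of Formula 12, combined with the 4 KB protocol switch described above, can be sketched as follows; the parameter values are placeholders rather than the measured Table 1 numbers, and counting the transmission term only under the non-buffering protocol is our assumption about where that time shows up.

```python
import math

# Sketch of the message-passing cost model (Formula 12) with the 4 KB switch
# between the buffering and non-buffering MPI protocols. Values are illustrative.

ALPHA = 50e-6           # per-message startup cost (latency), assumed
B_NET = 40e6            # physical network transmission rate, bytes/s (SP2 switch peak)
BETA = 5e-6             # per-packet software messaging overhead, assumed
PACKET = 1024           # packet size in bytes, assumed
EAGER_LIMIT = 4 * 1024  # buffering protocol used below this message size

def t_send(n):
    packets = math.ceil(n / PACKET)
    cost = ALPHA + packets * BETA
    if n >= EAGER_LIMIT:
        # Non-buffering (rendezvous) protocol: the send call does not complete
        # until the data has actually been transmitted.
        cost += n / B_NET
    return cost

for n in (256, 4 * 1024, 64 * 1024, 1 << 20):
    print(f"{n:>8} bytes: {t_send(n) * 1e6:8.1f} us")
```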
AIX JFS is a memory mapped file system. Normal file accesses bypass
kernel buffers, allowing file caching in memory whenever
extra memory is available.
The maximum and minimum amounts of memory reserved for file caching are
specified by the MAXPERM and MINPERM parameters in the system.
A `write-behind'
policy is used to
increase write performance, and normal file system accesses
are non-blocking.
(AIX divides each file into
16 KB partitions. The pages of a partition are not written to
disk until the application writes the first byte of the next partition.
At that point, the file system schedules the write of the
first partition to disk.)
Thus when room is available in the file cache,
$T_{fs}(l, k)$ in Formula 7
is the time required for AIX to buffer the data in the
cache during the fwrite call for subchunk k.
The subsequent disk activity can be
overlapped with Panda's processing of the k+1st subchunk.
When the file cache is already full,
the fwrite forces dirty cached
pages to disk before returning.
Since Panda ensures that all data is on disk
before the application resumes,
each server issues an fsync after writing out its final subchunk.
Thus $T_{fs}(l, k)$, the time spent in file system calls after gathering subchunk $k$, is as follows:

$$T_{fs}(l,\,k) = T_{buf}(l,\,k) + T_{flush}(l,\,k), \qquad T_{flush}(l,\,k) = \frac{b_{l,k}}{B_{fs}}$$

Here, $T_{buf}(l, k)$ is the time to buffer subchunk $k$ in the file cache for $l$, and $T_{flush}(l, k)$ is the time to write dirty pages to disk after gathering $k$. $b_{l,k}$ is the number of bytes written to disk in the fwrite call for $k$, and $B_{fs}$ is the peak file system throughput.
We assume that Panda writes utilize the full bandwidth
of the underlying file system.
This assumption would be false for most parallel i/o systems, but
Panda's server-directed i/o architecture employs logically sequential
file reads and writes, which in the presence of plenty of disk space
on file systems like AIX JFS, translates to physically sequential
reads and writes.
Thus the parameters we use to characterize JFS
behavior are the write-behind policy, the peak file
system throughput,
and the file cache limits defined by MAXPERM and MINPERM.
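A toy version of this file-system cost model is sketched below, under our own simplifying assumptions: the cache limit and buffering rate are invented, the 16 KB write-behind partitions are not modeled explicitly, and background disk activity that overlaps with Panda's own processing is ignored.

```python
# Toy JFS write-cost model: fwrite only buffers data while the cache has room;
# once the cache limit is exceeded the excess is flushed before the call
# returns, and fsync forces out whatever is still dirty.

B_FS = 3e6            # peak file system throughput, bytes/s (3 MB/s local SP2 disk)
BUFFER_RATE = 80e6    # assumed rate of copying a subchunk into the file cache

class JfsModel:
    def __init__(self, cache_bytes):
        self.cache_limit = cache_bytes     # MAXPERM-style limit, illustrative value
        self.dirty = 0

    def fwrite(self, nbytes):
        t = nbytes / BUFFER_RATE              # T_buf: copy the subchunk into the cache
        self.dirty += nbytes
        if self.dirty > self.cache_limit:     # cache full: fwrite blocks on the disk
            overflow = self.dirty - self.cache_limit
            t += overflow / B_FS              # T_flush = b / B_fs
            self.dirty = self.cache_limit
        return t

    def fsync(self):
        t = self.dirty / B_FS                 # remaining dirty pages forced to disk
        self.dirty = 0
        return t

fs = JfsModel(cache_bytes=64 * 1024 * 1024)
total = sum(fs.fwrite(1 << 20) for _ in range(128)) + fs.fsync()
print(f"modeled time to write 128 MB on one server: {total:.1f} s")
```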
Our initial goal in developing the performance model is to ease the job of Panda developers by providing them with accurate performance predictions as they study design trade-offs, do performance analysis, identify bottlenecks, and seek optimal strategies for handling i/o requests. In this section, we validate the model and then show how to use it to identify potential bottlenecks and suggest optimal execution plans.
All the experiments and predictions described here are for the NAS SP2 described earlier. Each node is an RS6000/590 with a 66.7 MHz POWER2 multi-chip RISC processor, running AIX 4.1.1.3, with a 256 KB data cache, 128 MB memory, and a local disk of about 2 GB with 3 MB/s disk bandwidth. The SP2 interconnect is a high-performance switch with peak bidirectional bandwidth of 40 MB/s. As each node has its own file system, each can be a Panda server, using a local disk partition of approximately 1.2 GB that was 16-22% full at the time of our experiments.
Table 1: The system parameters of the performance model for Panda 2.0.5
on the NAS IBM SP2.
Table 1 lists the values for system parameters. All empirically obtained measurements use the MPI_Wtime() timer, which offers microsecond accuracy, and we report the average of multiple trials.
We measured latency for MPI_Send and MPI_Recv in standard mode, MPI throughput, per packet software messaging overhead, and peak file system throughput per node. Experiments not described here led us to choose 1 MB as the maximum subchunk size.
To predict performance for a particular request, we also need values for the computational costs $T_{build}$, $T_{parse}$, $T_{oc}$, $T_{bound}$, and $T_{proc}$ in Formulas 2, 7, and 8. Since we have integrated the performance model into a Panda simulator, we determine these costs empirically, by running the part of Panda that takes the request configuration as input, does the actual computations, and outputs $T_{build}$, $T_{parse}$, $T_{oc}$, $T_{bound}$, and $T_{proc}$.
To predict the performance of a particular collective i/o
request, we run the computational cost predictor and
feed its outputs into the remainder of the integrated
performance model.
We use the same strategy to compute the request-dependent quantities in Formulas 3, 5, 6, and 7, such as the schema message length, the number of subchunks at each server, and the number of overlapping clients for each subchunk.
To validate the model, we predicted Panda 2.0.5's performance
on the baseline benchmark we used for Panda 2.0 and also on a
``blackhole benchmark'' that abstracts the
i/o behavior of a
black hole simulation application.
In both cases, the experiments measure the
performance of Panda's high-level
array i/o routines that write n arrays.
The baseline benchmark reads and writes one large 3D
array while varying the number of
clients, the number of servers, the schema for in-memory
arrays and the schema for arrays on disk.
The blackhole benchmark
reads and writes a large number of
small arrays under the same conditions.
More precisely, in the blackhole benchmark, Panda writes
1024 or 2048 3D arrays of size 64 KB or 128 KB.
In all cases, the logical client mesh is fixed for each number of clients (16, 32, and 48), and the logical server mesh for n servers is (n x 1 x 1). The in-memory array schema is (BLOCK, BLOCK, BLOCK) and the on-disk schema is (BLOCK, *, *), i.e., row-major or column-major.
Tables 2 and 3 show the normalized differences between the predictions of the performance model and Panda 2.0.5's actual performance for representative samples of the baseline benchmark and the blackhole benchmark. Due to space limitations, the tables present the results for write requests; reads are similar. The table entries are the difference between predicted and average measured elapsed time, divided by the average measured elapsed time. In all cases the prediction is within 20% of the measured performance result.
Table 2: The normalized difference
between predicted and actual average elapsed times for Panda 2.0.5
writes on the NAS SP2 for the baseline benchmarks. The leftmost column gives
total data size.
Table 3: The normalized difference
between predicted elapsed times using the Panda 2.0.5
performance model and the average of the actual elapsed times
measured on NAS's SP2 for the blackhole benchmark writes.
To facilitate identifying the potential performance bottlenecks, we have developed a Panda simulator that takes values for the model parameters and outputs the predicted values of performance parameters. The simulator reports not only the cost of collective i/o requests, but also the costs of individual components, such as costs for communication and the computational components. Moreover, the simulator generates the predicted performance results for Panda running under a variety of simulated system conditions, as explained below.
The blackhole benchmark exposed several performance problems in Panda (the file system utilization dropped to 30% in some cases). To track down the problems, we used the simulator to report the cost of a collective i/o request under three simulated system conditions: the normal NAS environment in which Panda currently runs, the normal environment but with infinitely fast disks (`disk simulation'), and the normal environment with infinitely fast disks and an infinitely fast network (`network simulation'). The disk simulations clarify the relative speed of assembly of subchunks on each server, and the network simulations clarify the relative computational cost for assembling subchunks, letting us identify bottlenecks under different operating conditions.
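The 'what-if' use of the simulator can be illustrated with a stripped-down stand-in for the full cost model; the three scenarios correspond to the normal, disk-simulation, and network-simulation runs, and all the numbers below are invented for illustration.

```python
# Stand-in cost model: one aggregate time per component instead of the full
# Panda simulator; the what-if runs simply zero out the disk or network costs.

def what_if(compute, communicate, disk, data_mb):
    scenarios = [("normal", compute + communicate + disk),
                 ("infinitely fast disk", compute + communicate),
                 ("infinitely fast disk and network", compute)]
    for label, t in scenarios:
        print(f"{label:>33}: {data_mb / t:6.2f} MB/s per server")

# A blackhole-like request: computation and communication dominate the disk time.
what_if(compute=20.0, communicate=35.0, disk=43.0, data_mb=128)
```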
Figures 2 and 3 show sample results
from simulations of the blackhole and the baseline
benchmarks respectively.
Each group of bars gives predicted
Panda throughputs for one i/o request.
The left bar is throughput
under normal system conditions,
the middle bar is for the disk simulation, and the
right bar is for the network simulation.
In the baseline benchmarks, the network simulation throughput is extremely high, so under normal system conditions Panda's performance is not limited by the computational components. Similarly, the disk simulation indicates that Panda could sustain its high file system utilization even if the disks were running at 25 MB/s rather than 3 MB/s. Clearly the system bottleneck in the baseline benchmark is disk speed, with current file system utilization close to the possible maximum.
On the other hand, Panda's blackhole benchmark throughput for the network simulations is under 3.9 MB/s, and in one case (32 clients with 8 servers) is even lower than the real disk throughput. But with communication cost included, all throughputs drop below 2.06 MB/s. Clearly, Panda is now limited by the speed of its internal computation and communication.
We asked the simulator to report the number of subchunks on each server and the number of messages needed to gather all of a server's subchunks. The results show that the blackhole benchmark runs slowly because each array in the collective i/o request is fairly small (64 KB) and is broken into even smaller chunks and subchunks on the clients and the servers. Note that most applications will not have as many arrays as the blackhole benchmark. However, the benchmark is reflective of applications that use cyclic distributions to divide arrays into very small chunks for load balancing, and also of the small data items found in non-grid computation. Sending and receiving small overlaps is extremely slow. Furthermore, there is a fixed computational overhead for assembling a subchunk. Even though each server writes out the same total amount of data in the blackhole and baseline benchmarks in Figures 2 and 3, each server has many fewer subchunks in the baseline benchmark, and each subchunk is larger. The speed of writes in the blackhole experiments is limited by the speed of the assembly of the subchunks, and the speed of writes in the baseline experiments is limited by the speed of the underlying file system. Based on this analysis, [Chen96] devised a new technique to reduce subchunk assembly cost and showed how it improved performance.
Figure 2: Predicted throughput per server for writing 2048 3D arrays of size 64 KB each using different numbers of clients and servers under three simulated system conditions in the blackhole benchmark.
Figure 3: Predicted throughput per server for writing one 3D array of size 128 MB using different numbers of clients and servers under three simulated system conditions in the baseline benchmark.
Figure 4: More blackhole benchmark predicted results. The graph shows the elapsed time of the startup and intermediate phases on each server writing 2048 arrays of size 64 KB or 128 KB, using different numbers of clients and servers. Each group of bars details the performance of a server. With 24 clients, Panda writes 2048 64 KB arrays, and with 32 clients, Panda writes 2048 128 KB arrays. The client and server meshes and the array schemas are the same as in the previous figures.
Figure 5: The decomposition of a small 3D array into client chunks (solid lines) and server subchunks (dotted lines). Clients use a (BLOCK, BLOCK, BLOCK) schema and a (3 x 4 x 2) mesh, and servers use a (BLOCK, *, *) schema, a (4 x 1 x 1) mesh (left) and a (8 x 1 x 1) mesh (right). In both graphs, two of the subchunks have 16 overlapping clients, while the others only have 8.
Identifying potential performance bottlenecks and simplifying performance analysis is important for Panda developers, but in the long term we want to replace case-by-case hand-tuning of Panda with suggestions of optimal parameter values to users, or with automatic selection of the best parameter values. In this section we show two examples of using the performance model to find optimal plans.
As the first example, we use the performance model to
study the performance of each i/o
node separately.
In Figure 4,
each group of bars represents the predicted elapsed time in
the startup and intermediate phase on
each server.
Since
all servers
synchronize before entering the closing phase,
Panda's performance is largely determined by the
speed of the slowest server.
Many servers utilize the file system extremely well,
but in Figure 4 with 24 clients, moving from
4 to 8 servers brings no significant performance improvement,
because two of the eight servers are twice as slow as the others.
The simulator's report indicates that the slow servers
send twice as many messages as the others,
even though the number of subchunks and the amount of
data written to disk are about the same as for the faster servers.
Examination of the request configurations
explains the load-imbalance
problem.
The experiments with 24 clients used a (3 x 4 x 2) client mesh with a (BLOCK, BLOCK, BLOCK) distribution, while servers used an (n x 1 x 1) mesh and a (BLOCK, *, *) distribution. With 4 and 8 servers, two of the servers each have to send 16 messages to assemble their subchunks, since each of their subchunks has 16 overlapping clients, while the others only need to send 8 messages, as shown in Figure 5.
In our test cases,
the subchunks are small, and small messages do not
utilize the available MPI bandwidth well, so
the extra communication cost
is very significant, and the bottleneck
on those servers is in communication and computation.
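The imbalance can be reproduced with a few lines of arithmetic; the helper names and array extents below are illustrative, chosen so the BLOCK boundaries fall as in Figure 5.

```python
from math import ceil

# Count how many client chunks a server subchunk overlaps under BLOCK distributions.

def blocks_touched(extent, owners, lo, hi):
    """How many of `owners` equal-size BLOCKs of [0, extent) intersect [lo, hi)."""
    size = ceil(extent / owners)
    return (hi - 1) // size - lo // size + 1

def overlapping_clients(array_shape, client_mesh, sub_lo, sub_hi):
    count = 1
    for n, m, lo, hi in zip(array_shape, client_mesh, sub_lo, sub_hi):
        count *= blocks_touched(n, m, lo, hi)
    return count

# 24 clients on a (3 x 4 x 2) mesh; an (n x 1 x 1) server mesh with (BLOCK, *, *)
# splits only the first dimension, so each subchunk spans dimensions 2 and 3.
shape, clients = (24, 16, 8), (3, 4, 2)
for servers in (4, 8):
    size0 = ceil(shape[0] / servers)
    counts = [overlapping_clients(shape, clients,
                                  (s * size0, 0, 0),
                                  (min(shape[0], (s + 1) * size0), shape[1], shape[2]))
              for s in range(servers)]
    print(servers, "servers ->", counts)
```

With either 4 or 8 servers, two subchunks overlap 16 clients while the rest overlap 8, matching the counts quoted in the caption of Figure 5.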
A similar problem is also seen when 32 clients and 6 servers are used. The client mesh is (2 x 4 x 4) and the server mesh is (6 x 1 x 1), as shown in Figure 4 with 128 KB arrays.
One way to improve file system utilization is to choose a different
server mesh and a different number of servers. Figure 4
shows that
3 servers and a server mesh under which each subchunk only overlaps with one array chunk on one client give very high file system utilization (close to the peak possible) for 24 clients. With 3 servers
the elapsed time of the startup and intermediate phases
is about the same as if more servers are used,
even though each of the 3 servers has more data to write
than if there were 4 or 8 servers.
Furthermore, with 3 servers data messages are more quickly
constructed than with 4 or 8 servers (no subarray extraction), and
the larger message sizes that result give better MPI
utilization.
Our predicted performance results
using the performance model and the simulator are within
8% of the measured results.
Figure 6: The decomposition of a small 3D array into client chunks (solid lines) and server subchunks (dotted lines). Clients use a (BLOCK, BLOCK, BLOCK) in-memory schema and a (2 x 4 x 4) mesh. Server meshes and disk schemas are indicated in the figure.
Figure 7: Predicted response times for writing 2048 3D arrays each of size 64 KB or 128 KB, using 32 clients and 8 servers with different server meshes and disk schemas. The logical client mesh is (2 x 4 x 4), and the in-memory schema is (BLOCK, BLOCK, BLOCK). The logical server meshes are indicated in the figure. The disk schemas are (*, BLOCK, BLOCK), (BLOCK, BLOCK, BLOCK), (BLOCK, BLOCK, *), (BLOCK, BLOCK, BLOCK), (BLOCK, *, *), and (*, *, BLOCK) from left to right.
Sometimes response time, rather than file system utilization,
is the measure most important to minimize.
Figure 4 suggests that the use of 6 servers can accomplish
this goal for the 24-client case.
Alternatively, if the number of servers is predetermined by
outside forces, the use of a different disk schema or mesh
can offer response time improvement. In the previous examples
in this paper, we only considered (BLOCK, *, *) disk schemas
since they allow easy offloading of data into a single
row-major or column-major file for postprocessing.
However, if postprocessing will be done through a Panda
interface, other disk schemas
are worth consideration.
When Panda clients extract
subarrays from their chunks and send them
to the requesting servers, the elements of the
subarrays may not be in contiguous memory locations.
If the subchunks on the servers are aligned with the
array chunks on the clients, a client can send its
entire array chunk or the subarray that is located in a contiguous
memory region, saving on assembly time.
Figure 7 shows the predictions of the performance model for writing 2048 arrays of size 64 KB or 128 KB using 32 clients and 8 servers with 6 different server meshes and disk schemas, a (2 x 4 x 4) client mesh, and a (BLOCK, BLOCK, BLOCK) memory schema. Most configurations give
similar response times, with over
96% of peak JFS write
throughput;
these have subchunks that
align nicely with client chunks, reducing the number of messages
and the amount of copying required.
On the other hand, with an (8 x 1 x 1) server mesh and disk schema (BLOCK, *, *), each server has 16 overlapping clients (Figure 6),
and
computational and communication costs dominate;
the fifth group of bars in Figure 7 shows a factor of 2.5 slowdown
in writing 64 KB arrays, and a 50% slowdown for 128 KB arrays.
The response time using a (1 x 1 x 8) server mesh and a (*, *, BLOCK) schema falls between the best case and the worst case when writing 64 KB arrays, because each subchunk has 8 overlapping clients (Figure 6).
In the last two groups of bars in Figure 7,
128 KB arrays perform better than 64 KB arrays,
because the larger
message sizes utilize MPI better and the file system becomes
the bottleneck.
These two different approaches to improving Panda performance give a flavor of the space of possible optimizations that we will use as a search space for Panda's future collective i/o request optimizer.
Several performance studies target recently developed parallel i/o systems and algorithms. [Kotz94] compares simulated performance using disk-directed i/o with that of traditional caching on irregular distributions of data. Vesta file system performance is analyzed in [Feitelson95]. Panda 2.0's performance is studied in [Seamons95]. None of these uses an abstract model, which would help to identify potential bottlenecks that did not show up in the particular experiments performed, and could lead to fair cross-system comparisons.
Much work targets performance analysis and modeling of computation-intensive applications on sequential and parallel platforms. [Clement93] developed an analytical performance prediction model to predict the performance of parallel applications on parallel platforms and help programmers make better use of the power of multicomputers. The approach they took to model the performance of a parallel application is similar to ours; however, they emphasized predicting the performance of parallel applications, while we focus on using the model to guide system development. [Saavedra89] describes an approach to predict the performance of arbitrary Fortran programs by measuring the performance of a given system in terms of a set of Fortran abstract machine operations exercised by a microbenchmark suite; the Fortran abstract machine measures the performance of a set of basic Fortran language constructs. [Worley95] extended the abstract machine approach to parallel machines. [Menasce92] presented an analytic methodology for measuring the performance of parallel applications running on shared memory machines; the performance model takes into account the application structure and machine parameters. The Pablo performance analysis environment [Reed93] can be used to study the i/o characteristics of user applications. Integrating this type of application-level performance analysis tool with the Panda performance model can greatly help users understand Panda's behavior and also help Panda understand the application's characteristics and requirements. The integrated system can narrow the gap between Panda developers and application developers so that Panda can best be utilized to provide high performance.
Several efforts have evaluated i/o performance for different types of workloads on sequential platforms. [Chen93] proposed a self-scaling parameterized benchmark for i/o performance prediction and evaluation; their work addresses issues such as the scalability of i/o benchmarks to future machines and different workloads. [Ganger93] classified i/o requests into three classes based on their effects on system performance, and developed a system-level trace-driven simulation model to study disk scheduling algorithms. The emphasis is placed on overall system performance under different algorithms rather than on improvement of the i/o subsystem alone.
In this paper we first presented a generic performance model for the Panda parallel i/o library, and then customized the model to match the implementation of Panda 2.0.5 and particular versions of the AIX file system and MPI message passing library that underlie Panda. The performance model greatly simplifies performance analysis for Panda. Its accurate predictions reduce the amount of empirical experimentation needed by helping the experimenter to quickly zero in on potential trouble areas and bench-test fixes for them. As an example, we showed the use of the model to predict load balance problems due to differing message loads for individual i/o nodes. We also showed how to use the model to suggest optimal plans for i/o requests.
A run-time version of the performance model has been integrated into Panda, and will be the cornerstone of our approach to automatic optimization of i/o requests at run time, our next focus of effort. To the best of our knowledge, Panda is currently the only parallel i/o system with a performance model, but as models are developed for other parallel i/o systems in the future, we would also like to use the models for direct comparison of approaches to parallel i/o.