Y. Chen
ying@cs.uiuc.edu
http://bunny.cs.uiuc.edu/CDR/people/Chen.html
M. Winslett
winslett@cs.uiuc.edu
http://bunny.cs.uiuc.edu/CDR/people/Winslett.html
S. Kuo
s-kuo@cs.uiuc.edu
http://bunny.cs.uiuc.edu/CDR/people/Kuo.html
Y. Cho
ycho@cs.uiuc.edu
http://bunny.cs.uiuc.edu/CDR/people/Cho.html
M. Subramaniam
masubram@us.oracle.com
http://bunny.cs.uiuc.edu/CDR/people/Subraman.html
K. E. Seamons
seamons@transarc.com
http://bunny.cs.uiuc.edu/CDR/people/Seamons.html
Panda (URL http://bunny.cs.uiuc.edu/CDR/panda/) is an i/o library motivated by the needs of high-performance SPMD scientific applications that must input and output multidimensional arrays on distributed memory parallel platforms or networks of workstations. Panda supports collective i/o operations where all the processors used by an application are closely synchronized and periodically cooperate in conceptually simple i/o operations, such as reading or writing an entire array. Panda provides application developers with a high-level interface for array i/o and offers an efficient commodity-parts-based implementation of the interface across a variety of computer architectures using server-directed i/o, which allows long sequential reads and writes at each i/o node.
Despite numerous implementations of parallel i/o, few detailed performance analyses are available. The behavior of a parallel i/o system is usually difficult to analyze empirically, let alone characterize abstractly. Yet an abstract characterization is essential for `automatic' optimization of i/o request handling; otherwise the parallel i/o system must be tuned by hand for each platform and workload encountered, and will not be able to adapt automatically to changing conditions. Furthermore, the absence of a performance model makes it much more difficult to identify potential performance bottlenecks, necessitating time-consuming empirical experimentation under realistic conditions on often-scarce hardware and software resources, and the limited number of experiments those resources allow may not tell the whole performance story. As another complication, in the absence of standard benchmarks and performance models, cross-library, cross-version, cross-platform, and even cross-installation predictions and comparisons of performance are extremely difficult. Finally, without a performance model for a parallel i/o system, it is hard to learn from past successes and failures when building a new system or a new system version.
In section 2 of this paper, we present a generic performance model for Panda, then customize it for Panda 2.0.5. In section 3, we validate the model by comparing its performance predictions to the real measured performance on an SP2. We then give examples of the use of the model in identifying potential performance bottlenecks, and show how model outputs can help make optimal plans for i/o operations. Section 4 discusses the related work in this field and section 5 concludes the paper.
In this section, we briefly present Panda's architecture, followed by an abstract performance model and then a customized model for Panda 2.0.5. A more detailed description of Panda's architecture can be found in [Seamons95], along with documentation of Panda's excellent performance on the IBM SP2.
Panda consists of clients and servers (one per compute node and i/o node, respectively). Under Panda's server-directed i/o, the application communicates with its local Panda client via a high-level interface for collective i/o of multidimensional arrays. The clients then pass a high-level description of the i/o request to the servers, so the servers understand both the array data distribution on disk and in memory. Then servers sequentially read data from files, scattering the data to clients as it arrives from disk, and employing the reverse strategy for writes.
Panda currently supports applications that distribute arrays across all clients using HPF-style BLOCK- and *-based array schemas (distributions). We call the subarrays resulting from a distribution ``chunks''. The schemas for an array on disk and in memory can be completely different, and Panda converts the data from one schema to another automatically during the i/o request. The application can request i/o of an entire set of arrays in one Panda call.
During i/o, Panda divides very large array chunks into sequences of smaller pieces, called `subchunks'. (A chunk that does not exceed the subchunk size limit can be thought of as a single subchunk.) Panda's unit of reads and writes is the subchunk, not a disk block, so for writes each server gathers one subchunk before issuing a file system level write. Similarly, after reading one subchunk, the server scatters it to the appropriate clients before performing the next file system level read. Thus the size chosen for subchunks determines server buffer requirements, and is a tunable parameter.
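As a concrete illustration of this decomposition, the sketch below (our own, not Panda source: the helper names are invented, the split along the first dimension is just one plausible subchunking strategy, and the 1 MB cap anticipates the subchunk size limit discussed in section 3) computes HPF-style BLOCK chunk bounds for one client and splits a chunk into subchunks.

```python
# Illustrative sketch (not Panda code): BLOCK decomposition of an array over a
# logical mesh, and division of a chunk into size-limited subchunks.

def block_bounds(extent, nblocks, b):
    """Lower/upper index bounds of block b when extent is split into nblocks."""
    size = -(-extent // nblocks)            # ceiling division: HPF BLOCK size
    lo = b * size
    hi = min(extent, lo + size)
    return lo, hi

def chunk_for_client(array_shape, mesh, coord):
    """Per-dimension bounds of the subarray owned under a (BLOCK, ..., BLOCK) schema."""
    return [block_bounds(n, m, c) for n, m, c in zip(array_shape, mesh, coord)]

def subchunks(chunk_bounds, elem_bytes, max_bytes=1 << 20):
    """Split a chunk into pieces of at most max_bytes by slicing its first dimension."""
    (lo0, hi0), rest = chunk_bounds[0], chunk_bounds[1:]
    plane = elem_bytes
    for lo, hi in rest:
        plane *= hi - lo                    # bytes per unit of the first dimension
    rows = max(1, max_bytes // plane)       # first-dimension extent of one subchunk
    for start in range(lo0, hi0, rows):
        yield [(start, min(hi0, start + rows))] + list(rest)

if __name__ == "__main__":
    shape, mesh = (256, 256, 256), (2, 4, 4)   # 128 MB array of 8-byte elements
    coord = (1, 2, 0)                          # one client's position in the mesh
    bounds = chunk_for_client(shape, mesh, coord)
    print("chunk bounds:", bounds)
    print("subchunks:", list(subchunks(bounds, elem_bytes=8)))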
Conceptually, each i/o request has three phases---startup, intermediate, and closing---at each client and server. The clients and servers enter their startup phase at approximately the same time when the application calls the Panda clients and the clients tell the servers about the request by sending a high-level description of the i/o request. Clients then enter the intermediate phase. After digesting the high-level description, each server enters its intermediate phase.
During the intermediate phase, each server understands which chunks are its responsibility, and gathers/scatters its assigned subchunks one at a time, issuing a file system level write after each subchunk is gathered (or a read before each subchunk is scattered). For writes, for each assigned disk subchunk, the server computes which clients own the pieces of the subchunk (the overlapping clients). For each overlapping client, the server calculates the indices of the relevant portion of the array chunk at that client and then requests that subarray from the client. On receiving a request message, the overlapping client extracts the overlap and sends it to the requesting server. Reads are the reverse. When the server's last assigned subchunk is flushed to disk or scattered to the clients, its intermediate phase is over. After it synchronizes with the other servers it enters its closing phase. Synchronization ensures that all data have been written to disk before Panda clients return from a write request, and all data have been scattered to the requesting clients for a read request (the blocking i/o execution model of Panda as discussed in section 2.2). The synchronization is unnecessary if a non-blocking i/o execution model is supported in Panda. The closing phase of a client begins once all the data it wants to read are in its buffer space or all the data it wants to write have been sent to the servers. During the closing phase, servers send a `quit' message to the clients, leave the closing phase, and await the next collective i/o request. Each client returns from the i/o request after it receives a `quit' message, and its closing phase ends at that point.
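The write-side gather can be summarized with a toy, self-contained model (our own simplification, not Panda code): the array is one-dimensional, each client owns one BLOCK of it in memory, and a server gathers a disk subchunk by requesting the overlapping piece from each overlapping client before issuing one sequential write.

```python
# Toy model of server-directed gathering for a write. All names are illustrative.

ARRAY = list(range(24))                  # the global array
N_CLIENTS = 4
BLOCK = len(ARRAY) // N_CLIENTS

# Client i holds elements [i*BLOCK, (i+1)*BLOCK) in memory.
client_memory = {i: ARRAY[i * BLOCK:(i + 1) * BLOCK] for i in range(N_CLIENTS)}

def overlapping_clients(lo, hi):
    """Clients whose in-memory block intersects the disk subchunk [lo, hi)."""
    for i in range(N_CLIENTS):
        c_lo, c_hi = i * BLOCK, (i + 1) * BLOCK
        o_lo, o_hi = max(lo, c_lo), min(hi, c_hi)
        if o_lo < o_hi:
            yield i, (o_lo, o_hi)

def client_extract(i, lo, hi):
    """What client i sends back when a server requests the region [lo, hi)."""
    base = i * BLOCK
    return client_memory[i][lo - base:hi - base]

def server_gather_subchunk(lo, hi):
    """Gather one disk subchunk: send every request first, then receive the overlaps."""
    requests = list(overlapping_clients(lo, hi))     # all requests go out first
    buf = []
    for i, (o_lo, o_hi) in requests:                 # then the overlaps are gathered
        buf.extend(client_extract(i, o_lo, o_hi))
    return buf                                       # ready for one sequential write

# The server's disk chunk [5, 17) split into two subchunks.
for lo, hi in [(5, 13), (13, 17)]:
    print((lo, hi), "->", server_gather_subchunk(lo, hi))
```

Reads follow the reverse pattern: the server performs a sequential read and scatters the pieces back to the overlapping clients.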
All Panda versions currently only support blocking Panda calls, i.e., clients do not return control to the application until all data are written to disk or read into application memory, and this assumption is built into the performance model. Thus the cost of a sequence of application i/o requests is the sum of the cost of each request, and our goal in this paper is to model the cost of a single collective i/o request. Under our model, nodes can be heterogeneous, e.g., with different file system or CPU characteristics. We choose the response time for a single collective i/o request and the throughput per Panda server as the performance metrics of interest. To ease the discussion, the array write operation is used as the running example; reads are very similar.
The performance model has three categories of parameters: for the request configuration (R), Panda internals (P), and the underlying system (S). R characterizes the collective i/o request, including the number of clients and servers, the number of arrays, array ranks and sizes, the client and server meshes, and the schemas for memory and disk. S characterizes the underlying hardware and software (UNIX, MPI) that Panda uses heavily. Strategies used in the three phases described in section 2.1 can be viewed as Panda's internal parameters P. For example, P includes the subchunk size.
The generic Panda model is simple; the complexities arise when customizing the model as in section 2.3. We use i to designate a client and l to designate a server. The elapsed time of a collective i/o request to Panda is measured at the clients:
$$T = \max_i T_i, \qquad T_i = \max_{0 \le l < n_s}\big(T^{start}_l + T^{inter}_l\big) + T^{close}_i \tag{1}$$

Here $T_i$ is the elapsed time at client $i$, $T^{start}_l$ and $T^{inter}_l$ are the times server $l$ spent in the startup and intermediate phases, and $T^{close}_i$ is the time $i$ spent in the closing phase. There are $n_s$ servers, numbered 0 to $n_s - 1$.
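In code, Formula 1 amounts to taking the slowest server's startup plus intermediate time and adding each client's closing cost; a minimal sketch, with function and parameter names of our own and the per-phase times assumed to be known:

```python
# Direct transcription of Formula 1 (our notation): the response time seen by a
# client is the slowest server's startup + intermediate time plus the client's
# own closing time; the collective request time is the slowest client.

def request_time(startup, intermediate, closing):
    """startup[l], intermediate[l]: per-server phase times; closing[i]: per-client."""
    slowest_server = max(s + m for s, m in zip(startup, intermediate))
    per_client = [slowest_server + c for c in closing]   # T_i for each client i
    return max(per_client)                               # T for the whole request

# Example: 2 servers, 3 clients (times in seconds).
print(request_time(startup=[0.010, 0.012],
                   intermediate=[1.8, 2.1],
                   closing=[0.002, 0.003, 0.004]))
```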
In this section, we tailor the generic model to fit Panda 2.0.5, using writes as the running example.
The startup phase of Panda 2.0.5 can be modeled as follows:
$$T^{start}_0 = T_{build}(cm) + T_{send}(cm,\,s) + T_{trans}(s) + T_{recv}(0,\,s) + T_{parse}(0) \tag{2}$$

$$T^{start}_l = T_{build}(cm) + T_{send}(cm,\,s) + T_{trans}(s) + T_{recv}(0,\,s) + l\,T_{send}(0,\,s) + T_{trans}(s) + T_{recv}(l,\,s) + T_{parse}(l), \quad 1 \le l < n_s \tag{3}$$

where, assuming no special broadcast facilities are used in Panda 2.0.5, $T^{start}_l$ can be broken into the time spent at a selected client, the client master (denoted $cm$ in Formula 2), at a selected server, the server master (numbered 0 in Formula 2), on the physical network, and at $l$. At the client master, $T^{start}_l$ includes the time to construct the schema message of length $s$ bytes, $T_{build}(cm)$, and the time to send it to the server master, $T_{send}(cm, s)$. Here, $T_{send}(k, n)$ and $T_{recv}(k, n)$ are the times spent in the send and receive calls provided by the underlying communication system on node $k$ when sending and receiving a message of length $n$ bytes without network contention. $T_{send}$ and $T_{recv}$ may or may not include the time spent in message transmission on the physical network, depending on the communication protocols used by the underlying communication system. If $T_{send}$ or $T_{recv}$ already includes the transmission time, $T_{trans}(n)$ is zero; otherwise, $T_{trans}(n)$ is the time to transmit $n$ bytes: $T_{trans}(n) = n / B_{net}$, where $B_{net}$ is the transmission rate of the physical network. We assume $B_{net}$ does not depend on the particular sender and recipient, as is appropriate in the environments where we currently run Panda.

$T_{trans}(s)$ and $T_{recv}(0, s)$ in Formula 2 are the transmission time for sending the schema message from the client master to the server master, and the time for the server master to receive the schema message assuming no delay from network contention. $T^{start}_l$ also includes the time for $l$ to parse the schema message, $T_{parse}(l)$. In Panda 2.0.5, the server master forwards the schema message to the other servers, so it incurs an extra cost of $T_{send}(0, s)$ per destination server, as reflected in Formula 3. Non-master server $l$ does not receive the schema string until the server master has sent it to servers 1 through $l-1$, assuming that the server master can only send to one recipient at a time and sends the schema string in increasing server number order; this gives the $l\,T_{send}(0, s)$ term in Formula 3. The second $T_{trans}(s)$ in Formula 3 is the transmission time from the server master to $l$, and $T_{recv}(l, s)$ is the time for $l$ to receive the schema message. All formulas in this paper ignore the possibility of network contention, which has only been an issue for Panda when running on an FDDI ring; we have developed a model for FDDI contention that is not presented here.
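Using the same notation, the startup-phase costs of Formulas 2 and 3 can be evaluated with a short helper; the cost functions below are placeholder assumptions (40 MB/s is the SP2 switch bandwidth quoted in section 3), not measured values.

```python
# Sketch of the startup-phase model (Formulas 2 and 3), homogeneous nodes assumed.

def startup_times(n_servers, schema_bytes, t_build, t_parse, t_send, t_recv, t_trans):
    """Return per-server startup times [T_start_0, ..., T_start_{ns-1}]."""
    # Common path: the client master builds and sends the schema to the server master.
    to_master = t_build + t_send(schema_bytes) + t_trans(schema_bytes) + t_recv(schema_bytes)
    times = [to_master + t_parse]                       # Formula 2: server master
    for l in range(1, n_servers):
        # Formula 3: server l waits for l forwarding sends, then receives and parses.
        times.append(to_master
                     + l * t_send(schema_bytes)
                     + t_trans(schema_bytes) + t_recv(schema_bytes)
                     + t_parse)
    return times

# Example with toy cost functions (seconds); B_net = 40 MB/s as on the SP2 switch.
B_NET = 40e6
t_send = lambda n: 50e-6 + n / B_NET      # assumed: latency + transmission inside the call
t_recv = lambda n: 50e-6
t_trans = lambda n: 0.0                   # already counted inside t_send here
print(startup_times(4, schema_bytes=2048, t_build=1e-4, t_parse=2e-4,
                    t_send=t_send, t_recv=t_recv, t_trans=t_trans))
```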
In the closing phase of Panda 2.0.5, the server master informs the client master of the end of the collective i/o request by sending the `quit' message, which the client master propagates to the other clients. Client i returns from the collective i/o request once it receives the `quit' message. Thus i returns from the i/o request:
$$T^{close}_i = T_{send}(0,\,q) + T_{trans}(q) + T_{recv}(cm,\,q) + i\,T_{send}(cm,\,q) + T_{trans}(q) + T_{recv}(i,\,q) \tag{4}$$

where $q$ is the length of the `quit' message. Here, $n_c$ is the number of clients for the i/o request, the client master is node 0, and the client master sends the `quit' message to the other clients in node number order. The first component of Formula 4, $T_{send}(0, q) + T_{trans}(q) + T_{recv}(cm, q)$, is the time required for the server master to send a `quit' message to the client master and for the client master to receive it. The term $i\,T_{send}(cm, q)$ in Formula 4 reflects the time for the client master to propagate the `quit' message, assuming it can only send to one recipient at a time.
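A matching sketch for the closing phase (Formula 4), again with placeholder cost functions of our own:

```python
# Sketch of the closing-phase model: the 'quit' message travels from the server
# master to the client master, which then forwards it to one client at a time.

def closing_time(i, quit_bytes, t_send, t_recv, t_trans):
    to_client_master = t_send(quit_bytes) + t_trans(quit_bytes) + t_recv(quit_bytes)
    forwarding = i * t_send(quit_bytes)              # client master sends to 1..i in order
    delivery = t_trans(quit_bytes) + t_recv(quit_bytes) if i > 0 else 0.0
    return to_client_master + forwarding + delivery

t = lambda n: 50e-6 + n / 40e6                       # toy send/recv cost, 40 MB/s network
print([round(closing_time(i, 64, t, t, lambda n: 0.0), 6) for i in range(4)])
```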
The intermediate phase in Panda 2.0.5 can be modeled using Formula 6:
$$T^{inter}_l = \sum_{k=1}^{N_l} T_{sub}(l,\,k) \tag{6}$$

where $N_l$ is the total number of subchunks at server $l$, and $T_{sub}(l, k)$ is the cost associated with processing a subchunk $k$ at server $l$:

$$T_{sub}(l,\,k) = T_{oc}(l) + m_{l,k}\,\big(T_{bound}(l) + T_{req}(l)\big) + \sum_{i=1}^{m_{l,k}} \big(D_{l,k,i} + T_{ovl}(l,\,i)\big) + T_{fs}(l,\,k) \tag{7}$$

Here, $T_{oc}(l)$ is the time to determine the overlapping clients for a subchunk on $l$. For a particular i/o request, $T_{oc}(l)$ is a constant for all subchunks at $l$, since the computational steps in computing overlapping clients are the same for all subchunks. For each overlapping client of $k$, the server calculates the boundary of overlap on that client and sends the client a request describing the overlap. $T_{bound}(l)$ is the time to calculate the overlap boundary on $l$, and $T_{req}(l)$ is the time for $l$ to send the request message to an overlapping client. $m_{l,k}$ is the number of request messages sent by $l$ for subchunk $k$, which is also the number of overlapping clients for $k$.
In Panda 2.0.5, when a server gathers subchunk k, it
sends requests to
all overlapping clients for k
before receiving any overlap.
This allows the
overlapping clients to work in parallel
as soon as they receive the request messages.
$D_{l,k,i}$ is the delay before $l$ receives the requested overlap from overlapping client $i$.
After l sends a request
to i, l will not be able to
receive the requested overlap until i
receives and processes the request
and sends the requested data.
We define these costs seen at $i$ for handling a request as follows:

$$H_i = T_{recv}(i,\,r) + T_{proc}(i) + T_{send}(i,\,o_i) \tag{8}$$

where the first component of Formula 8, $T_{recv}(i, r)$, is the time to receive the request of $r$ bytes from a server at $i$, $T_{proc}(i)$ is the request processing time at $i$, and $T_{send}(i, o_i)$ is the time to send the requested overlap of $o_i$ bytes to the requesting server. If the overlapping clients are numbered from 1 to $m_{l,k}$, and $l$ always sends and receives messages in increasing node number order, then $D_{l,k,i}$ can be defined as follows:

$$D_{l,k,i} = \max\big(0,\; H_i - W_{l,k,i}\big) \tag{11}$$

In Formula 11, $W_{l,k,i}$ is the time required for server $l$ to send requests to overlapping clients numbered above $i$ and to receive the overlaps from overlapping clients numbered below $i$:

$$W_{l,k,i} = \sum_{j=i+1}^{m_{l,k}} \big(T_{bound}(l) + T_{req}(l)\big) + \sum_{j=1}^{i-1} \big(D_{l,k,j} + T_{ovl}(l,\,j)\big)$$

Intuitively, Formula 11 says that as long as the time to handle the request at $i$, $H_i$, is shorter than $W_{l,k,i}$, then by the time $l$ is ready to receive $i$'s overlap, $i$ has already sent it, and the delay is 0. Otherwise, when $l$ is ready to receive from $i$, it must sit idle waiting for data from $i$ during the delay due to slow processing at $i$; $D_{l,k,i}$ is then the difference between the handling time at $i$ and the time $W_{l,k,i}$ at which $l$ has finished sending requests to overlapping clients numbered above $i$ and has also received the data from overlapping clients numbered below $i$.
Figure 1 illustrates
this scenario. In (a),
l requests overlaps from
clients 1 and 2. Client
1 finishes sending the overlap before l
is ready to receive, thus $D_{l,k,1}$ is zero.
Client 2 also finishes sending its overlap
before l is ready, so the
delay caused by client 2 is also zero.
Thus l sees no delay. In (b),
client 1 is slow to handle l's request.
The resulting delay seen by l
is the difference between the time needed for client 1
to handle the request from l and the sum of the times for
l to compute the overlap boundary on client 2
and send the request to client 2.
Client 2 causes no delay, since by the time
l is ready to receive client 2's overlap,
it has already been sent.
Figure 1: The delay scenarios when gathering a subchunk on server l in Panda 2.0.5. Tsend and Treceive are the times for l to send a request to a client and receive an overlap from a client.
$T_{ovl}(l, i)$ is the time required to receive the overlap from $i$, and $T_{fs}(l, k)$ is the time spent in the file system call to write out subchunk $k$ at $l$.
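Putting the intermediate-phase pieces together, the following sketch evaluates Formulas 7, 8, and 11 for one subchunk; all cost values are invented placeholders, and the two example calls merely contrast many small overlaps with a few large ones.

```python
# Sketch of the per-subchunk cost model (Formulas 7, 8, and 11), our notation.

def subchunk_cost(overlaps, t_oc, t_bound, t_req, t_recv_ovl, t_handle, t_fs):
    """overlaps: the overlapping clients; t_handle(c): client-side handling time (Formula 8)."""
    m = len(overlaps)
    cost = t_oc + m * (t_bound + t_req)          # find overlaps, send all requests first
    recv_total = 0.0
    for idx, client in enumerate(overlaps, start=1):
        # W: what l does between sending this client's request and receiving from it
        w = (m - idx) * (t_bound + t_req) + recv_total
        delay = max(0.0, t_handle(client) - w)   # Formula 11
        recv_total += delay + t_recv_ovl
    return cost + recv_total + t_fs              # Formula 7

handle = lambda c: 150e-6                        # assumed client handling time
small = subchunk_cost(list(range(16)), 50e-6, 10e-6, 60e-6, 80e-6, handle, 300e-6)
large = subchunk_cost(list(range(2)),  50e-6, 10e-6, 60e-6, 300e-6, handle, 300e-6)
print(f"16 small overlaps: {small*1e3:.2f} ms; 2 large overlaps: {large*1e3:.2f} ms")
```

Even in this toy setting, the fixed per-message costs dominate when a subchunk has many small overlaps, which is the effect the blackhole benchmark exposes in section 3.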
In this section, we model the hardware and software underlying Panda on the IBM SP2 at the Numerical Aerodynamic Simulation (NAS) division of NASA Ames Research Center, using the AIX 4.1.1.3 JFS file system and an IBM commercial version of the MPI message passing system.
Panda 2.0.5 uses standard blocking MPI message passing [MPI94], under which MPI_Send does not return until the sender process can safely reuse the message buffer. Often small messages use a `buffering' protocol where MPI_Send returns after the message is buffered by MPI, even though the message may not be sent to the receiver yet. Large messages often use a non-buffering protocol, as described in the `rendezvous' protocol of [Franke94] for MPI-F. We found that our MPI switches from the buffering to the non-buffering protocol when message size reaches 4 KB.
The `communication latency' $T_{send}(k, n)$ (respectively $T_{recv}(k, n)$) is the time spent in the send (receive) call by a process when it sends (receives) a message. Since the NAS SP2 has homogeneous nodes and the computing environment on the SP2 nodes is homogeneous as well, $T_{send}(k, n)$ can be simplified to $T_{send}(n)$ for all nodes $k$, and similarly for $T_{recv}$. We present the model for message sends; receives are similar:

$$T_{send}(n) = \alpha_{send} + \frac{n}{B_{net}} + \left\lceil \frac{n}{p} \right\rceil \beta \tag{12}$$

Here, $n$ is the number of bytes in the message, $\alpha_{send}$ is the startup cost (message latency) for sending a message, $B_{net}$ is the physical network transmission rate, $\beta$ is the per-packet software messaging overhead, and $p$ is the packet size.
The per packet software messaging
overhead includes the computational costs
for processing a packet, such as generating message headers,
doing checksums, and copying messages
from the user buffer to the communication subsystem buffers.
This overhead can be removed from Formula 12 if
the packetization activities can be performed on the network adapter
card. However, the network cards on the NAS SP2 do not perform
any packet processing, and all the processing is done on the host node
processor.
The parameters we need to model message passing cost accurately in Panda are $\alpha_{send}$, $B_{net}$, $\beta$, and $p$.
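The model of Formula 12, combined with the 4 KB protocol switch described above, can be sketched as follows; the parameter values are placeholders rather than the measured Table 1 numbers, and counting the transmission term only under the non-buffering protocol is our assumption about where that time shows up.

```python
import math

# Sketch of the message-passing cost model (Formula 12) with the 4 KB switch
# between the buffering and non-buffering MPI protocols. Values are illustrative.

ALPHA = 50e-6           # per-message startup cost (latency), assumed
B_NET = 40e6            # physical network transmission rate, bytes/s (SP2 switch peak)
BETA = 5e-6             # per-packet software messaging overhead, assumed
PACKET = 1024           # packet size in bytes, assumed
EAGER_LIMIT = 4 * 1024  # buffering protocol used below this message size

def t_send(n):
    packets = math.ceil(n / PACKET)
    cost = ALPHA + packets * BETA
    if n >= EAGER_LIMIT:
        # Non-buffering (rendezvous) protocol: the send call does not complete
        # until the data has actually been transmitted.
        cost += n / B_NET
    return cost

for n in (256, 4 * 1024, 64 * 1024, 1 << 20):
    print(f"{n:>8} bytes: {t_send(n) * 1e6:8.1f} us")
```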
AIX JFS is a memory mapped file system. Normal file accesses bypass
kernel buffers, allowing file caching in memory whenever
extra memory is available.
The maximum and minimum amounts of memory reserved for file caching are
specified by the MAXPERM and MINPERM parameters in the system.
A `write-behind'
policy is used to
increase write performance, and normal file system accesses
are non-blocking.
(AIX divides each file into
16 KB partitions. The pages of a partition are not written to
disk until the application writes the first byte of the next partition.
At that point, the file system schedules the write of the
first partition to disk.)
Thus when room is available in the file cache,
$T_{fs}(l, k)$ in Formula 7
is the time required for AIX to buffer the data in the
cache during the fwrite call for subchunk k.
The subsequent disk activity can be
overlapped with Panda's processing of the k+1st subchunk.
When the file cache is already full,
the fwrite forces dirty cached
pages to disk before returning.
Since Panda ensures that all data is on disk
before the application resumes,
each server issues an fsync after writing out its final subchunk.
Thus $T_{fs}(l, k)$, the time spent in file system calls after gathering subchunk $k$, is as follows:

$$T_{fs}(l,\,k) = T_{buf}(l,\,k) + T_{flush}(l,\,k), \qquad T_{flush}(l,\,k) = \frac{b_{l,k}}{B_{fs}}$$

Here, $T_{buf}(l, k)$ is the time to buffer subchunk $k$ in the file cache for $l$, and $T_{flush}(l, k)$ is the time to write dirty pages to disk after gathering $k$. $b_{l,k}$ is the number of bytes written to disk in the fwrite call for $k$, and $B_{fs}$ is the peak file system throughput.
We assume that Panda writes utilize the full bandwidth
of the underlying file system.
This assumption would be false for most parallel i/o systems, but
Panda's server-directed i/o architecture employs logically sequential
file reads and writes, which in the presence of plenty of disk space
on file systems like AIX JFS, translates to physically sequential
reads and writes.
Thus the parameters we use to characterize JFS
behavior are the write-behind policy, the peak file
system throughput,
and the file cache limits defined by MAXPERM and MINPERM.
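A toy version of this file-system cost model is sketched below, under our own simplifying assumptions: the cache limit and buffering rate are invented, the 16 KB write-behind partitions are not modeled explicitly, and background disk activity that overlaps with Panda's own processing is ignored.

```python
# Toy JFS write-cost model: fwrite only buffers data while the cache has room;
# once the cache limit is exceeded the excess is flushed before the call
# returns, and fsync forces out whatever is still dirty.

B_FS = 3e6            # peak file system throughput, bytes/s (3 MB/s local SP2 disk)
BUFFER_RATE = 80e6    # assumed rate of copying a subchunk into the file cache

class JfsModel:
    def __init__(self, cache_bytes):
        self.cache_limit = cache_bytes     # MAXPERM-style limit, illustrative value
        self.dirty = 0

    def fwrite(self, nbytes):
        t = nbytes / BUFFER_RATE              # T_buf: copy the subchunk into the cache
        self.dirty += nbytes
        if self.dirty > self.cache_limit:     # cache full: fwrite blocks on the disk
            overflow = self.dirty - self.cache_limit
            t += overflow / B_FS              # T_flush = b / B_fs
            self.dirty = self.cache_limit
        return t

    def fsync(self):
        t = self.dirty / B_FS                 # remaining dirty pages forced to disk
        self.dirty = 0
        return t

fs = JfsModel(cache_bytes=64 * 1024 * 1024)
total = sum(fs.fwrite(1 << 20) for _ in range(128)) + fs.fsync()
print(f"modeled time to write 128 MB on one server: {total:.1f} s")
```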
Our initial goal in developing the performance model is to ease the job of Panda developers by providing them with accurate performance predictions as they study design trade-offs, do performance analysis, identify bottlenecks, and seek optimal strategies for handling i/o requests. In this section, we validate the model and then show how to use it to identify potential bottlenecks and suggest optimal execution plans.
All the experiments and predictions described here are for the NAS SP2 described earlier. Each node is an RS6000/590 with a 66.7 MHz POWER2 multi-chip RISC processor, running AIX 4.1.1.3, with a 256 KB data cache, 128 MB memory, and a local disk of about 2 GB with 3 MB/s disk bandwidth. The SP2 interconnect is a high-performance switch with peak bidirectional bandwidth of 40 MB/s. As each node has its own file system, each can be a Panda server, using a local disk partition of approximately 1.2 GB that was 16-22% full at the time of our experiments.
Table 1: The system parameters of the performance model for Panda 2.0.5
on the NAS IBM SP2.
Table 1 lists the values for system parameters. All empirically obtained measurements use the MPI_Wtime() timer, which offers microsecond accuracy, and we report the average of multiple trials.
We measured latency for MPI_Send and MPI_Recv in standard mode, MPI throughput, per packet software messaging overhead, and peak file system throughput per node. Experiments not described here led us to choose 1 MB as the maximum subchunk size.
To predict performance for a particular request, we also need values for the computational costs $T_{build}$, $T_{parse}$, $T_{oc}$, $T_{bound}$, and $T_{proc}$ in Formulas 2, 7, and 8. Since we have integrated the performance model into a Panda simulator, we determine these costs empirically, by running the part of Panda that takes the request configuration as input, does the actual computations, and outputs $T_{build}$, $T_{parse}$, $T_{oc}$, $T_{bound}$, and $T_{proc}$.
To predict the performance of a particular collective i/o
request, we run the computational cost predictor and
feed its outputs into the remainder of the integrated
performance model.
We use the same strategy to compute the request-dependent quantities in Formulas 3, 5, 6, and 7, such as the schema message length, the number of subchunks at each server, and the number of overlapping clients for each subchunk.
To validate the model, we predicted Panda 2.0.5's performance
on the baseline benchmark we used for Panda 2.0 and also on a
``blackhole benchmark'' that abstracts the
i/o behavior of a
black hole simulation application.
In both cases, the experiments measure the
performance of Panda's high-level
array i/o routines that write n arrays.
The baseline benchmark reads and writes one large 3D
array while varying the number of
clients, the number of servers, the schema for in-memory
arrays and the schema for arrays on disk.
The blackhole benchmark
reads and writes a large number of
small arrays under the same conditions.
More precisely, in the blackhole benchmark, Panda writes
1024 or 2048 3D arrays of size 64 KB or 128 KB.
In all cases, the logical client mesh is fixed for each number of clients (16, 32, and 48), and the logical server mesh for n servers is (n x 1 x 1). The in-memory array schema is (BLOCK, BLOCK, BLOCK) and the on-disk schema is (BLOCK, *, *), i.e., row-major or column-major.
Tables 2 and 3 show the normalized differences between the predictions of the performance model and Panda 2.0.5's actual performance for representative samples of the baseline benchmark and the blackhole benchmark. Due to space limitations, the tables present the results for write requests; reads are similar. The table entries are the difference between predicted and average measured elapsed time, divided by the average measured elapsed time. In all cases the prediction is within 20% of the measured performance result.
Table 2: The normalized difference
between predicted and actual average elapsed times for Panda 2.0.5
writes on the NAS SP2 for the baseline benchmarks. The leftmost column gives
total data size.
Table 3: The normalized difference
between predicted elapsed times using the Panda 2.0.5
performance model and the average of the actual elapsed times
measured on NAS's SP2 for the blackhole benchmark writes.
To facilitate identifying the potential performance bottlenecks, we have developed a Panda simulator that takes values for the model parameters and outputs the predicted values of performance parameters. The simulator reports not only the cost of collective i/o requests, but also the costs of individual components, such as costs for communication and the computational components. Moreover, the simulator generates the predicted performance results for Panda running under a variety of simulated system conditions, as explained below.
The blackhole benchmark exposed several performance problems in Panda (the file system utilization dropped to 30% in some cases). To track down the problems, we used the simulator to report the cost of a collective i/o request under three simulated system conditions: the normal NAS environment in which Panda currently runs, the normal environment but with infinitely fast disks (`disk simulation'), and the normal environment with infinitely fast disks and an infinitely fast network (`network simulation'). The disk simulations clarify the relative speed of assembly of subchunks on each server, and the network simulations clarify the relative computational cost for assembling subchunks, letting us identify bottlenecks under different operating conditions.
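The 'what-if' use of the simulator can be illustrated with a stripped-down stand-in for the full cost model; the three scenarios correspond to the normal, disk-simulation, and network-simulation runs, and all the numbers below are invented for illustration.

```python
# Stand-in cost model: one aggregate time per component instead of the full
# Panda simulator; the what-if runs simply zero out the disk or network costs.

def what_if(compute, communicate, disk, data_mb):
    scenarios = [("normal", compute + communicate + disk),
                 ("infinitely fast disk", compute + communicate),
                 ("infinitely fast disk and network", compute)]
    for label, t in scenarios:
        print(f"{label:>33}: {data_mb / t:6.2f} MB/s per server")

# A blackhole-like request: computation and communication dominate the disk time.
what_if(compute=20.0, communicate=35.0, disk=43.0, data_mb=128)
```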
Figures 2 and 3 show sample results
from simulations of the blackhole and the baseline
benchmarks respectively.
Each group of bars gives predicted
Panda throughputs for one i/o request.
The left bar is throughput
under normal system conditions,
the middle bar is for the disk simulation, and the
right bar is for the network simulation.
In the baseline benchmarks, the network simulation throughput is extremely high, so under normal system conditions Panda's performance is not limited by the computational components. Similarly, the disk simulation indicates that Panda could sustain its high file system utilization even if the disks were running at 25 MB/s rather than 3 MB/s. Clearly the system bottleneck in the baseline benchmark is disk speed, with current file system utilization close to the possible maximum.
On the other hand, Panda's blackhole benchmark throughput for the network simulations is under 3.9 MB/s, and in one case (32 clients with 8 servers) is even lower than the real disk throughput. But with communication cost included, all throughputs drop below 2.06 MB/s. Clearly, Panda is now limited by the speed of its internal computation and communication.
We asked the simulator to report the number of subchunks on each server and the number of messages needed to gather all of a server's subchunks. The results show that the blackhole benchmark runs slowly because each array in the collective i/o request is fairly small (64 KB) and is broken into even smaller chunks and subchunks on the clients and the servers. Note that most applications will not have as many arrays as the blackhole benchmark. However, the benchmark is reflective of applications that use cyclic distributions to divide arrays into very small chunks for load balancing, and also of the small data items found in non-grid computation. Sending and receiving small overlaps is extremely slow. Furthermore, there is a fixed computational overhead for assembling a subchunk. Even though each server writes out the same total amount of data in the blackhole and baseline benchmarks in Figures 2 and 3, each server has many fewer subchunks in the baseline benchmark, and each subchunk is larger. The speed of writes in the blackhole experiments is limited by the speed of the assembly of the subchunks, and the speed of writes in the baseline experiments is limited by the speed of the underlying file system. Based on this analysis, [Chen96] devised a new technique to reduce subchunk assembly cost and showed how it improved performance.
Figure 2: Predicted throughput per server for writing 2048 3D arrays of size 64 KB each using different numbers of clients and servers under three simulated system conditions in the blackhole benchmark.
Figure 3: Predicted throughput per server for writing one 3D array of size 128 MB using different numbers of clients and servers under three simulated system conditions in the baseline benchmark.
Figure 4: More blackhole benchmark predicted results. The graph shows the elapsed time of the startup and intermediate phases on each server writing 2048 arrays of size 64 KB or 128 KB, using different numbers of clients and servers. Each group of bars details the performance of a server. With 24 clients, Panda writes 2048 64 KB arrays, and with 32 clients, Panda writes 2048 128 KB arrays. The client and server meshes and the array schemas are the same as in the previous figures.
Figure 5: The decomposition of a small 3D array into client chunks (solid lines) and server subchunks (dotted lines). Clients use a (BLOCK, BLOCK, BLOCK) schema and a (3 x 4 x 2) mesh, and servers use a (BLOCK, *, *) schema, a (4 x 1 x 1) mesh (left) and a (8 x 1 x 1) mesh (right). In both graphs, two of the subchunks have 16 overlapping clients, while the others only have 8.
Identifying potential performance bottlenecks and simplifying performance analysis is important for Panda developers, but in the long term we want to replace case-by-case hand-tuning of Panda with suggestions of optimal parameter values to users, or with automatic selection of the best parameter values. In this section we show two examples of using the performance model to find optimal plans.
As the first example, we use the performance model to
study the performance of each i/o
node separately.
In Figure 4,
each group of bars represents the predicted elapsed time in
the startup and intermediate phase on
each server.
Since
all servers
synchronize before entering the closing phase,
Panda's performance is largely determined by the
speed of the slowest server.
Many servers utilize the file system extremely well,
but in Figure 4 with 24 clients, moving from
4 to 8 servers brings no significant performance improvement,
because two of the eight servers are twice as slow as the others.
The simulator's report indicates that the slow servers
send twice as many messages as the others,
even though the number of subchunks and the amount of
data written to disk are about the same as for the faster servers.
Examination of the request configurations
explains the load-imbalance
problem.
The experiments with 24 clients used a (3 x 4 x 2) client mesh with a (BLOCK, BLOCK, BLOCK) distribution, while servers used an (n x 1 x 1) mesh and a (BLOCK, *, *) distribution. With 4 and 8 servers, two of the servers each have to send 16 messages to assemble their subchunks, since each of their subchunks has 16 overlapping clients, while the others only need to send 8 messages, as shown in Figure 5.
In our test cases,
the subchunks are small, and small messages do not
utilize the available MPI bandwidth well, so
the extra communication cost
is very significant, and the bottleneck
on those servers is in communication and computation.
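The imbalance can be reproduced with a few lines of arithmetic; the helper names and array extents below are illustrative, chosen so the BLOCK boundaries fall as in Figure 5.

```python
from math import ceil

# Count how many client chunks a server subchunk overlaps under BLOCK distributions.

def blocks_touched(extent, owners, lo, hi):
    """How many of `owners` equal-size BLOCKs of [0, extent) intersect [lo, hi)."""
    size = ceil(extent / owners)
    return (hi - 1) // size - lo // size + 1

def overlapping_clients(array_shape, client_mesh, sub_lo, sub_hi):
    count = 1
    for n, m, lo, hi in zip(array_shape, client_mesh, sub_lo, sub_hi):
        count *= blocks_touched(n, m, lo, hi)
    return count

# 24 clients on a (3 x 4 x 2) mesh; an (n x 1 x 1) server mesh with (BLOCK, *, *)
# splits only the first dimension, so each subchunk spans dimensions 2 and 3.
shape, clients = (24, 16, 8), (3, 4, 2)
for servers in (4, 8):
    size0 = ceil(shape[0] / servers)
    counts = [overlapping_clients(shape, clients,
                                  (s * size0, 0, 0),
                                  (min(shape[0], (s + 1) * size0), shape[1], shape[2]))
              for s in range(servers)]
    print(servers, "servers ->", counts)
```

With either 4 or 8 servers, two subchunks overlap 16 clients while the rest overlap 8, matching the counts quoted in the caption of Figure 5.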
A similar problem is also seen when 32 clients and 6 servers are used. The client mesh is (2 x 4 x 4) and the server mesh is (6 x 1 x 1), as shown in Figure 4 with 128 KB arrays.
One way to improve file system utilization is to choose a different
server mesh and a different number of servers. Figure 4
shows that
3 servers and a server mesh under which each subchunk only overlaps with one array chunk on one client give very high file system utilization (close to the peak possible) for 24 clients. With 3 servers
the elapsed time of the startup and intermediate phases
is about the same as if more servers are used,
even though each of the 3 servers has more data to write
than if there were 4 or 8 servers.
Furthermore, with 3 servers data messages are more quickly
constructed than with 4 or 8 servers (no subarray extraction), and
the larger message sizes that result give better MPI
utilization.
Our predicted performance results
using the performance model and the simulator are within
8% of the measured results.
Figure 6: The decomposition of a small 3D array into client chunks (solid lines) and server subchunks (dotted lines). Clients use a (BLOCK, BLOCK, BLOCK) in-memory schema and a (2 x 4 x 4) mesh. Server meshes and disk schemas are indicated in the figure.
Figure 7: Predicted response times for writing 2048 3D arrays each of size 64 KB or 128 KB, using 32 clients and 8 servers with different server meshes and disk schemas. The logical client mesh is (2 x 4 x 4), and the in-memory schema is (BLOCK, BLOCK, BLOCK). The logical server meshes are indicated in the figure. The disk schemas are (*, BLOCK, BLOCK), (BLOCK, BLOCK, BLOCK), (BLOCK, BLOCK, *), (BLOCK, BLOCK, BLOCK), (BLOCK, *, *), and (*, *, BLOCK) from left to right.
Sometimes response time, rather than file system utilization,
is the measure most important to minimize.
Figure 4 suggests that the use of 6 servers can accomplish
this goal for the 24-client case.
Alternatively, if the number of servers is predetermined by
outside forces, the use of a different disk schema or mesh
can offer response time improvement. In the previous examples
in this paper, we only considered (BLOCK, *, *) disk schemas
since they allow easy offloading of data into a single
row-major or column-major file for postprocessing.
However, if postprocessing will be done through a Panda
interface, other disk schemas
are worth consideration.
When Panda clients extract
subarrays from their chunks and send them
to the requesting servers, the elements of the
subarrays may not be in contiguous memory locations.
If the subchunks on the servers are aligned with the
array chunks on the clients, a client can send its
entire array chunk or the subarray that is located in a contiguous
memory region, saving on assembly time.
Figure 7 shows the predictions of the performance model for writing 2048 arrays of size 64 KB or 128 KB using 32 clients and 8 servers with 6 different server meshes and disk schemas, a (2 x 4 x 4) client mesh, and a (BLOCK, BLOCK, BLOCK) memory schema. Most configurations give
similar response times, with over
96% of peak JFS write
throughput;
these have subchunks that
align nicely with client chunks, reducing the number of messages
and the amount of copying required.
On the other hand, with an (8 x 1 x 1) server mesh and disk schema (BLOCK, *, *), each server has 16 overlapping clients (Figure 6),
and
computational and communication costs dominate;
the fifth group of bars in Figure 7 shows a factor of 2.5 slowdown
in writing 64 KB arrays, and a 50% slowdown for 128 KB arrays.
The response time using a (1 x 1 x 8) server mesh and a (*, *, BLOCK) schema falls between the best case and the worst case when writing 64 KB arrays, because each subchunk has 8 overlapping clients (Figure 6).
In the last two groups of bars in Figure 7,
128 KB arrays perform better than 64 KB arrays,
because the larger
message sizes utilize MPI better and the file system becomes
the bottleneck.
These two different approaches to improving Panda performance give a flavor of the space of possible optimizations that we will use as a search space for Panda's future collective i/o request optimizer.
Several performance studies target recently developed parallel i/o systems and algorithms. [Kotz94] compares simulated performance using disk-directed i/o with that of traditional caching on irregular distributions of data. Vesta file system performance is analyzed in [Feitelson95]. Panda 2.0's performance is studied in [Seamons95]. None of these uses an abstract model, which would help to identify potential bottlenecks that did not show up in the particular experiments performed, and could lead to fair cross-system comparisons.
Much work targets performance analysis and modeling of computation-intensive applications on sequential and parallel platforms. [Clement93] developed an analytical performance prediction model to predict the performance of parallel applications on parallel platforms and help programmers make better use of the power of multicomputers. The approach they took to model the performance of a parallel application is similar to ours; however, they emphasized predicting the performance of parallel applications, while we focus on using the model to guide system development. [Saavedra89] describes an approach to predict the performance of arbitrary Fortran programs by measuring the performance of a given system in terms of a set of Fortran abstract machine operations exercised by a microbenchmark suite; the Fortran abstract machine measures the performance of a set of basic Fortran language constructs. [Worley95] extended the abstract machine approach to parallel machines. [Menasce92] presented an analytic methodology for measuring the performance of parallel applications running on shared memory machines; the performance model takes into account the application structure and machine parameters. The Pablo performance analysis environment [Reed93] can be used to study the i/o characteristics of user applications. Integrating this type of application-level performance analysis tool with the Panda performance model can greatly help users understand Panda's behavior and also help Panda understand the application's characteristics and requirements. The integrated system can narrow the gap between Panda developers and application developers so that Panda can best be utilized to provide high performance.
Several efforts have evaluated i/o performance for different types of workloads on sequential platforms. [Chen93] proposed a self-scaling parameterized benchmark for i/o performance prediction and evaluation; their work addresses issues such as the scalability of i/o benchmarks to future machines and different workloads. [Ganger93] classified i/o requests into three classes based on their effects on system performance, and developed a system-level trace-driven simulation model to study disk scheduling algorithms. The emphasis is placed on overall system performance under different algorithms rather than on improvement of the i/o subsystem alone.
In this paper we first presented a generic performance model for the Panda parallel i/o library, and then customized the model to match the implementation of Panda 2.0.5 and particular versions of the AIX file system and MPI message passing library that underlie Panda. The performance model greatly simplifies performance analysis for Panda. Its accurate predictions reduce the amount of empirical experimentation needed by helping the experimenter to quickly zero in on potential trouble areas and bench-test fixes for them. As an example, we showed the use of the model to predict load balance problems due to differing message loads for individual i/o nodes. We also showed how to use the model to suggest optimal plans for i/o requests.
A run-time version of the performance model has been integrated into Panda, and will be the cornerstone of our approach to automatic optimization of i/o requests at run time, our next focus of effort. To the best of our knowledge, Panda is currently the only parallel i/o system with a performance model, but as models are developed for other parallel i/o systems in the future, we would also like to use the models for direct comparison of approaches to parallel i/o.