Hirotaka Ogawa
ogawa@ipl.t.u-tokyo.ac.jp
http://www.ipl.t.u-tokyo.ac.jp/~ogawa
Satoshi Matsuoka
matsu@ipl.t.u-tokyo.ac.jp
Since MPI was announced two years ago, it has been gaining increasing popularity, thanks to its powerful and flexible support of communication, such as different communication contexts via communicators, various synchronous/asynchronous communication modes, derived datatypes, group communications, etc. There have been a number of recent implementations of MPI as well. However, due to the inherent design of its API, the incurred software overhead is large, even compared to previous message passing libraries such as P4 or PVM. This is especially problematic when the hardware latency is low, because much of the benefit of fast networks is lost to software overhead. This phenomenon applies not only to MPPs, but also to NOWs, where the availability of low-cost, low-latency networks such as Myrinet is making low-latency communication possible.
As a result, the application area of MPI has been somewhat restricted to regular, coarse-grained, computation-intensive applications. In other words, attaining efficiency in fine-grained, irregular problems with MPI has been difficult. This is unfortunate, since standard message-passing libraries should encompass a wide variety of platforms and applications, including non-numerical applications, which are typically irregular and communication-intensive.
There has also been a body of work focused on reducing the software overhead of message passing as much as possible [9]. Notably, with Berkeley Active Messages, the incurred software overhead is on the order of several microseconds. The drawback is a relative lack of power and flexibility, and, to some extent, portability. Programming with a native Active Messages library is much more difficult than programming with MPI, because the primitives are lower-level. Furthermore, the current Active Messages does not support OS-level multithreading or network heterogeneity well.
The question then is, would it be possible to have the best of both approaches, i.e., would it be possible to have a low-latency, high-performance message passing library, while retaining the flexibility and power of MPI? The answer is effectively yes---in this paper, we present our OMPI (Optimizing MPI) system, where much of the software overhead is eliminated with partial evaluation techniques, attaining performance which approaches that of Active Messages. C programs that contain MPI function calls are statically analyzed in order to determine which arguments are static, and specialized with respect to those arguments. Because the current partial evaluation techniques are not sufficiently powerful to eliminate all the software overhead, we propose a technique where partial evaluation is combined with selection of pre-optimized template functions. As a result, OMPI guarantees generality and portability of MPI programs, while allowing architecture-specific optimizations to be incorporated at compile-time. OMPI itself is also semi-portable, in that only template functions need to be reimplemented for a particular architecture. This is in contrast to traditional research on MPI implementation, where optimizations were highly architecture-specific.
To validate the effectiveness of our OMPI system, we performed some baseline benchmarks, as well as more extensive benchmarks on a set of application cores with different communication characteristics, on the 64-node Fujitsu AP1000 MPP. The basic point-to-point latency improved from 338 microsec. to 76 microsec., and for the communication-intensive CG solver core, a speedup of over a factor of two was achieved. Even compared to traditional run-time optimization employing dynamic caching techniques, our OMPI system was consistently faster. The results show that our system is effective for various patterns of communication, significantly reducing the software overhead.
The rest of the paper is organized as follows: Section 2
analyzes the source of software overhead in MPI and opportunities
of elimination. Section 3 describes
our OMPI system. Section 4 presents results of the
benchmark.
Section 5 covers some related work,
and we conclude in Section 6.
2. Analysis of Software Overhead in MPI and Opportunities for Elimination
We first analyze the source of the software overhead, identifying
problems pertinent to message passing libraries in general, and those
that are specific to MPI. We then investigate the opportunities for
optimization by removing the overhead when static information is
exploited.
MPI_Send(buf, count, datatype, dest, tag, comm)
void *buf; /* Pointer to Send Buffer */
int count; /* Data Count */
MPI_Datatype datatype; /* Data Type */
int dest; /* Rank (Target Process) */
int tag; /* Message Tag */
MPI_Comm comm; /* Communicator */
MPI_Recv(buf, count, datatype, source, tag, comm, status)
void *buf; /* Pointer to Receive Buffer */
int count; /* Data Count */
MPI_Datatype datatype; /* Data Type */
int source; /* Rank (Source Process) */
int tag; /* Message Tag */
MPI_Comm comm; /* Communicator */
MPI_Status *status; /* Status */
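For reference, a minimal use of these two primitives, a standard exchange between two ranks (not taken from our benchmark code), might look as follows. Note that all six arguments of MPI_Send are static here, which is exactly the situation a specializer can exploit:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, data[10] = { 0 };
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(data, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(data, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    MPI_Finalize();
    return 0;
}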
Here we identify the four sets of parameters, which we call communication sets, that are necessary for message passing. As described below, MPI incurs additional overhead compared to traditional, simpler message passing interfaces such as PVM in order to obtain the communication sets. Later on, we describe how OMPI utilizes the communication sets to optimize MPI code.
The sources of software overhead due to the lack of exploiting static information can be categorized as follows:
As mentioned above, OMPI is built as a preprocessor which passes the optimized MPI program to the backend C compiler of the target machine. We have found SUIF to be well-suited for the purpose, saving us considerable development time and achieving portability at the same time. Furthermore, as other research groups develop relevant optimizers, they could be integrated easily into our system to further optimize MPI programs.
Figure 2: Overview of Optimization Passes
The post-tsel peval pass optimizes the internals of instantiated templates, which will be explained in the following sections.
We note that not all template functions need to be created. In fact, any function can be substituted by a version that is conservative with respect to its arguments being dynamic. For example, X_long could safely be substituted for X_long_remote.
Table 1: List of building template functions
The algorithm for selecting a template function based on static/dynamic argument information is as follows:
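The sketch below is our reconstruction of the selection logic in plain C; the actual tsel pass operates on the SUIF intermediate representation, and the template names correspond to the 9 cases of Table 1. The threshold constant anticipates the 60 KByte crossover measured in Section 4:

#define XFER_THRESHOLD (60 * 1024)  /* AP1000 crossover, see Section 4 */

typedef enum { LOC_UNKNOWN, LOC_LOCAL, LOC_REMOTE } Locality;

const char *select_template(int size_is_static, long size_bytes,
                            int node_is_static, Locality loc)
{
    if (size_is_static && node_is_static)
        return size_bytes <= XFER_THRESHOLD
            ? (loc == LOC_LOCAL ? "X_short_local" : "X_short_remote")
            : (loc == LOC_LOCAL ? "X_long_local"  : "X_long_remote");
    if (size_is_static)  /* node dynamic: fall back conservatively */
        return size_bytes <= XFER_THRESHOLD ? "X_short" : "X_long";
    if (node_is_static)  /* size dynamic */
        return loc == LOC_LOCAL ? "X_local" : "X_remote";
    return "X_generic";  /* nothing known statically */
}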
/* Original call site: */
MPI_Send(buf, 10, MPI_INT, 2, tag, MPI_COMM_WORLD)
/* ...is rewritten by tsel into: */
MPI_Send_Short_Remote_0001(buf, 10, MPI_INT, 2, tag, MPI_COMM_WORLD)
/* ...where a macro drops the statically known arguments: */
#define MPI_Send_Short_Remote_0001(buf, count, type, dest, tag, comm) \
MPI_Send_short_remote_0001(buf, tag)
/* ...and the instantiated template binds them as constants: */
int MPI_Send_short_remote_0001(void *_buf, int _tag)
{
    /* preamble */
    const int _count = 10;
    const MPI_Datatype _type = MPI_INT;
    const int _dest = 2;
    const MPI_Comm _comm = MPI_COMM_WORLD;
    /* body */
    ....
}
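After instantiation, the post-tsel peval pass propagates these preamble constants through the body. As a hedged illustration with a hypothetical body fragment (mpi_sizeof is an illustrative helper, not an actual MPI call):

/* Before peval (hypothetical body fragment): */
int nbytes = _count * mpi_sizeof(_type);  /* message length in bytes */

/* After peval, with _count = 10 and _type = MPI_INT folded in: */
int nbytes = 40;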
Instantiation of specialized template functions does increase code size. However, since the expansion of MPI functions is not recursive, the only potential problem is the expansion of loops and recursive functions with embedded MPI calls. OMPI avoids this problem by limiting loop expansion to 4 iterations and by not expanding mutually recursive functions more than once; as a result, the code-size increase is almost proportional to the original code size, and has been minimal in practice so far.
4. Performance Measurements
In order to validate the effectiveness of our OMPI system, we performed
some baseline benchmarks, and also more extensive benchmarks on application
cores. The results show that our system is effective for various patterns
of communication, and significantly reduces the software overhead, even
compared with traditional optimization techniques.
We chose the Fujitsu AP1000 as the target of our prototype implementation. As mentioned earlier, the communication performance of the AP1000 relative to its processor performance is considerably high (25 MBytes/sec node-to-node vs. a 25 MHz Sparc IU + FPU). Thus, even a small software overhead dominates the loss in communication performance. The AP1000 provides two modes of message communication. One is interrupt-based and employs the DMA controller; although its startup overhead is large, the send is nonblocking. The other is Line-sending, where the values contained in a cache line can be sent directly with explicit cache flushing, eliminating the need for copying, interrupts, and DMA setup. On the other hand, the sender must block until the entire line is sent; also, although the receive is done transparently into a ring buffer without interrupts, the value must be copied.
As a baseline for comparison, we chose MPICH [7], a public-domain MPI implementation by Argonne National Laboratory and Mississippi State University. In order to port MPICH to the AP1000, one only needs to implement the lower-level ADI functions. Since MPICH relies heavily on the performance of the ADI functions, it is critical for performance comparisons that the ADI functions be implemented efficiently. We reused the source code of the tuned native AP1000 message passing libraries to meet this requirement.
Figure 4: Cache manager handle
For derived datatypes, the argument comparison in the preamble could be costly, as the entire dynamic data structure must be traversed. The simple solution is not to cache such arguments; an alternative strategy is to use a fast but conservative matching function: for example, on systems where the operating system can trap writes, any write to a page containing a buffer could invalidate a match by setting a flag.
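A minimal POSIX sketch of this write-trap strategy follows (not implemented in the current OMPI prototype; a production version would check si_addr and chain to any previous handler). A cache lookup then only tests buffer_dirty instead of traversing the datatype:

#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t buffer_dirty;  /* set on first write */
static char  *guard_base;
static size_t guard_len;

/* On the first write to the protected region, invalidate the cached
   match and restore write access so the program can proceed. */
static void on_write(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si; (void)ctx;
    buffer_dirty = 1;
    mprotect(guard_base, guard_len, PROT_READ | PROT_WRITE);
}

/* Write-protect the page(s) covering buf. */
static void guard_buffer(void *buf, size_t len)
{
    size_t pg = (size_t)sysconf(_SC_PAGESIZE);
    struct sigaction sa;

    guard_base = (char *)((uintptr_t)buf & ~(pg - 1));
    guard_len  = (((uintptr_t)buf + len + pg - 1) & ~(pg - 1))
                 - (uintptr_t)guard_base;
    sa.sa_sigaction = on_write;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
    buffer_dirty = 0;
    mprotect(guard_base, guard_len, PROT_READ);
}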
In the current prototype, the template functions for the 9 cases involving the communication sets were hand-created. Because we do not yet have tool support for creating template functions, only a small subset of MPI has been implemented so far. Here we describe the specifics for MPI_Send and MPI_Recv. For MPI_Send, the templates for five cases have the specializations/optimizations described in Table 2. The remaining four cases (X_long_remote, X_short_remote, X_long_local, X_short_local) were created by combining these optimizations. Similarly, for MPI_Recv, the specializations/optimizations are described in Table 3, and likewise the remaining four cases were derived by combination. Optimizations in other MPI functions are similar, which gives us confidence that at least a semi-automated tool could be created to greatly ease the task of template creation.
We also note that the X_generic (i.e., unoptimized) template functions of OMPI are essentially the same as the corresponding MPICH implementations, both in robustness (e.g., error checking) and in execution time.
Table 2: Specializing MPI_Send for AP1000
Table 3: Specializing MPI_Recv for AP1000
We also tailor the selection heuristics of tsel. When there are multiple message transports, the selection of the transport depends on CommSize and the characteristics of the underlying message passing architecture. On the AP1000, it is in general better to employ Line-sending for short messages, whereas DMA+Interrupts is preferable for long messages. For the 64-node platform we employed, our tests showed that the threshold is at 60 KBytes. This threshold is used to determine whether _short_ or _long_ would be faster.
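Concretely, the heuristic amounts to a single comparison inside a template. When CommSize is static, peval folds the branch away and only one transport path remains. The sketch below uses illustrative transport names, not actual AP1000 library calls:

extern void line_send(void *buf, long nbytes, int dest);           /* illustrative */
extern void dma_interrupt_send(void *buf, long nbytes, int dest);  /* illustrative */

#define AP1000_XFER_THRESHOLD (60 * 1024)  /* measured 60 KByte crossover */

/* When nbytes is a compile-time constant, peval reduces this to a
   direct call to one of the two transports. */
static void send_message(void *buf, long nbytes, int dest)
{
    if (nbytes <= AP1000_XFER_THRESHOLD)
        line_send(buf, nbytes, dest);           /* short: low setup cost */
    else
        dma_interrupt_send(buf, nbytes, dest);  /* long: nonblocking DMA */
}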
The leftmost two columns indicate the performance of the native AP1000 message passing library for sending a null message. DMA requires 193 microsec. vs. 37 microsec. for Line-sending; we can see that both the hardware and software setup times of DMA are significantly greater. The Line-sending latency is close to what one obtains with Active Messages (the latency of polling-based Active Messages on the AP1000 has been reported to be approximately 9 microsec. [8]).
The middle three columns show the dynamic-cache-optimized MPI. The software overhead of the initial setup (i.e., a cache miss) is significantly greater than that of DMA, but for cache hits involving basic datatypes, both software and hardware overhead are reduced significantly, coming close to those of DMA. The software overhead reduction is mainly due to the elimination of error checks and of the dynamic computation of target nodes from the communication set, as the PCR allows such communication contexts to be cached and passed almost directly to the underlying DMA routine. On the other hand, the software overhead for derived datatypes is significant, due to the interpretation/traversal overhead of the cache comparison check mentioned earlier, nullifying the gains obtained with caching.
The rightmost columns are the OMPI results. The 'Generic' column is for when no static information is available; 'CommNode' and 'CommSize' indicate the cases where the respective communication sets are identified as static; and 'Both' means that both are known statically. Here we see that, even with partial information, our optimizations result in significant overhead reduction. In particular, when CommSize is known, since our message is below the 60 KBytes threshold, Line-sending is selected, which greatly reduces the mandatory hardware latency (from 112 microsec. to 24 microsec.). For the 'Both' case, the results are dramatic: both hardware and software overhead are reduced to 1/4 of the unoptimized generic case, down to 76 microsec. from 338 microsec. Although the software overhead is still larger than that of the native AP1000 message passing library, we strongly believe that we can close this gap with further improvements in partial evaluation.
Figure 5: Ping-Pong overhead result
An interesting result was obtained for LU; we had initially speculated that this benchmark would be disadvantageous for OMPI, as its communication pattern cannot be determined at compile time. Surprisingly, the result indicated otherwise, improving significantly over generic MPI and winning over dynamic-cache MPI by a notable factor. Closer analysis revealed that the PCR cache was being invalidated by the irregularity of LU's communication pattern, and that the cost of cache management was adding considerable overhead. By tuning the dynamic cache optimization, e.g., by not relying on the PCR, such overhead could be reduced. Still, the benchmark shows that, even for irregular communication, our partial evaluation strategy eliminates a considerable portion of the software overhead of MPI.
Figure 6: Numeric application result
5. Past Reports of Fast MPI Implementations on MPPs
As far as we know, all previous efforts to lower the communication latency of MPI have tuned the libraries so that their software overhead becomes minimal, given that all the arguments are dynamic. None have employed static compiler techniques to improve performance.
Franke and Hochschild report [1] that the lowest latencies achieved with their MPI implementations on the SP1 and SP2 are 30 microsec. and 40 microsec. respectively, with throughputs of 9 MBytes/sec. and 35 MBytes/sec. However, these figures are based on polling, and do not apply to multi-process environments in practice. In such a case, an interrupt-based implementation must be used, where the overhead increases to 200 microsec.
MPI/DE [2] is an implementation of MPI on NEC's Cenju-3 MPP, where the underlying operating system is Mach 3.0. Because Mach has kernel-supported threads that can be notified via kernel upcalls, interrupt handling can be made faster. Konishi reports that the lowest latencies are 60 microsec., 90 microsec., and 140 microsec. for the polling-based, upcall-based, and standard interrupt-based implementations, respectively. However, because the network DMA controllers of the Cenju-3 operate in physical address space while MPI/DE works in logical address space, and no user-level facilities are provided for performing the necessary translation, messages must always be copied between user buffers and kernel buffers, incurring significant overhead. Furthermore, the polling-based implementation is not usable in a multi-process setting, and the upcall optimization is not portable in that it relies heavily on Mach functionality.
Sitsky describes an implementation of MPI on the AP1000 [3], where the underlying CellOS is slightly modified and the broadcast network is utilized to speed up group communication. Still, the latencies are reported to be 171.8 microsec. and 64 microsec. for the in-place method (similar to DMA+Interrupts) and the protocol method, respectively, and the throughputs are 2.69 MBytes/sec. and 14.83 MBytes/sec., respectively, indicating that the hardware performance is not well utilized.
6. Discussions and Future Work
We have presented OMPI, a compile-time optimizer for MPI that eliminates much of the communication overhead using partial evaluation techniques. Performance benchmarks show that, even compared to traditional dynamic optimization techniques, our system is faster by substantial margins, especially for communication-dominant computations on a high-performance hardware interconnect.
There are still technical issues to be worked out in future research. Some issues we share with other elaborate compilation techniques, such as separate compilation and debugging of optimized code. We believe that solution techniques from advanced compilers can be applied. Furthermore, since the user can always fall back to the non-optimized version of MPI, it is possible for the user to fully debug his code before applying OMPI.
There are other static optimization techniques that could be applied. For example, we could perform more extensive static analysis, such as variable range checking, which would be effective in eliminating many of the checks even without full static information. Another is communication rescheduling: even a simple algorithm would be effective in grouping communications and applying techniques such as message vectorization and piggybacking [10], as sketched below. More elaborate communication rescheduling techniques would allow further optimizations. We are also considering combining our techniques with dynamic optimization techniques.
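As a hedged illustration of message vectorization (not implemented in the current OMPI), a loop of small sends to one destination could be rescheduled into a single send of the contiguous data:

#include <mpi.h>

/* Before rescheduling: n small sends, each paying the full per-message
   software overhead. The tag argument is a hypothetical application tag. */
void send_rows_naive(int *row, int n, int dest, int tag)
{
    int i;
    for (i = 0; i < n; i++)
        MPI_Send(&row[i], 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
}

/* After message vectorization: a single send of the packed data,
   valid because row[0..n-1] is contiguous. */
void send_rows_vectorized(int *row, int n, int dest, int tag)
{
    MPI_Send(row, n, MPI_INT, dest, tag, MPI_COMM_WORLD);
}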
One of the current technical challenges with our MPI optimization is how to ease the effort of implementing template functions. We are currently taking three approaches to this problem. One is classic software engineering, that is, separating machine-independent optimizations from machine-specific ones. Another is semi-automated tool support: a software tool could aid the user in specializing his code by semi-automatically generating the code the user starts out with, given the static/dynamic distinctions of the arguments. The tool could also be supplied with the characteristics of the underlying hardware and operating system (latency/bandwidth of different network interfaces, polling/interrupt-driven/buffered, single/multiprocessing, etc.) and further select or eliminate parts of the code, in a similar manner to the current partial evaluator.
Another interesting approach is to implement the core functionality of a subset of MPI, implement the more sophisticated functionality in terms of that core subset, and optimize it with our MPI optimizer by expanding everything out with partial evaluation. By taking care not to implement MPI functions to be mutually recursive, such recursive expansion via partial evaluation should terminate in a few iterations. Indeed, the MPI standard defines an official subset in terms of which the other MPI functionality can be implemented; we must investigate whether this official subset is sufficient for our purpose, in terms of both its functionality and the speed of the resulting implementation.
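For instance, a group operation such as MPI_Bcast could be written over the point-to-point core and then expanded and specialized by the optimizer. The naive linear sketch below is purely illustrative (the MPI standard does not prescribe this implementation, and BCAST_TAG is a hypothetical reserved tag):

#include <mpi.h>

#define BCAST_TAG 999  /* hypothetical tag reserved for the broadcast */

/* A naive linear broadcast over the point-to-point core; the optimizer
   could then expand and specialize the embedded MPI_Send/MPI_Recv calls. */
int core_bcast(void *buf, int count, MPI_Datatype type,
               int root, MPI_Comm comm)
{
    int rank, size, i;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == root) {
        for (i = 0; i < size; i++)
            if (i != root)
                MPI_Send(buf, count, type, i, BCAST_TAG, comm);
    } else {
        MPI_Recv(buf, count, type, root, BCAST_TAG, comm, &status);
    }
    return MPI_SUCCESS;
}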
Finally, it is an interesting research and design issue how much of the new features currently proposed for MPI 2.0 could be superseded by optimization techniques such as ours. Indeed, some of the new proposals, such as threads, are fundamentally beneficial, but there could be some features that are not necessary and would otherwise have an unsatisfactory effect on the current execution model and/or the MPI API.
7. Acknowledgements
This research was conducted under a grant from the Real World Computing Partnership of Tsukuba, Japan. We thank Frank O'Caroll, Marc Snir, Yutaka Ishikawa, Mitsuhisa Sato, Satoshi Sekiguchi, Umpei Nagashima, and Nayeem Islam for their valuable comments.
2 This optimization has been suggested by Marc Snir, but has not been implemented yet in our current system.
3 As noted earlier, the SUIF representation is not textual; the described code has been retransformed back to C for readability.
4 Interprocedural analyses can be easily implemented, using the Global Symbol Table of SUIF, which provides methods to access symbols and procedures across program hierarchy.
[2] K. Konishi, Y. Takano, and A. Konagaya. MPI/DE: an MPI Library for Cenju-3. In MPI Developers Conference, University of Notre Dame, June 1995.
[3] D. Sitsky and K. Hayashi. Implementing MPI for the Fujitsu AP1000/AP1000+ Using Polling, Interrupts and Remote Copying. In Proceedings of Joint Symposium on Parallel Processing '96, University of Waseda, Japan, June 1996 (to be submitted).
[4] T. Shimizu, T. Horie, and H. Ishihata. Low-latency Message Communication Support for the AP1000. In Proceedings of 19th Annual International Symposium on Computer Architecture, pp. 288-297, May 1992.
[5] R. Wilson, R. French, C. Wilson, S. Amarasinghe, J. Anderson, S. Tjiang, S-W. Liao, C-W. Tseng, M. Hall, M. Lam, and J. Hennessy. The SUIF Compiler System. Computer Systems Laboratory, Stanford University, 1994.
[6] Message-Passing Interface Forum. MPI: A Message Passing Interface Standard, Version 1.1. June 1995.
[7] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.
[8] K. Taura, S. Matsuoka, and A. Yonezawa. An Efficient Implementation Scheme of Concurrent Object-Oriented Languages. In Proceedings of 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 218-228, May 1993.
[9] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active Messages: a Mechanism for Integrated Communication and Computation. In Proceedings of the 19th International Symposium on Computer Architecture, pp. 256-266, May 1992.
[10] N. Islam, A. Dave, and R. Campbell. Communication Compilation for Unreliable Networks. In 16th International Conference on Distributed Computing Systems, May, 1996.