Runtime Performance of Parallel Array Assignment: An Empirical Study

Lei Wang
The University of North Carolina at Chapel Hill
Department of Computer Science, UNC, Chapel Hill, NC 27599-3175

 
wangle@cs.unc.edu 

http://www.cs.unc.edu/~wangle/

James M. Stichnoth
Carnegie Mellon University
School of Computer Science, CMU, Pittsburgh, PA 15213-3890


stichnot@cs.cmu.edu 

http://www.cs.cmu.edu/~stichnot/

Siddhartha Chatterjee
The University of North Carolina at Chapel Hill
Department of Computer Science, UNC, Chapel Hill, NC 27599-3175

 
sc@cs.unc.edu 

http://www.cs.unc.edu/~sc/

Abstract:: Generating code for the array assignment statement of High Performance Fortran(HPF) in the presence of block-cyclic distributions of data arrays is considered difficult, and several algorithms have been published to solve this problem. We present a comprehensive study of the run-time performance of the code these algorithms generate. We classify these algorithms into several families, identify several issues of interest in the generated code, and present experimental performance data for the various algorithms. We demonstrate that the code generated for block-cyclic distributions runs almost as efficiently as that generated for block or cyclic distributions.
Keywords:: Array assignment, code generation, performance evaluation, block-cyclic distribution, High Performance Fortran.

1 Introduction

The array assignment statement in High Performance Fortran [5] is an extremely powerful linguistic construct. Its simple syntax belies the fact that it can encode a wide range of communication operations, from purely local computation to shift communication to all-to-all personalized communication. It is therefore essential that the compiler generate efficient code for this construct. In this paper, our canonical example of the array assignment statement has the form

with , , , , and the array indices being related as follows:

The array assignment statement has the same semantics as the following pair of sequential loops.

    n = ceil((h1-l1)/s1)
    do i = 1, n
        T(i) = B(l2+i*s2)
    enddo
    do i = 1, n
        A(l1+i*s1) = T(i)
    enddo

The difference between the two pieces of code is operational, as the array assignment statement is expected to be executed in parallel. In general, we need the temporary array T due to the possibility of overlap between the array sections of A and B, but the compiler may eliminate this temporary array if it is safe to do so.

The factor that complicates code generation for array assignment is the memory model of HPF, which requires the user to provide alignment and distribution information for arrays. Alignment specifies the co-location of elements of different arrays, while distribution specifies the location of an individual array element on the processor grid. The most general distribution available for an array dimension in HPF is the cyclic(k) (or block-cyclic) distribution, which divides the array dimension into blocks of k contiguous elements and maps the blocks to the processors in cyclic (round-robin) fashion. Thus, if an array dimension of size n is mapped to p processors using a cyclic(k) distribution, then element i of the array is held in the local memory of processor . Two important extremal cases are k = 1 (cyclic distribution) and (block distribution).

To see how array distribution complicates work of the code generator, consider the following code fragment.

!HPF$ PROCESSORS P(3)
      REAL A(0:59)
!HPF$ DISTRIBUTE A(CYCLIC(4)) ONTO P
      A(0:59:5) = 0.00

This simple program operation zeroes every fifth element of the array A--completely local computation without any inter-processor communication. However, the actions required to perform this in a distributed memory environment are nontrivial. Figure 1 shows the elements of A mapped to each of the three processors, with the elements affected by assignment statement shown enclosed in boxes. (Note that a single array dimension is mapped to a two-dimensional structure; we call each horizontal row within a processor a course.) Each processor has to compute the global array indices that are mapped to its local memory, decide which of these are targets of the array assignment, and update the values of the memory locations corresponding to this target set. We would like to accomplish this with a total number of operations proportional to the size of the array section being updated (as opposed to the size of the whole array). We would also like to perform memory accesses in a way that enhances locality of reference.

Figure 1: A block-cyclic data decomposition with k=4, p=3, s=5.

The block and cyclic distributions are important precisely because they simplify the problem considerably. Efficient code generation algorithms for these cases are widely known [9, 10]. However, efficient algorithms for the general block-cyclic case have been discovered only recently. Currently, about seven such algorithms are known; their mathematical roots are different, and they are often based on different assumptions about memory and communication characteristics.

This paper is an attempt to compare, over the large parameter space, the run-time performance of the code generated by these algorithms for block-cyclically distributed arrays. While other papers have presented performance information for individual algorithms, the information there has been limited and cannot be used as a basis for comparing the different algorithms. It is not our purpose to pronounce judgment on these competing algorithms, but rather to catalog the performance of these algorithms over a wide spectrum of scenarios, in the hope that such a catalog may be useful to a compiler writer trying to decide on a code generation algorithm to use.

The remainder of the paper is organized as follows. Section 2 describes the various algorithms used in this study and categorizes them into families. Section 3 describes the experimental setup and the test cases for which we test the performance of the code generated by the algorithms, followed by the results and analysis. Finally, Section 4 presents the conclusions of this performance analysis study.

2 Algorithms

We categorize the code generation algorithms into several families based on the similarities in their mathematical formulation. We generally identify algorithms by the institutions at which they were developed. Due to space constraints, we cannot provide detailed descriptions of the algorithms. We refer the reader to the cited papers for this purpose.

2.1 Run time resolution

The simplest algorithm for this problem is called run time resolution [11] (abbreviated RTR). In this method, each processor examines every array index i, determines if A(i) is mapped to its local memory, and if so, performs the required computation. While this approach is simple, it is inherently non-scalable, as each processor must examine the entire address space. This algorithm has been used in the past as a baseline for comparison, but we feel it is a straw man rather than a serious contender. We therefore show running times for this algorithm but do not use it as a baseline in our study.

2.2 Linear algebraic methods

The next family of algorithms is linear algebraic. The two algorithms in this family are the IBM algorithm [12] and the École des Mines algorithm [1]. They represent data layout and computation mapping constraints in terms of linear equations and use linear algebraic transformations to derive the bounds and strides of the loops executed by the processors. The IBM algorithm uses (essentially) symbolic Gaussian elimination, whereas the École des Mines algorithm uses more complicated transformations, including Fourier-Motzkin elimination, the Hermite Normal Form, and projection operations. The code generated by either algorithm has a two-deep constant stride loop nest for each array dimension, with nonlinear operations such as max, min, floor, and ceiling being involved in the calculation of the loop bounds. The pseudocode for a single dimension for each node consists of the following loop:

for (b = 0; b < NumCourses; b++)
    compute first and last /* the first and last elements 
                              accessed in this course */
    for (base = first; base <= last; base += stride)
        compute with address base
    endfor
endfor

stride is the access stride. NumCourses, the number of courses on this node, can be calculated at either compile time or run time, depending on which parameters are known statically.

We used the IBM algorithm for our experiments on local address generation for one-dimensional arrays. The only implementation of the École des Mines algorithm available from the authors required manual intervention, which we did not consider feasible for use in an experimental setting.

2.3 Table-driven methods

The next family of algorithms are the ones we consider table-driven. The three algorithms we consider in this category are the RIACS algorithm [2], the Rice algorithm [6, 7, 8], and the LSU algorithm [14].

The basic idea in this family of algorithms is to capture the regularities in the local data access patterns for a given array assignment statement using a table of size . (The original presentation of the RIACS algorithm was couched in terms of finite state machines, but table-driven is a more accurate characterization of this approach.) These table entries can be precomputed at compile time if all parameters are statically known; otherwise they are computed as a preprocessing step at run time. The variations in the algorithms arise from the method of storing the table and the method of computing the table entries. The original RIACS algorithm used the theory of linear Diophantine equations to compute the table entries. This algorithm included a sorting step, making its complexity . The Rice and LSU algorithms use a different lattice-theoretic approach to the problem that removes the need for sorting and reduces the complexity to . These two algorithms differ in the exact number of points explored in the lattice, resulting in different constant factors.

An important characteristic of this family of algorithms is that, by introducing an inexpensive table generation step, they need only a single flattened loop for accessing an array dimension, making the generated code very simple and scalable. However, the loop has a non-uniform access stride, making it more difficult to produce efficient machine code. The pseudocode for a single dimension for each node consists of the following loop:

base = startmem; i=0;
while (base <= lastmem) do
    compute with address base
    base +=  Delta_M[i]; i++;
    if (i == length) i = 0;
enddo

startmem and lastmem are the first and last local memory locations to access; length is the length of the table; Delta_M[i] is the distance between successive local memory locations. All of these values are computed during the table generation step.

We ran some initial experiments (see Section 3.1.1)that clearly established the superiority of the LSU algorithm over the other two in this family in generating tables. Since the RIACS is the only one in this family that discussed the schemes on multi-dimensional case and communications, we combined the LSU's table generation algorithm with those of RIACS's schemes in subsequent experiments, and still call this hybrid algorithm the LSU algorithm.

2.4 Set-theoretic methods

The final family of algorithms is set-theoretic. These are based on treating regular sections of arrays as sets, and on describing the mapping of array elements to processors and the inter-processor communication sets in terms of unions and intersections of regular sections. The two algorithms in this family are the CMU algorithm [13] and the OSU algorithm [4]. Both algorithms produce a two-deep loop nest for each dimension, the outer loop iterating over courses of elements and the inner loop iterating over array elements in a course.

The OSU algorithm is based on viewing a block-cyclic distribution of an array as a composition of a block distribution with a cyclic distribution. Depending on the order of function composition, these are called the virtual-block and virtual-cyclic views. Since the block and cyclic decompositions are easy to analyze, the general case becomes a tedious (but straightforward) algebraic manipulation of unions and intersections of sets. This applies to the determination of quantities such as the local set, the send set, and the receive set of a processor. Although the final values of these quantities are the same, the amount of computation required depends on the views used for the arrays. This is due to the fact that the intersection of two regular sections is a regular section, but this is not true in general of the union of two regular sections; this difference affects the complexity of the algorithm, producing a closed-form expression in one case and a search procedure in the other. The algorithm has a model of these costs, and picks a representation based on its estimate of the lowest of these costs.

The CMU algorithm is essentially the OSU algorithm using only the virtual-cyclic view of data distribution, although it also contains additional algorithms for the block and cyclic special cases. Apart from this view, the two algorithms have subtle implementational differences that result in different constant and per-element overheads.

3 Experimental details and results

For our experiments, we wrote a parser to parse an HPF array assignment statement along with the associated alignment and distribution directives. A command-line argument specifies the code generation algorithm to invoke. This invocation of the code generation algorithm produces SPMD node code in ANSI C, along with some testing and timing code. This is then compiled using the C compiler and executed to obtain the timing information. We measure wall clock time, running enough iterations of the loop to eliminate problems related to timer resolution and cache startup. All parameters are known at compile time unless otherwise stated (but the table in the LSU and OSU algorithm are still generated at run-time). The machines we experimented on include SparcStation 5, SparcStation 20, Cray T3D, and Intel Paragon. We use gcc as our C compiler for the SparcStations, cc for the Cray T3D, and icc for the Intel Paragon.

We wanted to insure that we were running the best possible implementations of each code generation algorithm, so that the comparisons were fair. We therefore tried to obtain the original source code from the authors of the algorithms, and succeeded in most cases. Where we failed to obtain source code, we implemented our own version of the algorithm based on the published description. In all cases, we performed as much source-level optimization as possible and also ran the C compiler with optimization at -O3 level. In several cases we examined the assembly code produced by the C compiler and re-engineered the source code to produce better machine code. Such changes typically involved strength reduction on division and remainder operations.

In all the plots in this section, the block size k is the independent variable and the running time per element, measured in microseconds, is the dependent variable. The number of processors p and the upper bound n of the array are parameters for our plots. We set l = 0 and h = n for all experiments. We use time per element to present our results because it serves better in isolating other factor (such as load imbalance which is independent of algorithms) and focusing on the algorithm characteristics. (The single exception is the plot in Section 3.1.1 which reports time per processor.)

We see three issues that contribute to the complexity of compiling an array assignment statement: generating local addresses, dealing with multidimensional arrays, and communication generation. To isolate the effects of each issue, we devised experiments to study them one at a time.

3.1 Generating local addresses

The simplest case of array assignment is of the form A(l:h:s) = 0.0. This involves local work on the part of the processors, without any inter-processor communication. Load imbalance is still possible depending on the relation between p, k, and s. (This imbalance is independent of the code generation algorithm.) This is the best possible situation for code generation, since the performance obtained here is independent of the communication subsystem of the machine.

In this category of experiments, we ran some initial tests that clearly established the superiority of the LSU algorithm over the other two in the family of table-driven methods. Then we compared the RTR, IBM, and LSU algorithms. The OSU algorithm were designed chiefly for the case where communication is involved, and it proved difficult to adapt them for the local computation case. We performed this set of experiments on a SparcStation 5 running SunOS and the gcc compiler, running a sequential outer loop to simulate the actions of multiple processors. We also ran the same experiments on a T3D with the cc compiler to demonstrate the sensitivity of the results to the experimental environment.

3.1.1 Comparison of the table-driven algorithms

Figure 2 shows the performance of the three table-driven algorithms (RIACS, Rice, and LSU) for local address generation. Since the table generation process runs only once on each processor for each dimension of an array, and its time is independent of array size, we present the results in terms of time per processor rather than time per element. The LSU algorithm consistently outperforms the RIACS and Rice algorithms in the plot. (Though the LSU curve looks flat, it is in fact linear with a small constant factor.) This agrees with the results of the complexity analysis of these algorithms. This behavior was also demonstrated for other parameter values. We therefore pick the LSU algorithm as our representative table-driven algorithm.

One interesting feature of these plots is their improved behavior for block sizes that are multiples of the stride. This is easily explained by the fact that local memory accesses in this case have constant stride.

Figure 2: Comparison of RIACS, Rice, and LSU algorithms: h=10000, p=16, s=5.

3.1.2 Performance of local address generation

Figures 3 through 5 present the running time per element of each of the RTR, IBM, and LSU algorithms for different values of the access stride s and in two different machines. It is clear that RTR is unacceptably slow; thus we concentrate our discussion on LSU and IBM hereafter. These graphs show that LSU wins over IBM for small block sizes, while IBM outperforms LSU as the block size increases. Their crossover points vary over different access strides. Figure 6 shows their approximate crossover points at access strides ranging from 1 to 20.

tex2html_wrap_inline728

tex2html_wrap_inline730

tex2html_wrap_inline732

tex2html_wrap_inline734

tex2html_wrap_inline736

Figure 3: Comparison of the RTR, IBM, and LSU algorithms for local address generation, with h=100000, p=16, and s=1. The first three plots were run on a SparcStation 5, and the last three were run on a Cray T3D.

tex2html_wrap_inline738

tex2html_wrap_inline740

tex2html_wrap_inline742

tex2html_wrap_inline744

tex2html_wrap_inline746

tex2html_wrap_inline748

Figure 4: Comparison of the RTR, IBM, and LSU algorithms for local address generation, with h=100000, p=16, and s=5. The first three plots were run on a SparcStation 5, and the last three were run on a Cray T3D.

tex2html_wrap_inline750

tex2html_wrap_inline752

tex2html_wrap_inline754

tex2html_wrap_inline756

tex2html_wrap_inline758

tex2html_wrap_inline760

Figure 5: Comparison of the RTR, IBM, and LSU algorithms for local address generation, with h=100000, p=16, and s=9. The first three plots were run on a SparcStation 5, and the last three were run on a Cray T3D.

tex2html_wrap_inline764

Figure 6: Crossover points of IBM and LSU performance as a function of access stride for local address generation, with h=100000, p=16. The top plot is for a SparcStation 5, and the bottom plot for a Cray T3D.

The different shapes of these graphs deserve explanation. The LSU algorithm involves two separate steps. The table generation time is proportional to the block size, but this fixed overhead is negligible compared to the total time spent in the second step. The second step involves a flattened loop to access each dimension of the array, which implies roughly the same amount of work in accessing each array element, regardless of the array size and the block size. This makes the time per element curve of LSU algorithm almost flat. (The LSU curve from the T3D experiment becomes inconsistent with this theory as the block size increases. We believe this is caused by the limited cache on the T3D and have verified this assumption through other experiments.)

Given the two nested loops in IBM algorithm and the cost of outer loop initializations, its time per element shows a 1/x behavior, resulting in a significantly higher cost for small block sizes. The outer loop iterates over all the courses on the node, and the inner loop iterates over the elements within each course. Therefore, with the other parameters fixed, increasing the block size makes the inner loop larger, thus improving performance. With a large inner loop, the fixed loop stride of the IBM algorithm allows it to outperform the LSU algorithm. On the other hand, with other parameters fixed, the outer loop overhead dominates as access stride increases, making the IBM algorithm more sensitive to the access stride than the LSU algorithm. Thus the block size value at the crossover point tends to increase with the access stride.

3.2 Dealing with multidimensional arrays

Multidimensional arrays are treated as the tensor product of the individual dimensions. Thus, they do not present any new conceptual problems. However, numerous implementation details need to be considered here. This group of experiments is intended to establish the behavior of the algorithms as a function of the number of dimensions of the array. Specifically, we compare the performance of the LSU algorithm for the local address generation case with one-, two-, and three- dimensional arrays.

We fixed the total size of the array at 64,000,000 elements and the total number of processors at 64. In addition, we tried to make the total number of elements accessed to be roughly equal for each program in the same group (we group the experiments by stride size). We experimented with the following statements and processor organization:

tabular173

This experiment was conducted on a SparcStation 20 with 120MB memory and a 2-level cache. We present the results for access strides of 1 and 9 in Figure 7.

tex2html_wrap_inline774

tex2html_wrap_inline776

tex2html_wrap_inline778

tex2html_wrap_inline780

tex2html_wrap_inline782

Figure 7: Comparison of the performance of local address generation for one-, two- and three- dimensional array under LSU algorithm, with h=64,000,000, p=64. The first three plots have s=1 and the last three plots have s=9.

Per-element running time

As expected, the per-element running time (Figure 7) changes very little as we change the number of dimensions. This demonstrates the scalability of the LSU algorithm. The one abnormality is the fourth graph in Figure 7, which behaves much worse than expected. We believe this is caused by page level thrashing.

Load imbalance

Load imbalance is inherent in the multi-dimensional case, independent of the code generation scheme. This results in worse completion time for the higher dimensional case despite the time per element remaining unchanged. Assume a d-dimensional array, with n elements per dimension. Suppose that each dimension is distributed onto p processors with block size k. Then each processor holds either or blocks of data in each dimension. The load imbalance factor can be estimated as , which is exponential in d. Thus the load imbalance is more pronounced for higher dimensional arrays unless the block size is chosen carefully. Figure 8 shows the load imbalance as a function of block size for our three-dimensional test case.

Figure 8: Load imbalance of a three-dimensional array with h=400*400*400, s=(9,1,1), distributed on p=4*4*4.

Data locality

A multi-dimensional array is linearized in memory in some order (e.g., row major, column major), typically specified at the language level. When such an array is accessed with non-unit stride, the local memory access pattern depends strongly on the dimensions that are accessed in this manner. This difference in access pattern can affect the performance greatly for large data sizes by affecting the cache behavior. Figure 9 shows the difference in performance as we change the array dimensions accessed with non-unit stride. This suggests an opportunity for a programmer or an optimizing compiler to improve the run-time performance by restructuring arrays in memory, by interchanging loops, or by carefully choosing array distribution parameters.

Figure 9: Effects of cache behavior on the local address generation for a three-dimensional array under LSU algorithm, with h=64,000,000, p=64.

3.3 Generating communication sets

Most instances of the array assignment statement involve communication. In this group of experiments, we measured three kinds of array assignment statements: shifts, stride changes, and redistributions. The shift statement, A(l:h:s) = A(l+delta:h+delta:s), involves neighbor communication with at most two other processors. The stride change statement, A(l1:h1:s1) = A(l2:h2:s2), usually requires all-to-all personalized communication. A similar communication pattern results from a redistribution, A(0:h) = B(0:h), where A and B have different block sizes. We considered redistributions where B has a block-cyclic distribution and A has either a block or a cyclic distribution.

We tested the LSU, CMU, and OSU algorithms on the Intel Paragon with the icc compiler. We used the following template for executing an array assignment statement:

Generate tables (if necessary)
Post all receives in advance using irecv()
Pack all communication buffers
Barrier synchronization using gsync()
Send all buffers using csend()
Wait for all messages to arrive using msgwait()
Unpack all buffers

Posting all receives in advance minimizes operating systems overhead in the communication. We put all communication into one phase (instead of, e.g., interleaving packs and sends where feasible) to give more meaningful values for the communication costs.

We break down the total execution time into three components: sharable computation time, non-sharable computation time, and communication time. Sharable computation is the computation that need only be done once if the same array assignment statement is repeated multiple times; this includes the table generation time (for the LSU and OSU algorithms only). Non-sharable computation must be repeated for every execution; this includes packing and unpacking of communication buffers. On the graphs, we report the non-sharable computation time, the total computation time, and the communication time, in terms of time per element. For each test, we compiled all information except for the block size into the executable.

3.3.1 Performance of stride communication

In this experiment, we timed the array assignment statement A(0:n*s1:s1) = B(0:n*s2:s2), where A and B have cyclic(k) distributions over p=16 processors. Figure 10 shows the results for n=10000, for stride combinations (s1=6,s2=3) and (s1=11,s2=8).

As expected, the CMU and OSU algorithms are comparable for small block sizes, where the OSU algorithm uses the virtual-cyclic view of the data distributions. In fact, in this region the CMU algorithm slightly outperforms the OSU algorithm, due to slightly smaller overhead in the implementation details. However, the nonsharable OSU time is always better than the CMU time, so the OSU algorithm is the clear winner over CMU here.

For very small block sizes, the OSU algorithm performs marginally better than the LSU algorithm; for large block sizes OSU performs significantly better. For small to medium block sizes, LSU is the winner. But across all block sizes, the OSU nonsharable time is significantly better than the LSU nonsharable time. Taken in conjunction with the IBM results of Section 3.1.2, this suggests the importance of having uniform access strides in the loop.

The obvious discontinuities in the OSU graphs deserve some explanation. These are the points where the OSU selection algorithm changes from the virtual-cyclic view to virtual-block. Because the buffer packing and unpacking (i.e., ``nonsharable'') curve appears continuous, the overall discontinuity results from the preprocessing phase. The amount of work in the preprocessing phase generally increases as the block size approaches the crossover discontinuity point.

In addition, although none of the graphs here show it, in some cases we have observed huge jumps in the per-element cost, which are cache related.

Our conclusion is that for this kind of array assignment statement, the best rule of thumb is to use the LSU algorithm for small block sizes, and the OSU algorithm for large block sizes. However, if we expect to amortize the sharable costs over many executions of the same array assignment statement, we should use the OSU algorithm exclusively. Another conclusion is that if the implementation details of the OSU preprocessing phase were improved (e.g., by reducing the reliance on the relatively expensive malloc() call), the OSU performance could be significantly improved.

tex2html_wrap_inline806

tex2html_wrap_inline808

tex2html_wrap_inline810

tex2html_wrap_inline812

tex2html_wrap_inline814

Figure 10: Comparison of CMU, OSU, and LSU algorithms for stride communication on an Intel Paragon with p=16. We measure total computation (i.e., buffer packing/unpacking) time, the non-sharable portion of the computation time, and the communication time. All times are scaled to per-element costs.

3.3.2 Performance of shift communication

Figure 11 shows the performance of the array assignment statement A(0:n*s-1:s) = B(1:n*s:s) on p=16 processors. We set n=10000 and considered both s=3 and s=8. The comparisons are essentially the same as for the stride communication, except that for the small block sizes the performance of the CMU and OSU algorithms is virtually identical.

tex2html_wrap_inline818

tex2html_wrap_inline820

tex2html_wrap_inline822

tex2html_wrap_inline824

tex2html_wrap_inline826

Figure 11: Comparison of CMU, OSU, and LSU algorithms for shift communication on an Intel Paragon with p=16. We measure total computation (i.e., buffer packing/unpacking) time, the non-sharable portion of the computation time, and the communication time. All times are scaled to per-element costs.

3.3.3 Redistributions

For our final experiment, we looked at two kinds of redistributions statements: one from block-cyclic to block, and one from block-cyclic to cyclic. Once again, we used p=16 processors; we show timings for array sizes n=1000 and n=100000.

The results for the block case are similar to the other communication results, except that the LSU packing and unpacking (nonsharable) time is substantially higher than the other LSU cases (5 s rather than 1-3 s). Even the CMU algorithm at large block sizes outperforms LSU here.

For the cyclic redistribution, the OSU algorithm performs similarly as before, but the CMU and LSU algorithms are able to make better use of the knowledge that the block size of one of the arrays is 1. The CMU algorithm ends up being the clear winner for the larger array size.

tex2html_wrap_inline834

tex2html_wrap_inline836

tex2html_wrap_inline838

tex2html_wrap_inline840

tex2html_wrap_inline842

Figure 12: Comparison of CMU, OSU, and LSU algorithms for block redistribution on an Intel Paragon. The first three plots have n=1000, and the last three have n=100000. We measure total computation (i.e., buffer packing/unpacking) time, the non-sharable portion of the computation time, and the communication time. All times are scaled to per-element costs.

tex2html_wrap_inline846

tex2html_wrap_inline848

tex2html_wrap_inline850

tex2html_wrap_inline852

tex2html_wrap_inline854

Figure 13: Comparison of CMU, OSU, and LSU algorithms for cyclic redistribution on an Intel Paragon, with p=16. The first three plots have n=1000, and the last three have n=100000. We measure total computation (i.e., buffer packing/unpacking) time, the non-sharable portion of the computation time, and the communication time. All times are scaled to per-element costs.

4 Conclusions

In this paper, we have presented a head-to-head performance comparison of several code generation algorithms for the array assignment statement in HPF for block-cyclic distribution of arrays. We have classified the published algorithms for this problem into several families, and have similarly classified the various issues of interest in the compilation of array assignment. The availability of this performance data for a wide range of operating conditions should aid compiler writers and library writers (such as the implementors of SCALAPACK [3]) in selecting an appropriate code generation strategy for a given situation.

The complete process of data layout in HPF involves both alignment and distribution. So far, we have considered one-level mappings, in which arrays are aligned identically to the template, and every template cell contains an array element. In the more general case of two-level mappings, the array can be aligned to the template in a strided manner, and some template cells may be empty. Since distribution technically specifies a mapping of the template to the processors, we are left with a decision concerning local memory allocation. We are conducting more experiments studying different strategies in reducing memory waste. In addition, we are investigating the effects of compile-time knowledge on runtime performance; i.e., what penalty we pay for producing more general code with fewer compile-time constants.

While block-cyclic distributions are important from an algorithmic standpoint [3], the perceived difficulty of efficiently compiling for such distributions has delayed the inclusion of this feature in commercial HPF compilers. Indeed, there has even been some discussion of removing block-cyclic distributions from HPF altogether. We demonstrate through our experiments that the code generated for block-cyclic distributions can run almost as efficiently as that generated for block or cyclic distributions if the algorithm is selected carefully. We hope that the performance numbers presented in this paper will dispel some of the popular misconceptions and lead to a wider acceptance of block-cyclic distributions.

Acknowledgments

This research was supported in part by NSF Career Award CCR-9501979 and in part by the Advanced Research Projects Agency/CSTO monitored by SPAWAR under contract N00039-93-C-0152. CRAY T3D time was provided by the North Carolina Supercomputer Center under the 1996 Cray Grant Program.

Shivnandan Kaushik provided us the source code for the OSU algorithm. Nenad Nedeljkovic provided us the source code for the Rice algorithm, which also included the original code for the RIACS algorithm. J. Ramanujam provided us the source code for the LSU algorithm. Jim Anderson, Rik Faith, and Jan Prins allowed us to use compute cycles on their workstations.

Author Biographies

Lei Wang (wangle@cs.unc.edu) is a graduate student in the Department of Computer Science at the University of North Carolina at Chapel Hill. She received her B.S. in computer science from Tsinghua University (Beijing, P.R.China) in 1994. Her research is in compilers for parallel languages.

James M. Stichnoth (stichnot@cs.cmu.edu) is a graduate student in the School of Computer Science at Carnegie Mellon University. He received his B.S. in computer science from the University of Illinois at Urbana-Champaign. His current research includes compiling scientific applications for distributed-memory multicomputers.

Siddhartha Chatterjee (sc@cs.unc.edu) is an assistant professor of computer science at The University of North Carolina at Chapel Hill. He received his B.Tech. (Honors) in electronics and electrical communications engineering in 1985 from the Indian Institute of Technology, Kharagpur, and his Ph.D. in computer science in 1991 from Carnegie Mellon University. He was a visiting scientist at the Research Institute for Advanced Computer Science (RIACS) in Mountain View, California, from 1991 through 1994, and worked at AT&T; Bell Laboratories, Murray Hill, in the summer of 1988. He has written several papers in the areas of compilers for parallel languages, computer architecture, and parallel algorithms. His research interests include the design and implementation of programming languages and systems for high-performance scientific computations, high performance parallel architectures, and parallel algorithms and applications.

References

1: C. Ancourt, F. Coelho, F. Irigoin, and R. Keryell. A Linear Algebra Framework for Static HPF Code Distribution. In Workshop on Compilers for Parallel Computers, Delft, The Netherlands, Dec. 1993.
2: S. Chatterjee, J. R. Gilbert, F. J. E. Long, R. Schreiber, and S.-H. Teng. Generating Local Addresses and Communication Sets for Data-parallel Programs. Journal of Parallel and Distributed Computing, 26(1):72-84, Apr. 1995.
3: J. J. Dongarra and D. W. Walker. Software Libraries for Linear Algebra Computations on High Performance Computers. SIAM Review, 37(2):151-180, June 1995.
4: S. K. S. Gupta, S. D. Kaushik, S. Mufti, S. Sharma, C.-H. Huang, and P. Sadayappan. On Compiling Array Expressions for Efficient Execution on Distributed-memory Machines. In A. N. Choudhary and P. B. Berra, editors, Proceedings of the 1993 International Conference on Parallel Processing, volume II, pages 301-305. CRC Press, Inc., Aug. 1993.
5: High Performance Fortran Forum. High Performance Fortran Language Specification. Scientific Programming, 2(1-2):1-170, 1993.
6: K. Kennedy, N. Nedeljkovic, and A. Sethi. Efficient Address Generation for Block-cyclic Distributions. In Proceedings of the 1995 International Conference on Supercomputing, pages 180-184, Barcelona, Spain, July 1995.
7: K. Kennedy, N. Nedeljkovic, and A. Sethi. A Linear-time Algorithm for Computing the Memory Access Sequence in Data-parallel Programs. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 102-111, Santa Barbara, CA, July 1995.
8: K. Kennedy, N. Nedeljkovic, and A. Sethi. Communication Generation for Cyclic(k) Distributions. In B. Szymanski and B. Sinharoy, editors, Languages, Compilers, and Run-Time Systems for Scalable Computers, pages 185-197. Kluwer Academic Publishers, 1996.
9: C. Koelbel. Compile-time Generation of Regular Communication Patterns. In Proceedings of Supercomputing'91, pages 101-110, Albuquerque, NM, Nov. 1991.
10: C. Koelbel and P. Mehrotra. Compiling Global Name-space Parallel Loops for Distributed Execution. IEEE Transactions on Parallel and Distributed Systems, 2(4):440-451, Oct. 1991.
11: T. MacDonald, D. Pase, and A. Meltzer. Addressing in Cray Research's MPP Fortran. In Proceedings of the Third Workshop on Compilers for Parallel Computers, pages 161-172, Vienna, Austria, July 1992. Austrian Center for Parallel Computation.
12: S. Midkiff. Local Iteration Set Computation for Block-cyclic Distributions. In C. Polychronopoulos, editor, Proceedings of the 24th International Conference on Parallel Processing, pages 77-84. CRC Press, Aug. 1995.
13: J. M. Stichnoth. Efficient Compilation of Array Statements for Private Memory Multicomputers. Technical Report CMU-CS-93-109, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, Feb. 1993.
14: A. Thirumalai and J. Ramanujam. Fast Address Sequence Generation for Data-parallel Programs Using Integer Lattices. In C.-H. Huang, P. Sadayappan, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing (LNCS 1033), pages 191-208. Springer, 1995.