Sparse LU Factorization with Partial Pivoting on Distributed Memory Machines

Cong Fu
Department of Computer Science
University of California
Santa Barbara, CA 93106

cfu@cs.ucsb.edu

Tao Yang
Department of Computer Science
University of California
Santa Barbara, CA 93106

tyang@cs.ucsb.edu

Abstract:: Sparse LU factorization with partial pivoting is important to many scientific applications, but the effective parallelization of this algorithm is still an open problem. The main difficulty is that partial pivoting operations make structures of and factors unpredictable beforehand. This paper presents a novel approach called for parallelizing this problem on distributed memory machines. incorporates static symbolic factorization to avoid run-time control overhead and uses nonsymmetric L/U supernode partitioning and amalgamation strategies to maximize the use of BLAS-3 routines. The irregular task parallelism embedded in sparse LU is exploited using the RAPID run-time system [9] which optimizes asynchronous communication and task scheduling. The experimental results on the Cray-T3D with a set of Harwell-Boeing nonsymmetric matrices are very encouraging and good scalability has been achieved. Even compared to a highly optimized sequential code, the parallel speedups are still impressive considering the current status of sparse LU research.
Keywords:: Sparse LU factorization, Column partial pivoting, Dense structures, Task scheduling, Irregular parallelism, Run-time support, Symbolic factorization.

Introduction

Sparse matrix factorization is the key to solving a sparse system of linear equations. If a matrix is symmetric and positive definite, Cholesky factorization can be used, for which fast sequential and parallel algorithms have been developed in [22,23,17]. However in many applications such as circuit simulation, computational fluid dynamics and structural mechanics, the associated equation systems involve nonsymmetric matrices. Pivoting must be conducted to maintain numerical stability for such nonsymmetric linear systems and a typical strategy is partial column pivoting [3,16]. Because the pivoting operations interchange rows based on the numerical values of matrix elements during the elimination process, it is impossible to predict the precise structures of and factors without actually performing the numerical factorization. The adaptive and irregular nature of sparse LU data structures makes the efficient implementation of this algorithm very hard even on a modern sequential machine with memory hierarchies.

There are several approaches that can be used for solving nonsymmetric systems. One approach is the unsymmetric-pattern multi-frontal method [18] that uses elimination graphs to model irregular parallelism and guide the parallel computation if the pivoting sequence can be known prior to numerical factorization. Another approach [11] is to restructure a sparse matrix into a bordered block upper triangular form and use a special pivoting technique which preserves the structure and maintains numerical stability at acceptable levels. It is implemented on Illinois Cedar multi-processors based on Aliant shared memory clusters. Our paper focuses on the parallelization issues for a given matrix ordering with a commonly used pivoting strategy (column partial pivoting) to maintain numerical stability. Parallelization of sparse LU with partial column pivoting is also studied in [12,13] on a shared memory machine. Their approaches overestimate the nonzero fill-ins by using a static symbolic LU factorization so that the dynamic variation of LU data structures is avoided. They have obtained good speedups for up to 6 processors on a Sequent machine and further work is needed to assess the performance of the sequential code.

As far as we know, there is no published result for parallel sparse LU on current commercially available distributed memory machines such as Cray-T3D, Intel Paragon, IBM SP/2, TMC CM-5 and Meiko CS-2. One difficulty in the parallelization of sparse LU on these machines is how to utilize the sophisticated uni-processor architecture. The design of a sequential algorithm must take advantage of caching, which makes some previously proposed techniques less effective. On the other hand, a parallel implementation must utilize the fast communication mechanisms available on these machines. It is easy to get speedups by comparing a parallel code to a sequential code which does not fully exploit the uni-processor capability, but it is not as easy to parallelize a highly optimized sequential code. One such sequential code is SuperLU [4] which uses a supernode approach to conduct sequential sparse LU with column partial pivoting. The supernode partitioning makes it possible to perform most of the numerical updates using BLAS-2 level dense matrix-vector multiplications, and therefore to better exploit memory hierarchies. They perform symbolic factorization and generate supernodes on the fly as the factorization proceeds. Their code delivers impressive performance and is among the best sequential codes for sparse LU with partial pivoting [4,2]. However it is challenging to parallelize their code to get scalable performance. They are working on the parallelization of SuperLU for shared memory machines. But so far we have not seen any published results on the parallelization of their method on distributed memory machines.

In this paper, we present a novel approach that considers three key optimization strategies together in parallelizing the sparse LU algorithm: 1) adopt a static symbolic factorization scheme to eliminate the data structure variation caused by dynamic pivoting; 2) identify data regularity from the sparse structure obtained by the symbolic factorization scheme so that efficient dense operations can be used to perform most of the computation; 3) make use of graph scheduling techniques and efficient run-time support to exploit irregular parallelism. We observe that on most current commodity processors with memory hierarchies, a highly optimized BLAS-3 subroutine usually outperforms a BLAS-2 subroutine in implementing the same numerical operations [3,5]. We can afford to introduce some extra BLAS-3 operations in re-designing the LU algorithm so that the new algorithm is easy to be parallelized but the sequential performance of this code is still competitive to the current best sequential code. We use the static symbolic factorization technique first proposed in [12,13] to predict the worst possible structures of the and factors without knowing the actual numerical values, then we develop a nonsymmetric L/U supernode partitioning technique to identify the dense structures in both and factors, and maximize the use of BLAS-3 level subroutines (matrix-matrix multiplication) for these dense structures. We also incorporate a supernode amalgamation technique to increase the granularity of the computation.

In modeling the irregular parallelism in the re-designed sparse LU algorithm, we use directed acyclic task graphs (DAGs). Scheduling and executing DAG parallelism is a difficult job because parallelism in sparse problems is irregular and execution must be asynchronous. We need to overlap computation with communication, balance loads and eliminate unnecessary communication overhead. We have examined two possible scheduling strategies. One is to use the compute-ahead scheduling [16] and another is to use sophisticated graph scheduling. We implemented our sparse LU method using a run-time system RAPID [8,9] and conducted experiments with a set of Harwell-Boeing nonsymmetric matrices on Cray-T3D. Our experiments show that the scheduled code has achieved good scalability.

The rest of the paper is organized as follows. Section 2 gives the problem definition. Section 3 describes how to predict data structures for sparse LU factorization. Section 4 presents a nonsymmetric L/U supernode partitioning. Section 5 describes the program partitioning and the task graph definition. Section 6 addresses the scheduling and run-time support issues. Section 7 discusses a supernode amalgamation technique to adjust task granularity. Section 8 presents the experimental results. Section 9 concludes the paper.

Preliminaries

Sparse LU factorization can be used to solve large linear systems arising from many important applications [3,16,21]. Given a nonsingular matrix , it can be factorized into two matrices and using Gaussian Elimination with column partial pivoting. During each step of the elimination process, a row interchange may be needed to maintain numerical stability. The elimination process is listed in Figure 1 and the result of the factorization process can be expressed by:

where is a unit lower triangular matrix, is a upper triangular matrix, and is a permutation matrix which contains the row interchange information. The solution of a linear system can hence be solved by two triangular solves: and . The triangular solves are much less time consuming than the Gaussian Elimination process.

Figure 1: Gaussian elimination with column partial pivoting for LU factorization.

When is a nonsymmetric sparse matrix, the structures of triangular factors and depend on not only the sparsity structure of but also the numerical values of . Most sparse LU algorithms involve dynamic memory allocation during the factorization process. For example, the SuperLU package [4] first makes a rough estimation on the total storage needed, or more precisely according to the results from previous factorization, and expands the storage when it becomes necessary later during the factorization.

Caching behavior plays an important role in achieving good performance for scientific computations. To better exploit memory hierarchy in modern architectures, supernode partitioning is an important technique to exploit the regularity of sparse matrix computations and utilize BLAS operations to take advantage of dense operations. It has been successfully applied to Cholesky factorization [19,22,23]. The difficulty for the nonsymmetric case is that supernode structure depends on pivoting choices during the factorization and thus cannot be determined in advance. The SuperLU performs symbolic factorization and identifies supernodes on the fly. It also maximizes the use of BLAS-2 level operations to improve the cache performance of sparse LU. As a result, it outperforms several other sparse LU packages [4]. For instance, it is better than UMFPACK [2] for 15 out of 21 testing matrices.

It is challenging to parallelize SuperLU and get good speedups because of the following reasons.

One has to parallelize the symbolic factorization part. Otherwise the speedups will be limited by the symbolic factorization which contributes 20--45% of the total execution time for most of the tested matrices [4].
Even if the symbolic factorization is parallelizable, there exists another difficulty for parallelization, especially on distributed memory machines. The pivot selection and row interchange dynamically increase fill-ins, which in turn will cause changes of L/U data structures, data dependence structures and processor loads. Using the precise pivoting information at each elimination step can certainly optimize data space usage, reduce communication and improve load balance, but such benefits could be offset by high run-time control overhead.

The strategy of static data structure prediction in [13] is valuable in avoiding dynamic symbolic factorization, identifying the maximal data dependence patterns and minimizing dynamic control overhead. We will use this static strategy. But the overestimation does introduce extra fill-ins and some unnecessary operations in the numerical factorization.

We observe that in SuperLU [4] the DGEMV routine (the BLAS-2 level dense matrix vector multiplication) accounts for 78% to 98% of the floating point operations (exclude the symbolic factorization part). It is also a fact [3] that BLAS-3 routine DGEMM (matrix-matrix multiplication) is usually much faster than BLAS-1 and BLAS-2 routines. On Cray-T3D with a matrix of size 25x25, DGEMM can achieve 103 MFLOPS while DGEMV only reaches 85 MFLOPS. Thus the key idea of our approach is that if we could find a way to maximize the use of DGEMM after using static symbolic factorization, even with overestimated nonzeros and extra numerical operations, the overall code performance could still be competitive to SuperLU which mainly uses DGEMV.

Storage prediction for L and U factors

The purpose of the symbolic factorization is to obtain structures of and factors. Since pivoting sequences are not known until the numerical factorization, the only way to allocate enough storage space for the fill-ins generated in the numerical factorization phase is to overestimate. Given a sparse matrix with a zero-free diagonal, a simple solution is to use the Cholesky factor of . It has been shown that the structure of can be used as an upper bound for the structures of and factors regardless of the choice of the pivot row at each step [12]. But it turns out that this bound is not very tight. It often substantially overestimates the structures of the and factors.

We consider another method instead [12]. The basic idea is to statically consider all the possible pivoting choices at each step. The space is allocated for all the possible nonzeros that would be introduced by any pivoting sequence that could occur during the numerical factorization. We summarize the symbolic factorization method briefly as follows.

The nonzero structure of a row is defined as a set of column indices at which nonzeros or fill-ins are present in the given matrix . Since the nonzero pattern of each row will change as the factorization proceeds, we use to denote the structure of row after step of the factorization and to denote the structure of the matrix after step . And denotes the element in . Notice that the structures of each row or the whole matrix cover the structures of both and factors. In addition, in the process of symbolic factorization we assume that no exact numerical cancellation occurs. Thus, we have

We also define the set of candidate pivot rows at step as follows:

We assume that is always a nonzero. For any nonsingular matrix which does not have a zero-free diagonal, it is always possible to permute the rows of the matrix so that the permuted matrix has a zero-free diagonal [6]. Though the symbolic factorization does work on a matrix that contains zero entries in the diagonal, it is not preferable because it makes the overestimation too generous. The symbolic factorization process will iterate steps and at step , for each row , its structure will be updated as:

Essentially the structure of each candidate pivot row at step will be replaced by the union of the structures of all the candidate pivot rows except those column indices less than . In this way it is guaranteed that the resulted structure will be able to accommodate the fill-ins introduced by any possible pivot sequence. A simple example in Figure 2 demonstrates the whole process.

Figure 2: The first 3 steps of the symbolic factorization on a sample 5x5 sparse matrix. The structure remains unchanged at steps 4 and 5.

This symbolic factorization is applied after an ordering is performed on the matrix to reduce fill-ins. The ordering we are currently using is the multiple minimum degree ordering for . We also permute the rows of the matrix using a transversal obtained from Duff's algorithm [6] to make have a zero-free diagonal. The transversal can often help reduce fill-ins [7].

In the SuperLU, symbolic factorization is conducted dynamically according to the actual pivoting choice so that overestimation is not an issue. But the symbolic factorization contributes average 20--45% to the total factorization time for the tested matrices. Even for matrices with the same sparsity structure but with different numerical values, the dynamic symbolic factorization has to be re-performed. One advantage of the static symbolic factorization is that the result can be reused for matrices with the same initial nonzero structure. A set of nonsymmetric testing matrices from the Harwell-Boeing collection are listed in Table 1. We have compared the number of nonzeros obtained by the static approach and the number of nonzeros obtained by SuperLU for these matrices. The results in Table 2 show that the over-estimation usually leads to about 30--50% more nonzeros, which is acceptable.

Matrix	Identifier	Order	Nonzeros
1 2 3 4 5 6 7	sherman5 lnsp3937 lns3937 sherman3 jpwh991 orsreg1 saylr4	3312 3937 3937 5005 991 2205 3564	20793 25407 25407 20033 6027 14133 22316

Table 1: Testing matrices from the Harwell-Boeing collection.

Matrix	1	2	3	4	5	6	7
NZ(Static)	326434	694257	709337	625063	205016	585686	986203
NZ(SuperLU)	250108	453904	459002	443349	141927	414699	670964
Ratio	1.31	1.53	1.55	1.41	1.44	1.41	1.47

Table 2: The number of nonzeros obtained by the static symbolic factorization and the dynamic approach in SuperLU.

The extra nonzeros do imply additional computational cost. For example, one has to either check if a symbolic nonzero is an actual nonzero during the numerical factorization, or directly perform arithmetic operations which could be unnecessary. If we can aggregate these element-wise floating point operations and maximize the use of BLAS-3 subroutines, the sequential code performance will still be competitive. Thus it is necessary to identify dense structures in a sparse matrix after the static symbolic factorization.

L/U supernode partitioning and dense structure identification

Supernode partitioning is a commonly used technique to improve the cache performance of sparse code [1]. For a symmetric sparse matrix, a supernode is defined as a group of contiguous columns that have identical nonzero patterns. Excellent performance has been achieved in [19,22,23] using supernode partitioning for Cholesky factorization. However, the above definition is not directly applicable to sparse LU with nonsymmetric matrices. A good analysis for defining unsymmetric supernodes is available in [4]. Basically there could be several types of unsymmetric supernodes, namely T1, T2, T3 and T4 based on different criteria. Their results show that T2 and T3 are the best choices in terms of performance. In this paper we choose type T2 supernode. It is defined as a group of consecutive columns in which the corresponding factor has a dense lower triangular diagonal block and the same nonzero pattern below the diagonal block. Based on this definition, in each column block the part only contains dense subrows. We call this partitioning method L supernode partitioning. Here by ``subrow'' we mean the contiguous part of a row in a supernode. However, the issue of identifying dense structures in a factor is not addressed by supernode types T2 and T3 [4] and a standard compressed sparse column format is used to store a factor.

For the dynamic pivoting approach in SuperLU, it is difficult to explore structure regularity in a factor after L supernode partitioning. However in our approach, there are dense columns (or subcolumns) in a factor that we can identify after the static symbolic factorization. We discuss the U partitioning strategy as follows. After a L supernode partition has been obtained on a sparse matrix , i.e., a set of column blocks with possible different block sizes, the same partition is applied to the rows of the matrix to further break each supernode into submatrices. Now each off-diagonal submatrix in the part is either a dense block or contains dense blocks. Furthermore, we use the following theorem to identify dense subcolumn structures in the factor. This is the key to maximize the use of BLAS-3 subroutines in our algorithm.

In the following theorem and corollaries, we show that our strategy will be successful and there is a rich set of dense structures to exploit. The proofs can be found in [10]. The following notations will be used through the rest of the paper.

The L and U partitioning divides the columns of into column blocks and the rows of into row blocks so that the whole matrix is divided into submatrices. Notice that is not necessarily the number of supernodes because in order to fit into cache, we have to break big supernodes into smaller ones. For submatrices in the factor, we denote them as for . For submatrices in the factor, we denote them as for . If , denotes the diagonal submatrix.
Define as the starting column (or row) number of the -th column (or row) block.
A subcolumn (or subrow) is a column (or row) in a submatrix. For simplicity, we use a global column (or row) index to denote a subcolumn (or subrow) in a submatrix. For example, by subcolumn in the submatrix block , it means the subcolumn in this submatrix with the global column index where . Similarly we use to indicate an individual nonzero element based on global indices. A compound structure in L or U is a submatrix, a subcolumn, or a subrow.
A compound structure is nonzero if it contains at least one nonzero element or fill-in. In the following we will not differentiate between nonzero and fill-in entries. They are all considered as nonzero elements. Notice that an algorithm only needs to operate on nonzero compound structures. A compound structure is dense if all of its elements are nonzeros.

The above theorem shows that U partitioning can identify a rich set of dense subcolumns or even dense submatrices in a factor. We also further incorporate this result with supernode amalgamation in Section 7 and our experiments indicate that more than 64% of numerical updates is performed by the BLAS-3 routine DGEMM in , which shows the effectiveness of the U partitioning method. Figure 3 shows the result of a supernode partitioning on a 7x7 sample sparse matrix. One can see that all the submatrices in the upper triangular part of the matrix only contain dense subcolumns.

Figure 3: An example of L/U supernode partitioning.

Using the above theorem, we can further show the structural relationship between two submatrices in the same supernode column block, which will be useful in implementing our algorithm to detect nonzero structures efficiently for numerical updating.

Program partitioning and task dependence graphs

After dividing a sparse matrix into submatrices using the L/U supernode partitioning, we need to partition the LU code accordingly to define coarse grained tasks which can effectively manipulate the partitioned dense data structures. For the convenience of illustration, we use a two-dimensional array to represent the submatrix structure of , where is the number of submatrices in both dimensions. That is a nonzero implies that the corresponding submatrix or in is nonzero. Therefore is viewed as a 2-D block structure.

The granularity of the program partitioning is directly related to the available parallelism. A 2-D submatrix oriented partitioning is the desirable granularity because it gives maximum parallelism. However it may not be efficient for parallel sparse partial pivoting on distributed memory machines. If the submatrices in the same column block are assigned to different processors, it will require a substantial number of inter-processor messages to conduct pivot searching and row interchanges. In order to reduce communication, we keep the submatrices, both from and part, of the same column block in the same processor. This strategy has the advantage that both pivot searching and row interchange can be done locally.

The column block partitioning follows the supernode structure on the sparse matrix. Each column block is associated with two types of tasks: Factor(k) and Update(i,k) for some . Task Factor(k) is to factorize all the columns in the -th column block, including finding the pivoting sequence associated with those columns. Instead of performing the row interchange to the other part of the matrix right after each pivoting search, a technique called ``delayed-pivoting" is used [3]. In this technique, the pivoting sequence is held until the factorization of the -th column block is completed. Then the pivoting sequence is applied to the rest of the matrix, i.e., interchange rows. Delayed-pivoting is important, especially to the parallel algorithm, because it is equivalent to aggregating multiple small messages into a larger one. We will vectorize messages as much as possible to reduce expensive communication start-up overhead. Task Update(i,k) exists only for those 's such that is not zero. It receives column block together with the pivoting information to update column block . Here again the pivoting sequence is further packed together with the content of the -th column block.

An outline of the partitioned sparse LU factorization algorithm with partial pivoting is described in Figure 4. The code of Factor(k) is summarized in Figure 5. It uses BLAS-1 and BLAS-2 subroutines. The computational cost of the numerical factorization is mainly dominated by Update() tasks. The implementation of task Update(k,j) is presented in Figure 6. The lines (04) and (11) are using dense matrix multiplications.

Figure 4: A partitioned sparse LU factorization with partial pivoting.

Figure 5: The description of task Factor(k).

Figure 6: The description of task Update(k,j).

We use DAGs to model the irregular parallelism arising in this partitioned sparse LU program. Previous work on sparse Cholesky factorization has used elimination trees (e.g. [22]), which can accurately capture the available parallelism because the input matrix is symmetric and pivoting is not involved. For nonsymmetric sparse LU, one can also induce some parallelism from a row-wise elimination tree [13], but it is not as straightforward as the one for sparse Cholesky. Dynamically created DAGs have been used for modeling parallelism in a nonsymmetric multi-frontal method [18].

Given the task definitions in Figures 4, 5 and 6 we can define the structure of a sparse LU task graph in the following.

There are tasks Factor(k) (or for short), where .
There is a task Update(k,j) (or for short) if , where . For a dense matrix, there will be a total of updating tasks.
There is a dependence edge from to task , for any such that .
There is a dependence from to , where and there exists no task such that .

Figure 7(a) shows the nonzero pattern of the partitioned matrix shown in Figure 3. Figure 7(b) is the corresponding task dependence graph.

Figure 7: (a) The structure information array for the example in Figure 3. (b) The dependence graph derived from the partitioning result.

Task scheduling and run-time support

We discuss how sparse LU tasks can be mapped onto processors so that parallel time can be minimized and how a schedule can be executed efficiently.

Scheduling sparse LU tasks on distributed memory machines

For dense problems, one can use a regular mapping such as block cyclic mapping [16]. These schemes usually result in good load balancing and decent speedup. However it is not agreed that what the best mapping strategy is for irregular applications such as sparse matrix computations.

For sparse Cholesky factorization, researchers have used heuristic mapping strategies [20,23] based on elimination trees. For sparse LU, an elimination tree does not directly reflect the available parallelism and therefore scheduling based on the elimination tree is difficult. [13] uses a dynamic load balancing algorithm on a shared memory machine. For distributed memory machines, dynamic and adaptive load balancing work well for problems with very coarse grained computations, but it is still an open problem to balance the benefits of dynamic scheduling with the run-time control overhead since task and data migration cost is too expensive for sparse problems with mixed granularities.

We use task dependence graphs to guide scheduling and have investigated two types of scheduling schemes. In both schemes, the owner-compute rule is used to assign all tasks that update the same column block to the same cluster so that unnecessary communication can be eliminated.

Compute-ahead scheduling. This is to use block-cyclic mapping of tasks with a compute-ahead execution strategy, which is demonstrated in Figure 8. This idea has been used to speed up parallel dense factorizations [16]. It executes the numerical factorization layer by layer based on the current submatrix index. The parallelism is exploited for concurrent updating. In order to overlap computation with communication, the factorization is executed as soon as and (if it exists) finish so that the pivoting sequence and column block for the next layer can be broadcast as early as possible.
Graph scheduling. We assign clusters to processors and order task execution within each processor using the algorithms in [25]. The basic optimizations are balancing processor loads and overlapping computation with communication to hide communication latency. These are done by utilizing global dependence structures and critical path information.

Figure 8: The parallel sparse LU factorization algorithm with partial pivoting using compute-ahead schedule.

Graph scheduling has already been shown effective for irregular problems such as adaptive n-body simulation [14] and sparse Cholesky factorization [8,9]. For sparse LU, it can better exploit irregular parallelism compared to the compute-ahead scheduling. The major reason is that the compute-ahead scheduling is not as efficient in overlapping communication with computation. We demonstrate this point using the LU task graph in Figure 7. For this example, the Gantt charts of the compute-ahead schedule and the schedule derived by our graph scheduling algorithm are listed in Figure 9. It is assumed that each task has a computation weight 2 and each edge has communication weight 1. It is easy to see that our scheduling approach produces a better result than the compute-ahead schedule. If we look at the compute-ahead schedule carefully, we can see that the reason is that it can look ahead only one step so that the execution of the task is placed after . On the other hand, the graph scheduling algorithm detects that can be executed before which leads to better overlap of communication with computation. Also the simple cyclic mapping could result in load imbalance, and the graph scheduling algorithm can do a better job in balancing processor loads.

Figure 9: (a) A schedule derived by our graph scheduling algorithm. (b) A compute-ahead schedule.

In conclusion, our graph scheduling scheme outperforms the compute-ahead scheduling in a theoretical sense. However, the efficient execution of a task graph schedule is the key to reach this goal. We discuss the run-time support strategies for executing tasks in the next subsection.

The RAPID run-time system

We have implemented our scheme using the RAPID run-time system [9]. This system provides a set of library functions for specifying irregular data objects and tasks that access these objects. The system then extracts a task dependence graph from data access patterns, schedules and executes tasks efficiently on a distributed memory machine.

The RAPID run-time support provides an efficient task communication protocol to execute scheduled tasks. We summarize the main optimizations as follows.

Eliminating the buffering and copying overhead in the asynchronous communication. The key idea is to use Remote Memory Access(RMA) to communicate a data object between two processors. It does not incur any copying/buffering during a data transfer. The functionality of RMA suffices to fit our task protocol and the overhead is kept as small as possible. RMA is available in modern multi-processor architectures such as Cray-T3D [24] and Meiko CS-2 [8].
Eliminating redundant communication. A task may send the same message to several successors and some of these successors could be assigned to the same processor. In this case, it is enough to send this message once to the processor on which those successors reside [25].
Eliminating unnecessary synchronizations. Since the RMA directly writes data to a remote address, it is possible that the content at the remote address is still being used by other tasks and then the execution at the remote processor could be incorrect. Thus for a general computation, a permission to write the remote address needs to be obtained before issuing a remote write. However in the RAPID system, this hand-shaking process can be avoided by a carefully designed task communication protocol [9].

It should be mentioned that un-synchronized direct remote memory access may overwrite some live data and cause data inconsistencies during the execution. In [8,9], we prove that if a task graph is dependence-complete, i.e., all data dependencies are captured in this graph and the graph is sequentializable, the task schedule execution by RAPID is always correct. It is easy to show that our sparse LU task graphs are dependence-complete. Hence there is no data inconsistency during the schedule execution.

As it will be seen in Section 8.2, the above run-time optimization strategies are effective in executing sparse LU graphs. The experiments show that more than 70% of the speedup predicted by the scheduler can be delivered on Cray-T3D. In addition, using RAPID system greatly reduces the amount of work to parallelize the sparse LU algorithm.

Supernode amalgamation

For most tested sparse matrices, the average size of a supernode after L/U partitioning is very small, about 1.5 to 2 columns. This results in very fine grained tasks. Amalgamating small supernodes can lead to great performance improvement for both parallel and sequential sparse codes because it can improve cache performance and reduce inter-processor communication overhead.

There could be many ways to amalgamate supernodes [22,4]. The basic idea is to relax the restriction that all the columns in a supernode must have exactly the same nonzero structure. The amalgamation is usually guided by a supernode elimination tree. A parent could be merged with its children if the merge does not create too many extra zero entries. Row and column permutations are needed if the parent is not consecutive with its children.

However, the column and/or row permutation introduced by the above amalgamation method could undermine the correctness of the static symbolic factorization. We give an example to demonstrate this point. The left part of Figure 10 shows a sparse structure after the static symbolic factorization. Assume that column 1 and row 2 are interchanged by a permutation. The right part of Figure 10 depicts the result of the interchange. Then the new structure does not correctly predict nonzero patterns for all possible pivoting sequences. Suppose that at step 1 of the numerical factorization, the fourth row is chosen as the pivot row. Then there will be no space in the first row to accommodate the two nonzero elements swapped from the fourth row (indicated by the question marks in Figure 10). Similarly we can find counter examples showing that row permutation or column/row permutation can cause the result of static symbolic factorization incorrect.

Figure 10: An example showing how a column permutation could invalidate the result of symbolic factorization.

We have used a simple approach that does not require any permutation. This approach only amalgamates consecutive supernodes if their nonzero structures only differ by a small number of entries. We can control the maximum allowed differences by an amalgamation factor r. Our experiments show that when r is in the range of 4-6, it gives the best performance for the tested matrices and leads to 16--50% improvement on the execution times of the sequential code. Compared to the 5--15% running time improvement by relaxed supernodes in [4], our improvement is much more. The reason is that by getting bigger supernodes, we are getting larger dense structures, although there may be a few zero entries in them, and we are taking more advantage of BLAS-3 kernels. Notice that after applying the supernode amalgamation, the dense structures identified in the Theorem 1 are not strictly dense any more. We call them almost-dense structures and can still use the result of Theorem 1 with a minor revision. That is summarized in the following corollary. All the results presented in Section 8 are obtained using supernode amalgamation.

Experimental studies

Our experiments are conducted on a Cray-T3D distributed memory machine. Each processing element of T3D includes a DEC Alpha EV4(21064) processor with 64 megabytes of memory. The cache size is 8K per processor. The BLAS-3 matrix-matrix multiplication routine DGEMM can achieve 103 MFLOPS, and the BLAS-2 matrix-vector multiplication routine DGEMV can reach 85 MFLOPS. These figures are obtained assuming all the data is in cache and using cache read-ahead optimization on T3D. The communication network of the T3D is a 3-D torus. The Cray provides a shared memory access library called shmem which can achieve 126 MBytes/s bandwidth and communication overhead using shmem_put() primitive [24]. The implementation of the RAPID system on T3D uses shmem_put() for remote memory access.

Sequential performance

The sequential performance on the testing matrices from Table 1 are listed in Table 3. In order to evaluate our sequential code, we also list the performance of SuperLU on these matrices. In calculating the achieved MFLOPS, we use the operation count obtained from SuperLU and do not include extra floating point operations introduced by the overestimation.

Table 3: Sequential performance on T3D: versus SuperLU.

Though the static symbolic factorization has introduced a lot of extra computation work, the performance of after nonsymmetric L/U partitioning is consistently competitive to that of highly optimized SuperLU. The absolute single node performance that has been achieved by the approach is in the range of 6--10% of the DGEMM performance for those matrices of small or medium sizes. Considering the fact that sparse codes usually suffer poor cache reuse, this performance is reasonable. In addition, the amount of computation for these testing matrices is small, ranging from 18 to 107 million double precision floating operations. Since the characteristic of the approach is to explore more dense structures and utilize BLAS-3 kernels, better performance is expected on larger or denser matrices. This is verified empirically in Table 4. The first matrix goodwin is a relatively large sparse matrix and reaches 15% of DGEMM performance. Matrix b33_5600 is truncated from BCSSTK33 because of the current implementation is not able to handle the entire matrix due to memory constraint. For a 1000x1000 dense matrix, 63.6 MFLOPS has been achieved, which is about 62% of the peak BLAS-3 performance and 68% of the LAPACK DGETRF performance. And for the same dense matrix, SuperLU achieves 34 MFLOPS.

Table 4: Sequential performance of the approach for larger or denser matrices on T3D.

We present a quantitative analysis to explain why can be competitive to SuperLU.

We quantify the percentage of numerical updates that is performed in the approach using BLAS-3 kernel in Table 6. Assume the speed of BLAS-2 kernel is and the speed of BLAS-3 kernel is . The total amount of numerical updates is for SuperLU and for the . Apparently . For simplicity, we ignore the computation from scaling part within each column because it contributes very little to the total execution time. Hence we have:

where is the time spent on dynamic symbolic factorization in the SuperLU approach, is the percentage of the numerical updates that are performed by DGEMM in . Let be the ratio of symbolic factorization time to numerical factorization time in SuperLU, then we simplify Equation 1 to the following:

Table 5: Ratios of the number of floating point operations performed for numerical factorization in and in SuperLU.

From [4] we can see that for the tested matrices, and from Table 6 we can see that . We also calculate the ratios of the number of floating point operations performed in and SuperLU for the tested matrices in Table 5. In average, the value of is 3.98. We plug in these typical parameters in Equation 2 and 3, and we have:

For T3D, and . Then we can get , which is close to the ratios obtained in Table 3. The discrepancy is caused by the fact that the submatrix sizes of supernodes are non-uniform, which leads to different caching performance. If submatrices are of uniform sizes, we expect our prediction is more accurate. For instance, in the dense case, is exactly 1. The ratio is calculated as 0.48, which is almost the same as the ratio of the MFLOPS obtained: 34/63.6=0.53.

The above analysis shows that using BLAS-3 as much as possible makes competitive to SuperLU. Suppose in a machine that DGEMM outperforms DGEMV substantially in terms of cache performance, could be faster than SuperLU. Table 6 also shows that ratios of execution times for with and without using BLAS-3 routines. is the execution time obtained by using BLAS-3 routines as much as possible. is the execution time by replacing all the BLAS-3 routines in by BLAS-2 routines. As a result, is usually less than . This can again be verified by Equation 4 as follows:

Table 6: Percentage of the numerical updates which is performed using BLAS-3 routines and the ratios of the execution times with and without using BLAS-3 routines.

Parallel performance

In this subsection, we report a set of experiments conducted to examine the overall parallel performance of and the effectiveness of its optimization strategies.

Overall Performance: The overall speedups of for the tested matrices are shown in Figure 11. scales fairly well up to 32 processors for all the matrices. The speedups range from 6.3 to 11.4 on 32 processors. However the speedups remain almost the same on 64 processors for those matrices except goodwin and b33_5600, which are much larger problems than the rest. The reason is that those small tested matrices do not have enough amount of computation to saturate a large number of processors. The execution times for most of these matrices are less than 1 second on 32 processors. Even for goodwin the execution time on 64 processors is about 3 seconds and therefore an absolute performance of 216 MFLOPS has been achieved. For b33_5600, 412 MFLOPS has been achieved on 64 processors, and 438 MFLOPS on 128 processors. It is our belief that better and more scalable performance can be obtained on larger matrices. This trend can be seen on the matrix goodwin and b33_5600. But currently the available memory on each node of T3D limits the problem size that can be solved with the current version of . We are looking at the issues of how to execute large task graphs under memory constraints.

Figure 11: Parallel speedups on Cray-T3D.

Effectiveness of the Graph Scheduling: We have implemented our sparse LU code using a column-block cyclic mapping with compute-ahead schedule (referred to as ) and compared the performance with that of the code (referred to as ) which uses the RAPID system to schedule and execute tasks. Note that in the implementation of executing a compute-ahead schedule, the communication optimizations used in the RAPID system are still included. The results are shown in Figure 12. The Y axis is , where stands for Parallel Time. For 2 and 4 processors, in certain cases, the compute-ahead code is slightly faster than the RAPID code. But for the number of processors more than 4, the RAPID code runs 25--68% faster. The more processors involved, the bigger the performance gap is. In fact, from 16 processors to 32 processors, in all cases the parallel times of the compute-ahead code actually increase. The reason is that for a small number of processors, there are sufficient tasks making all processors busy and the compute-ahead schedule performs well while the RAPID code suffers a certain degree of control overhead. For a larger number of processors, schedule optimization becomes important since there is limited parallelism to exploit.

Figure 12: Impact of different scheduling and mapping schemes on Cray-T3D.

We have also run both codes on a dense matrix of size 1000x1000, and we list the results in Table 7. Not surprisingly, the compute-ahead code outperforms the RAPID code in most cases because there is sufficient parallelism to exploit. But in the 32 processor case, the RAPID code is still faster than the compute-ahead code. It is safe to conclude that the graph scheduling algorithm is effective in mapping irregular problems.

Table 7: Ratios of parallel times by RAPID code and compute-ahead code for a dense matrix.

Effectiveness of the Run-time Support: A good scheduling algorithm does not suffice to produce an efficient code. A run-time system is necessary to convert the scheduled performance to real performance. The run-time support of the RAPID system incorporates several communication optimizations and provides a very low overhead task communication protocol. We plot the difference between the actual speedups on T3D and the speedups predicted by the scheduler in Figure 13. The Y axis is the ratio . The parallel code using RAPID has been able to achieve more than 70% of the scheduled speedup. In certain cases, the actual speedups are more than the predicted speedups. This result is a clear indication that the RAPID system is suitable for parallelizing the sparse pivoting code. The run-time overhead is small enough to make the run-time situation match the static schedule closely.

Figure 13: Difference between actual and predicted speedups on Cray-T3D.

Effectiveness of supernode amalgamation: We have examined how effective our supernode amalgamation strategy is. Let and be the parallel time with and without supernode amalgamation respectively. The parallel time improvement ratio for several testing matrices are listed in Table 8. Apparently the supernode amalgamation has brought significant improvement due to the increase of supernode size which implies an increase of the task granularities. This is important to obtain good parallel performance [15].

Matrix	P=1	P=2	P=4	P=8	P=16	P=32
1 4 5 6 7	48% 20% 50% 16% 20%	53% 31% 55% 23% 26%	56% 34% 56% 31% 27%	56% 41% 61% 27% 32%	59% 27% 56% 22% 36%	47% 20% 48% 14% 25%

Table 8: Parallel time improvement obtained by supernode amalgamation.

Conclusions

In this paper we present a new approach for efficiently parallelizing sparse LU factorization with partial pivoting on distributed memory machines. The major contribution of this paper is that we integrate several optimization techniques together such as static symbolic factorization, graph scheduling and run-time support for asynchronous parallelism, extend nonsymmetric supernode partitioning techniques to effectively identify dense structures, and maximize the use of BLAS-3 subroutines in the algorithm design. Using these ideas, we are able to exploit more data regularity for this open irregular problem and achieve decent performance on both uni-processor and multi-processors. Our current experiments are conducted on matrices of small or medium sizes, we believe this approach has a great potential to get better performance on larger matrices and plan to investigate this in future work. It should be noted that the static symbolic factorization could fail to be practical if the input matrix has a nearly dense row because it will lead to an almost complete fill-in of the whole matrix. Fortunately this is not the case in most of matrices derived from real applications. Therefore our approach is still applicable to a wide range of problems.

Acknowledgments

This work is supported by NSF CCR-9409695 and a startup fund from UCSB. We would like to thank Xiaoye Li and Jim Demmel for helpful discussion and providing us their testing matrices and SuperLU code, Dan Andresen, Apostolos Gerasoulis, Ed Rothberg, Rob Schreiber, Horst Simon and Kathy Yelick for their valuable comments, and Stefan Boeriu for his help in accessing Cray-T3D.

References

[1] C. Ashcraft, R. Grimes, J. Lewis, B. Peyton, and H. Simon. Progress in Sparse Matrix Methods for Large Sparse Linear Systems on Vector Supercomputers. International Journal of Supercomputer Applications, 1:10-30, 1987.

[2] T. Davis. User's Guide for the Unsymmetric-pattern Multifrontal Package (UMFPACK). Technical Report TR-93-020, Computer and Information Sciences Department, University of Florida, June 1993.

[3] J. Demmel. Numerical Linear Algebra on Parallel Processors. Lecture Notes for NSF-CBMS Regional Conference in the Mathematical Sciences, June 1995. To be published as a book by SIAM.

[4] J. Demmel, S. Eisenstat, J. Gilbert, X. Li, and J. Liu. A Supernodal Approach to Sparse Partial Pivoting. Technical Report CSD-95-883, UC Berkeley, September 1995.

[5] J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson. An Extended Set of Basic Linear Algebra Subroutines. ACM Trans. on Mathematical Software, 14:18-32, 1988.

[6] I. S. Duff. On Algorithms for Obtaining a Maximum Transversal. ACM Transactions on Mathematical Software, 7(3):315-330, September 1981.

[7] I. S. Duff. Personal Communication. 1996.

[8] C. Fu and T. Yang. Efficient Run-time Support for Irregular Task Computations with Mixed Granularities. In Proceedings of IEEE International Parallel Processing Symposium, pages 823-830, Hawaii, April 1996.

[9] C. Fu and T. Yang. Run-time Compilation for Parallel Sparse Matrix Computations. In Proceedings of ACM International Conference on Supercomputing, pages 237-244, Philadelphia, May 1996.

[10] C. Fu and T. Yang. Sparse LU Factorization with Partial Pivoting on Distributed Memory Machines. Technical Report TRCS96-18, University of California at Santa Barbara, 1996.

[11] K. Gallivan, B. Marsolf, and H. Wijshoff. The Parallel Solution of Nonsymmetric Sparse Linear Systems Using H* Reordering and an Associated Factorization. In Proc. of ACM International Conference on Supercomputing, pages 419-430, Manchester, July 1994.

[12] A. George and E. Ng. Symbolic Factorization for Sparse Gaussian Elimination with Partial Pivoting. SIAM J. Scientific and Statistical Computing, 8(6):877-898, November 1987.

[13] A. George and E. Ng. Parallel Sparse Gaussian Elimination with Partial Pivoting. Annals of Operations Research, 22:219-240, 1990.

[14] A. Gerasoulis, J. Jiao, and T. Yang. Scheduling of Structured and Unstructured Computation. In D. Hsu, A. Rosenberg, and D. Sotteau, editors, Interconnections Networks and Mappings and Scheduling Parallel Computation. American Math. Society, 1995.

[15] A. Gerasoulis and T. Yang. On the Granularity and Clustering of Directed Acyclic Task Graphs. IEEE Transactions on Parallel and Distributed Systems, 4(6):686-701, June 1993.

[16] G. Golub and J. M. Ortega. Scientific Computing: An Introduction with Parallel Computing Compilers. Academic Press, 1993.

[17] A. Gupta and V. Kumar. Optimally Scalable Parallel Sparse Cholesky Factorization. In Proc. of 7th SIAM Conf. on Parallel Processing for Scientific Computing, pages 442-447, February 1995.

[18] S. Hadfield and T. Davis. A Parallel Unsymmetric-pattern Multifrontal Method. Technical Report TR-94-028, Computer and Information Sciences Departmenmt, University of Florida, August 1994.

[19] M. Heath, E. Ng, and B. Peyton. Parallel Algorithms for Sparse Linear Systems. SIAM Review, 33(3):420-460, September 1991.

[20] J. W. H. Liu. Computational Models and Task Scheduling for Parallel Sparse Cholesky Factorization. Parallel Computing, 18:327-342, 1986.

[21] R. Lucas, T. Blank, and J. Tiemann. A Parallel Solution Method for Large Sparse Systems of Equations. IEEE Transactions on Computer-Aided Design, CAD-6(6):981-991, November 1987.

[22] E. Rothberg. Exploiting the Memory Hierarchy in Sequential and Parallel Sparse Cholesky Factorization. PhD thesis, Dept. of Computer Science, Stanford, December 1992.

[23] E. Rothberg and R. Schreiber. Improved Load Distribution in Parallel Sparse Cholesky Factorization. In Proc. of Supercomputing'94, pages 783-792, November 1994.

[24] T. Stricker, J. Stichnoth, D. O'Hallaron, S. Hinrichs, and T. Gross. Decoupling Synchronization and Data Transfer in Message Passing Systems of Parallel Computers. In Proc. of International Conference on Supercomputing, pages 1-10, Barcelona, July 1995.

[25] T. Yang and A. Gerasoulis. PYRROS: Static Task Scheduling and Code Generation for Message-Passing Multiprocessors. In Proc. of 6th ACM International Conference on Supercomputing, pages 428-437, 1992.

Author Biographies:

Cong Fu received his B.S. degree in Computer Science from University of Science and Technology of China in 1991. He has been a research assistant in China National Research Center for Intelligent Computing Systems and Institue for Computing Technology, Chinese Academy of Sciences from 1991 to 1993 before he joined University of California at Santa Barbara in Fall 1993. He received his M.S. degree in Computer Science from UCSB in 1995. Now he is a PH.D. candidate of Computer Science in UCSB. His research interests inlcude efficient communication mechanisms, parallel scientific computing and scheduling loop/task parallelism.

Tao Yang received the B.S. degree in Computer Science from Zhejiang University, China, in 1984. He received the M.S. and Ph.D. degrees in Computer Science from Rutgers University in 1990 and 1993, respectively. He is an Assistant Professor at the Department of Computer Science, University of California, Santa Barbara. His main research interests are algorithms and programming environments for parallel and distributed processing, program scheduling and compilation, parallel scientific computing and digital libraries.Dr. Yang served as a guest editor for a special issue on partitioning and scheduling in Journal Parallel Processing Letters, and a program or organizing committee member for several parallel computing conferences and workshops including IPPS'97, Irregular'96, Euro-Par'96, SPDP'96, HiPC'96, HICSS'96 and '95, PPoPP'95 and SPAA'95. Dr. Yang is on the editorial boards of the CD ROM Journal of Computing and Information, Discrete Mathematics & Theoretical Computer Science.