Determining the structure of biological macromolecules such as proteins and nucleic acids is an
important element of molecular biology because of the intimate relation between form and
function of these molecules. Individual sources of data about molecular structure are subject to
varying degrees of uncertainty. Previously we have examined the parallelization of a
probabilistic algorithm for combining multiple sources of uncertain data to estimate the
three-dimensional structure of molecules and also predict a measure of the uncertainty in the
estimated structure. In this paper we extend our work on two major fronts. First, we present a
hierarchical decomposition of the original algorithm which substantially reduces the sequential
computational complexity. The hierarchical decomposition in turn reveals a new axis of
parallelism not present in the "flat" organization of the problem, as well as new parallelization
challenges. We demonstrate good speedups on two cache-coherent shared-memory
multiprocessors, the Stanford DASH and the SGI Challenge, with distributed and centralized
memory organization, respectively. Our results point to several areas of further study to make
both the hierarchical and the parallel aspects more flexible for general problems: automatic
structure decomposition, processor load balancing across the hierarchy, and data locality
management in conjunction with load balancing. Finally we outline the directions we are
investigating to incorporate these extensions.
The O(N) hierarchical N-body algorithms and Massively Parallel Processors allow particle
systems of 100 million particles or more to be simulated in acceptable time. We present a
data-parallel implementation of Anderson's method and demonstrate both efficiency and
scalability of the implementation on the Connection Machine CM-5/5E systems. The
communication time for large particle systems amounts to about 10-25%, and the overall
efficiency is about 35%. The evaluation of the potential field of a system of 100 million particles
takes 3 minutes and 15 minutes on a 256-node CM-5E, giving expected four and seven digits of
accuracy, respectively. The speed of the code scales linearly with the number of processors
and number of particles.
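Anderson's method itself is based on an integral far-field representation, but the core idea behind all hierarchical N-body codes can be illustrated with a much simpler (and hedged) sketch: a distant cluster of particles is replaced by a single aggregate, here a monopole at the cluster's center of mass, so that far interactions cost O(1) per cluster rather than O(N) per particle. The names and cluster below are illustrative only.

```python
# Hedged illustration of the hierarchical N-body idea, NOT Anderson's method:
# a far-away cluster is approximated by one monopole at its center of mass.
import math

def exact_potential(target, sources):
    """Direct O(N) sum of 1/r contributions from each source particle."""
    return sum(m / math.dist(target, (x, y)) for (x, y, m) in sources)

def monopole_potential(target, sources):
    """Replace the whole cluster by its total mass at the center of mass."""
    total_m = sum(m for (_, _, m) in sources)
    cx = sum(x * m for (x, _, m) in sources) / total_m
    cy = sum(y * m for (_, y, m) in sources) / total_m
    return total_m / math.dist(target, (cx, cy))

# A tight cluster near the origin, evaluated from far away:
cluster = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.0, 0.1, 1.0)]
far_point = (50.0, 0.0)
```

Applied recursively over a spatial tree, this approximation is what brings the O(N^2) direct sum down to the O(N) or O(N log N) complexity the paper exploits.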
Developing and maintaining a large software package is a complex task. Decisions are made
early in the design process that affect i) the ability of a user to effectively exploit the package
and ii) the ability of a software engineer to maintain it. This case study discusses issues in
software development and maintainability of a scientific package called SnB, which is used to
determine molecular crystal structures. The design of the user interface is
discussed along with
important software engineering concepts, including modular programming, data encapsulation,
and internal code documentation. Fortran is still widely used in the scientific
community. Issues concerning the integration of Fortran into a modern scientific application with
a C-based user interface are also discussed. Scientific applications benefit from being available
on a wide variety of platforms. Due to demand, SnB is available on a variety of sequential and
parallel platforms. Methods used in the design of SnB for such portability are presented
including POSIX compliance, automatic configuration scripts, and parallel programming
techniques.
Compiling the array assignment statement of High Performance Fortran in the
presence of
block-cyclic distributions of data arrays is considered difficult, and several algorithms have
been published to solve this problem. We present a comprehensive study of the performance of
these algorithms. We classify the algorithms into families, identify several issues of
interest in the compilation process, and present experimental performance data for the various
algorithms. We demonstrate that block-cyclic distributions can be compiled almost as efficiently
as block and cyclic distributions.
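The index arithmetic that such compilers must generate can be sketched for the one-dimensional case (a hedged illustration; the function names and block size are ours, not from the paper). For a CYCLIC(b) distribution of a 1-D array over P processors, global index i lives on processor (i // b) % P at local offset (i // (b * P)) * b + i % b:

```python
# Hedged sketch of block-cyclic ownership and local-index computation.
# b = block size, P = number of processors, i = global array index.

def owner(i, b, P):
    """Processor that owns global index i under a CYCLIC(b) distribution."""
    return (i // b) % P

def local_index(i, b, P):
    """Offset of global index i within its owner's local storage."""
    return (i // (b * P)) * b + i % b

def local_elements(p, n, b, P):
    """Global indices owned by processor p, in local storage order."""
    return [i for i in range(n) if owner(i, b, P) == p]
```

Compiling an array assignment with strided sections then amounts to intersecting such ownership sets with the section's index set, which is where the published algorithms differ.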
This paper outlines the content and performance of ScaLAPACK, a collection of mathematical
software for linear algebra computations on distributed memory computers. The importance of
developing standards for computational and message passing interfaces is discussed. We
present the different components and building blocks of ScaLAPACK, and indicate the
difficulties inherent in producing correct codes for networks of heterogeneous processors.
Finally, this paper briefly describes future directions for the ScaLAPACK library and concludes
by suggesting alternative approaches to mathematical libraries, explaining how ScaLAPACK
could be integrated into efficient and user-friendly distributed systems.
The advantages of workstation clusters as a parallel computing platform include a superior
price-performance ratio, availability, scalability, and ease of incremental growth. However, the
performance of traditional LAN technologies such as Ethernet and FDDI rings is insufficient for
many parallel applications. This paper describes APACHE (Automated Pvm Application
CHaracterization Environment), an automated analysis system that uses an
application-independent model for predicting the impact of ATM on the execution time of
iterative parallel applications. APACHE has been used to predict the performance of several
core applications that form the basis for many real scientific and engineering problems. We
present a comparison of the performance predicted by APACHE with observed execution times
to demonstrate the accuracy of our model. Finally, we present a method for a simple
cost-benefit analysis that can be used to determine whether an investment in ATM equipment is
justified for a particular workstation cluster environment.
We present efficient parallel algorithms for interactive display of higher order surfaces on
current graphics systems. At each frame, these algorithms approximate the surface by polygons
and rasterize them over the graphics pipeline. The time for polygon generation for each surface
primitive varies between successive frames and we address issues in distributing the load across
processors for different environments. This includes algorithms to statically distribute the
primitives to reduce dynamic load imbalance, as well as a distributed wait-free algorithm for
machines on which re-distribution is efficient, e.g., shared-memory machines.
These algorithms
have been implemented on different graphics systems and applied to interactive display of
trimmed spline models. In practice, we are able to obtain almost linear speed-ups (as a function
of the number of processors). Moreover, the distributed wait-free algorithm is 25-30% faster
than static and dynamic schemes.
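The flavor of a wait-free work pickup on a shared-memory machine can be sketched as follows (a hedged illustration, not the authors' algorithm): each renderer thread atomically claims the next primitive index from a shared counter, so no thread blocks on a centralized scheduler even when per-primitive costs vary between frames.

```python
# Hedged sketch of wait-free work distribution via a shared claim counter.
# itertools.count's next() is effectively atomic under CPython's GIL; a real
# implementation would use a hardware fetch-and-add.
import itertools
import threading

primitives = [f"patch-{i}" for i in range(100)]   # hypothetical surface primitives
next_index = itertools.count()                    # shared claim counter
claimed = [[] for _ in range(4)]                  # per-thread record of work done

def worker(tid):
    while True:
        i = next(next_index)          # wait-free claim of the next primitive
        if i >= len(primitives):
            return
        claimed[tid].append(primitives[i])

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each primitive is claimed by exactly one thread, and faster threads simply claim more work, which is the load-balancing property the abstract attributes to the wait-free scheme.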
In this paper, we describe STREN, a parallel stereo renderer for fixed-location terrain
rendering tasks required for the simulation of planetary exploration missions. The renderer is
based on a novel spatial data representation, called the TANPO map. This data representation
stores terrain data using a simple and compact structure and provides excellent locality for such
rendering applications. Experimental results show that the renderer not only performs very well,
but also scales perfectly to different numbers of processors.
Cornell Theory Center, a national center for high performance computing, has been designing
and delivering education programs on parallel processing in traditional workshops for years.
With the advent and growth of the World Wide Web, we have been able to expand our training
efforts to a distance education format, including online lectures and exercises, communication
with CTC consultants and other participants, and logins on CTC's world-class IBM
RS/6000 SP.
This description includes workshop design, technical content covered, design of
the modules,
and participants' responses.
Many scientific applications are iterative and specify repetitive communication patterns. This
paper shows how a parallel-language compiler and custom cache-coherence protocols in a
distributed shared memory system together can implement shared-memory communication
efficiently for applications with unpredictable but repetitive communication patterns. The
compiler uses data-flow analysis to identify program points where repetitive communication
occurs. At runtime, the custom protocol builds communication schedules in one iteration and uses
them to pre-send data in following iterations. This paper contains measurements on three iterative
applications (including adaptive programs with unstructured data accesses) to show that custom
protocols increase the number of shared-data requests satisfied locally, thus reducing the
amount of time spent waiting for remote data.
This paper describes the design of a data distribution tool which automatically derives the data
mapping for the arrays and the parallelization strategy for the loops in a Fortran 77 program.
The layout generated can be static or dynamic, and the distribution is one-dimensional BLOCK
or CYCLIC. The tool takes into account the control flow statements in the code in order to
better estimate the behavior of the program. All the information regarding data movement and
parallelism is contained in a single data structure named Communication-Parallelism Graph
(CPG). The CPG is used to model a minimal path problem in which time is the
objective function
to minimize. It is solved using a general purpose linear programming solver, which finds the
optimal solution for the whole problem. The experimental results illustrate the quality of the
solutions generated and the feasibility of the approach in terms of compilation time.
Loop transformations are critical for compiling high-performance code for modern computers.
Existing work has focused on transformations for perfectly nested loops (that is, loops in which
all assignment statements are contained within the innermost loop of a loop
nest). In practice,
most loop nests, such as those in matrix factorization codes, are imperfectly nested. In some
programs, imperfectly nested loops can be transformed into perfectly nested
loops by loop
distribution, but this is not always legal. In this paper, we present an approach to transforming
imperfectly nested loops directly. Our approach is an extension of the linear loop transformation
framework for perfectly nested loops, and it models permutation, reversal, skewing, scaling,
alignment, distribution and jamming. We also give a completion procedure which generates a
complete transformation from a partial transformation.
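For perfectly nested loops, the linear framework the paper extends models a transformation as an integer matrix applied to the iteration vector. As a hedged, minimal example (ours, not the paper's): skewing a doubly nested loop by the unimodular matrix [[1, 0], [1, 1]] maps iteration (i, j) to (i, i + j), reordering the traversal without adding or dropping iterations.

```python
# Hedged illustration of a linear loop transformation (skewing).
# Skew matrix [[1, 0], [1, 1]]: (i, j) -> (i', j') = (i, i + j), so j = j' - i'.

N = 4

# Original perfectly nested loop: for i in 0..N-1, for j in 0..N-1.
original = [(i, j) for i in range(N) for j in range(N)]

# Skewed loop: the inner bounds shift with the outer index, but the set of
# iterations executed is identical.
skewed = [(ip, jp - ip)
          for ip in range(N)
          for jp in range(ip, ip + N)]
```

Legality and bound generation for such transformations are well understood for perfect nests; the paper's contribution is extending this machinery (permutation, reversal, skewing, scaling, alignment, distribution, jamming) to imperfect nests directly.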
We describe the design and discuss the performance of a parallel elastic wave propagation
simulator that is being used to model and study earthquake-induced ground motion in large
sedimentary basins. The components of the system include mesh generators, a mesh partitioner,
a parceler, and a parallel code generator, as well as parallel numerical methods for applying
seismic forces, incorporating absorbing boundaries, and solving the discretized wave
propagation problem. We discuss performance on the Cray T3D for unstructured mesh wave
propagation problems of up to 14 million tetrahedra. By paying careful attention to each step of
the process, we obtain excellent performance despite the highly irregular structure of the
coefficient matrices of the problem. The mesh generator, partitioner, parceler, and code
generator have been incorporated into an integrated toolset/compiler. This system, called
Archimedes, automates the solution of unstructured mesh PDE problems on parallel computers,
and is being used for other unstructured mesh applications beyond ground motion modeling.
An analysis is presented of several factors influencing the performance of a parallel
implementation of the UCLA atmospheric general circulation model (AGCM) on massively parallel
computer systems. Several modifications to the parallel AGCM code aimed at improving its
numerical efficiency, interprocessor communication cost, load-balance and cache efficiency are
discussed. The impact of some of the optimization strategies on the performance of the AGCM
code as implemented on several state-of-the-art parallel computers, including the Intel
Paragon, Cray T3D and IBM SP2, is presented and analyzed.
We have designed and implemented a set of highly efficient and highly scalable algorithms for an
unstructured computational package, the PSAS data assimilation package, as demonstrated by
detailed performance analysis of systematic runs on up to 512 nodes of an Intel Paragon. The
preconditioned Conjugate Gradient solver achieves a sustained 18 Gflops performance.
Consequently, we achieve an unprecedented 100-fold reduction in time to solution on the Intel
Paragon over a single head of a Cray C90. This not only exceeds the daily performance
requirement of the Data Assimilation Office at NASA's Goddard Space Flight Center, but also
makes it possible to explore much larger and challenging data assimilation problems which are
unthinkable on a traditional computer platform such as the Cray C90.
Tuning supercomputer application performance often requires analyzing the interaction of the
application and the underlying architecture. In this paper, we describe support in the MIPS
R10000 for non-intrusively monitoring a variety of processor events -- support that is
particularly useful for characterizing the dynamic behavior of multi-level memory hierarchies,
hardware-based cache coherence, and speculative execution. We first explain how
performance data is collected using an integrated set of hardware mechanisms, operating
system abstractions, and performance tools. We then describe several examples drawn from
scientific applications that illustrate how the counters and profiling tools provide information that
helps developers analyze and tune applications.
Fine tuning the performance of large parallel programs is a very difficult task. A profiling tool can
provide detailed insight into the utilization and communication of the different processors, which
helps identify performance bottlenecks. In this paper we present a profiler for the fine-grained
parallel programming language Split-C, which provides a simple global address space memory
model. As our experience shows, it is much more challenging to profile programs that make use
of efficient, low-overhead communication. We incorporated techniques which minimize profiling
effects on the running program. We quantify the profiling overhead and present several Split-C
applications which show that the profiler is useful in determining performance bottlenecks.
This paper presents a case study of modeling, evaluating, and testing the data collection
services (called an instrumentation system) of the Paradyn parallel performance measurement
tool using well-known performance evaluation and experiment design techniques. The overall
objective of the study is to use modeling- and simulation-based evaluation to provide feedback
to the tool developers to help them choose system configurations and task scheduling policies
that can significantly reduce the data collection overheads. We develop and
parameterize a
resource occupancy (ROCC) model for the Paradyn instrumentation system (IS) on an IBM SP-2
platform. This model is parameterized with a measurement-based workload characterization
and subsequently used to answer several "what if" questions regarding configuration options
and two policies to schedule instrumentation system tasks: collect-and-forward (CF) and
batch-and-forward (BF) policies. Simulation results indicate that the BF policy can significantly
reduce the overheads. Based on this feedback, the BF policy was implemented
in Paradyn IS as
an option to manage the data collection. Measurement-based testing results obtained from this
enhanced version of Paradyn IS are reported in this paper and indicate more
than 60%
reduction in the direct IS overheads when the BF policy is used.
The HINT Benchmark was developed to provide a broad-spectrum metric for computers, and to
measure performance over the full range of memory sizes and time scales. We
have extended
our understanding of why HINT performance curves look the way they do, and can now predict
the curves using an analytical model based on simple hardware specifications as input
parameters.
In this paper we analyze communication patterns in the parallel three-dimensional
Navier-Stokes solver Prism, and present performance results on the IBM SP2,
the Cray T3D
and the SGI Power Challenge XL. Prism is used for direct numerical simulation of turbulence in
non-separable and multiply-connected domains. The numerical method used in the solver is
based on mixed spectral element-Fourier expansions in (x-y) planes and z-direction,
respectively. Each (or a group) of Fourier modes is computed on a separate processor as the
linear contributions (Helmholtz solves) are completely uncoupled in the incompressible
Navier-Stokes equations; coupling is obtained via the nonlinear contributions (convective
terms). The transfer of data between physical and Fourier space requires a series of complete
exchange operations, which dominate the communication cost for small numbers of processors. As
the number of processors increases, global reduction and gather operations become important
while complete exchange becomes more latency dominated. Predictive models for these
communication operations are proposed and tested against measurements. A relatively large
variation in communication timings per iteration is observed in simulations
and quantified in terms
of specific operations. A number of improvements are proposed that could significantly reduce
the communications overhead with increasing numbers of processors, and generic
predictive maps are developed for the complete exchange operation, which remains the
fundamental communication in Prism. Results presented in this paper are representative of a
wider class of parallel spectral and finite element codes for computational
mechanics which
require similar communication operations.
Current parallel benchmarks, while appropriate for scientific applications, lack the defense
relevance and representativeness for developers who are considering parallel computers for
their Command, Control, Communication, and Intelligence (C3I) systems. We present a new set
of compact application benchmarks which are specific to the C3I application domain. The C3I
Parallel Benchmark Suite (C3IPBS) program is addressing the evaluation of not only machine
performance, but also the software implementation effort. Our methodology currently draws
heavily from the PARKBENCH [2] and NAS Parallel Benchmarks [1]. The paper presents the
benchmarking methodology, introduces the benchmarks, and reports initial results and analysis.
Finally, we describe the lessons that we have learned so far from formulating and implementing
the C3I benchmarks.
In November 1994, the NEC Corporation announced the SX-4 supercomputer. It
is the third in the SX series of supercomputers and is upward compatible from the SX-3R
vector processor with enhancements for scalar processing, short vector processing, and
parallel processing. In this paper we describe the architecture of the SX-4, which has an 8.0
ns clock cycle and a peak performance of 2 Gflops per processor. We also describe the
composition of the NCAR Benchmark Suite, designed to evaluate the computers for use on
climate modeling applications. Additionally, we contrast this benchmark suite with other
benchmarks. Finally, we detail the scalability and performance of the SX-4/32 relative to
the NCAR Benchmark Suite.
Virtual channels can be used to implement deadlock free adaptive routing algorithms and
increase network throughput. Unfortunately, they introduce asymmetries in the use of buffers of
symmetric networks such as toroidal k-ary n-cubes. In this paper we present a minimal adaptive
routing algorithm that tries to balance the use of the virtual channels by limiting the injection of
new packets into the network. The experimental results, conducted on a 256-node torus, show
that it is possible to increase the saturation point and to keep the network throughput stable at
high traffic rates. The comparison with the Chaos router, a non-minimal cut-through adaptive
router, shows that our algorithm obtains similar performance results using only a small fraction
of the buffers and a simpler router model.
The IBM SP is one of the most powerful commercial MPPs, yet, in spite of its fast processors and
high network bandwidth, the SP's communication latency is inferior to that of older machines such as the
TMC CM-5 or Meiko CS-2. This paper investigates the use of Active Messages (AM)
communication primitives as an alternative to the standard message passing in order to reduce
communication overheads and to offer a good building block for higher layers of software. The
first part of this paper describes an implementation of Active Messages (SP AM) which is
layered directly on top of the SP's network adapter (TB2). With comparable bandwidth, SP
AM's low overhead yields a round-trip latency that is 40% lower than IBM MPL's. The second
part of the paper demonstrates the power of AM as a communication substrate by layering
Split-C as well as MPI over it. Split-C benchmarks are used to compare the SP to other MPPs
and show that low message overhead and high throughput compensate for the SP's high network
latency. The MPI implementation is based on the freely available MPICH version and achieves
performance equivalent to IBM's MPI-F on the NAS benchmarks.
While all-optical networks offer large bandwidth for transferring data, the control mechanisms
to dynamically establish all-optical paths incur large overhead. In this paper, we consider the
problem of adapting all-optical multiplexed networks in multiprocessor or multicomputer
environments by using compiled communication as an alternative to dynamic network control. In
compiled communication, the network resources are managed statically and, therefore, run-time
control overhead is eliminated. In addition, complex off-line algorithms can be incorporated to
manage the network resources more efficiently. We studied several off-line connection
scheduling algorithms for optimizing the multiplexing degree required to satisfy communication
requests. The performance of compiled communication for communication patterns that can be
determined at compile time in application programs is evaluated and compared with dynamically
controlled communication assuming a two-dimensional torus topology. Our results show that
compiled communication substantially outperforms dynamic communication for these
communication patterns. Since most of the communication patterns in parallel applications can be
determined at compile time, we conclude that compiled communication is an effective mechanism
for all-optical networks in multiprocessor environments.
In multivector processors, the lost cycles due to conflicts between concurrent vector streams
make the effective throughput lower than the peak throughput. When the request rate of all
the concurrent vector streams to every memory module is less than or equal to the service rate,
conflicts appear because concurrent vector streams reference memory modules
in different
orders. In addition, in a memory system where several memory modules are mapped in every
bus (complex memory system) bus conflicts are added to memory module conflicts. This paper
proposes an access order to the vector stream elements that reduces the average memory
access time in vector processors with complex memory systems. When the request rate is greater
than the service rate, the proposed order reduces the number of lost cycles, and the effective
throughput increases. In other cases, the effective throughput reaches the peak throughput. The
proposed order generates the memory references in such a way that the memory modules
shared by the concurrent self-conflict-free vector streams, and the sections where memory
modules are mapped, are referenced using the same order.
Formation by gravitational collapse of galaxies and the large-scale structure of the universe is a
nonlinear, multi-scale, multi-component problem. This complex process involves dynamics of the
gaseous baryons as well as of the gravitationally dominant dark matter. We discuss an
implementation of a parallel, distributed memory, cosmological hydrodynamics code using MPI
message passing; it is based on the Total-Variation-Diminishing (Harten 1983) serial code of
Ryu et al. (1993). This parallel code follows the motion of gas and dark matter simultaneously,
combining a mesh based Eulerian hydrodynamics code and a Particle-Mesh N-body code. A
new, flexible matrix transpose algorithm is used to interchange distributed and local dimensions
of the mesh. Timing results from runs on an IBM SP2 supercomputer are given.
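The role of such a transpose can be shown with a hedged, message-passing-free sketch (the function name and two-process layout are ours): with rows of a global matrix distributed over P processes, each process p sends its q-th column block to process q, and each receiver locally transposes and concatenates the blocks it gets, leaving the transposed matrix row-distributed.

```python
# Hedged sketch of interchanging a distributed dimension with a local one
# via a block transpose; an MPI code would replace the msg list with an
# all-to-all exchange.

def transpose_distributed(parts, P):
    """parts[p] holds the rows owned by process p; returns the row-distributed
    transpose. Assumes the column count is divisible by P."""
    rows_per = len(parts[0])
    cols_per = len(parts[0][0]) // P
    # "Communication" phase: msg[p][q] is the block process p sends to q.
    msg = [[[row[q * cols_per:(q + 1) * cols_per] for row in parts[p]]
            for q in range(P)] for p in range(P)]
    out = []
    for q in range(P):                       # assembly on each receiver q
        local = [[] for _ in range(cols_per)]
        for p in range(P):
            block = msg[p][q]                # locally transpose each block
            for c in range(cols_per):
                local[c].extend(block[r][c] for r in range(rows_per))
        out.append(local)
    return out

A = [[4 * i + j for j in range(4)] for i in range(4)]   # global 4x4 matrix
parts = [A[:2], A[2:]]                                  # row-distributed, P = 2
transposed_parts = transpose_distributed(parts, 2)
```

The flexibility the paper claims comes from generalizing this exchange to arbitrary process counts and mesh dimensions.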
Transient dynamics simulations are commonly used to model phenomena such as car crashes,
underwater explosions, and the response of shipping containers to high-speed impacts. Physical
objects in such a simulation are typically represented by Lagrangian meshes because the
meshes can move and deform with the objects as they undergo stress. Fluids (gasoline, water) or
fluid-like materials (earth) in the simulation can be modeled using the techniques of smoothed
particle hydrodynamics. Implementing a hybrid mesh/particle model on a massively parallel
computer poses several difficult challenges. One challenge is to simultaneously parallelize and
load-balance both the mesh and particle portions of the computation. A second challenge is to
efficiently detect the contacts that occur within the deforming mesh and between mesh elements
and particles as the simulation proceeds. These contacts impart forces to the mesh elements and
particles which must be computed at each timestep to accurately capture the physics of interest.
In this paper we describe new parallel algorithms for smoothed particle hydrodynamics and
contact detection which turn out to have several key features in common. Additionally, we
describe how to join the new algorithms with traditional parallel finite element techniques to
create an integrated particle/mesh transient dynamics simulation. Our approach to this problem
differs from previous work in that we use three different parallel decompositions, a static one for
the finite element analysis and dynamic ones for particles and for contact detection. We have
implemented our ideas in a parallel version of the transient dynamics code PRONTO-3D and
present results for the code running on a large Intel Paragon.
The implementation and optimization of a production-mode Computational Fluid Dynamics (CFD)
code on NEC and Cray supercomputing platforms are discussed. It is intended to assess the
impact of different computer architectures and High Performance Computing approaches while
studying a problem of engineering and scientific importance. The computational model solves the
viscous compressible flow equations for high-speed wall-shear layer flows. The code is very
versatile and has performed extremely well in vector machines as well as on massively parallel
platforms. It is written in the Fortran programming language using recent
extensions such as HPF and Fortran 90. With this approach, an almost perfect level of
portability is achieved while the performance on a particular supercomputing platform does not
degrade.
Iterative schemes for solving sparse linear systems arising from elliptic PDEs are very suitable
for efficient implementation on large scale multiprocessors. However, these methods rely heavily
on effective preconditioners which must also be amenable to parallelization. In this paper, we
present a novel method to obtain a preconditioned linear system which is solved using an
iterative method. Each iteration comprises a matrix-vector product with k sparse matrices (k
< log n), and can be computed in O(n) operations where n is the number of unknowns. The
numerical convergence properties of our preconditioner are superior to the commonly used
incomplete factorization preconditioners. Moreover, unlike the incomplete factorization
preconditioners, our algorithm affords a higher degree of concurrency and does not require
triangular system solves, thereby achieving the dual objective of good preconditioning and
efficient parallel implementation. We describe our scheme for certain linear systems with
symmetric positive definite or symmetric indefinite matrices and present an efficient parallel
implementation along with an analysis of the parallel complexity. Results of the parallel
implementation of our algorithm will also be presented in the final version of this paper.
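The solver skeleton into which any such preconditioner plugs is standard preconditioned conjugate gradient. As a hedged stand-in for the paper's k-sparse-factor preconditioner (which we cannot reproduce from the abstract), the sketch below uses a simple Jacobi (diagonal) preconditioner; only the two lines applying M^{-1} would change.

```python
# Hedged sketch of preconditioned CG with a Jacobi preconditioner standing in
# for the paper's preconditioner; each iteration is dominated by mat-vecs.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(A, b, tol=1e-10, maxiter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                   # residual b - A x, with x = 0
    minv = [1.0 / A[i][i] for i in range(n)]   # Jacobi preconditioner M^{-1}
    z = [m * ri for m, ri in zip(minv, r)]     # apply preconditioner
    p = z[:]
    rz = dot(r, z)
    for _ in range(maxiter):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            break
        z = [m * ri for m, ri in zip(minv, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x
```

Note that, like the paper's scheme, this loop involves no triangular solves: every step is a product or a vector update, which is what makes it attractive for parallelization.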
Sparse LU factorization with partial pivoting is important to many scientific applications, but the
effective parallelization of this algorithm is still an open problem. The main difficulty is that partial
pivoting operations make structures of L and U factors unpredictable beforehand. This paper
presents a novel approach called S* for parallelizing this problem on distributed memory
machines. S* incorporates static symbolic factorization to avoid run-time control overhead and
uses nonsymmetric L/U supernode partitioning and amalgamation strategies to
maximize the use
of BLAS-3 routines. The irregular task parallelism embedded in sparse LU is
exploited using
graph scheduling and efficient run-time support techniques which optimize communication,
overlap computation with communication and balance processor loads. The experimental results
on the Cray-T3D with a set of Harwell-Boeing nonsymmetric matrices are very
encouraging
and good scalability has been achieved. Even compared to a highly optimized
sequential code,
the parallel speedups are still impressive considering the current status of sparse LU research.
In this paper we report on the development of an efficient and portable implementation of
Strassen's matrix multiplication algorithm for matrices of arbitrary size. Our technique for defining
the criterion which stops the recursions is more detailed than those generally used, thus allowing
enhanced performance for a larger set of input sizes. In addition, we deal with odd matrix
dimensions using a method whose usefulness had previously been in question and had not so far
been demonstrated. Our memory requirements have also been reduced, in certain cases by 40
to more than 70 percent over other similar implementations. We measure performance of our
code on the IBM RS/6000, CRAY YMP C90, and CRAY T3D single processor, and offer
comparisons to other codes. Finally, we demonstrate the usefulness of our implementation by
using it to perform the matrix multiplications in a large application code.
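The structure of such an implementation can be sketched as follows (a hedged illustration: the recursion cutoff and odd-dimension handling here are the simplest possible choices, simple zero-padding, whereas the paper's cutoff criterion and odd-size method are more refined).

```python
# Hedged sketch of Strassen's algorithm with a recursion cutoff and
# zero-padding for odd dimensions.

def add(A, B): return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
def sub(A, B): return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def naive(A, B):
    """Classical O(n^3) multiply, used below the recursion cutoff."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def strassen(A, B, cutoff=8):
    n = len(A)
    if n <= cutoff:
        return naive(A, B)
    if n % 2:  # pad odd sizes with a zero row and column, then strip them
        A = [row + [0] for row in A] + [[0] * (n + 1)]
        B = [row + [0] for row in B] + [[0] * (n + 1)]
        C = strassen(A, B, cutoff)
        return [row[:n] for row in C[:n]]
    h = n // 2
    A11 = [r[:h] for r in A[:h]]; A12 = [r[h:] for r in A[:h]]
    A21 = [r[:h] for r in A[h:]]; A22 = [r[h:] for r in A[h:]]
    B11 = [r[:h] for r in B[:h]]; B12 = [r[h:] for r in B[:h]]
    B21 = [r[:h] for r in B[h:]]; B22 = [r[h:] for r in B[h:]]
    # Seven recursive products instead of eight:
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

The cutoff matters because below some size the extra additions outweigh the saved multiplication; tuning exactly where to stop recursing is one of the contributions the abstract highlights.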
Dynamic mesh adaption on unstructured grids is a powerful tool for efficiently computing
unsteady problems to resolve solution features of interest. Unfortunately, this causes load
imbalance among processors on a parallel machine. This paper describes the parallel
implementation of a tetrahedral mesh adaption scheme and a new global load balancing method.
A heuristic remapping algorithm is presented that assigns partitions to processors such that the
redistribution cost is minimized. Results indicate that the parallel performance of the mesh
adaption code depends on the nature of the adaption region and show a 35.5X speedup on 64
processors when about 35% of the mesh is randomly adapted. For large-scale scientific
computations, our load balancing strategy gives almost a sixfold reduction in solver execution
times over non-balanced loads. Furthermore, our heuristic remapper yields processor
assignments that are less than 3% off the optimal solutions but requires only 1% of the
computational time.
The method of moments is an important tool for solving boundary integral equations arising in a
variety of applications. It transforms the physical problem into a dense linear system. Due to the
large number of variables and the associated computational requirements, these systems are
solved iteratively using methods such as GMRES, CG, and its variants. The core operation of
these iterative solvers is the application of the system matrix to a vector. This requires O(n^2)
operations and memory using accurate dense methods. The computational complexity can be
reduced to O(n log n) and the memory requirement to O(n) using hierarchical approximation
techniques. The algorithmic speedup from approximation can be combined with parallelism to
yield very fast dense solvers. In this paper, we present efficient parallel formulations of dense
iterative solvers based on hierarchical approximations for solving the integral form of the Laplace
equation. We study the impact of various parameters on the accuracy and performance of the
parallel solver. We present two preconditioning techniques for accelerating the convergence of
the iterative solver. These techniques are based on an inner-outer scheme and a block diagonal
scheme based on a truncated Green's function. We present detailed experimental results on up
to 256 processors of a Cray T3D.
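The reason hierarchical approximation composes so cleanly with these solvers is that an iterative method touches the system matrix only through matrix-vector products. A minimal sketch (not the paper's code) using conjugate gradients, where any approximate O(n log n) matvec could be substituted for the dense one:

```python
def cg(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A,
    given only the action x -> A x."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # residual b - A x0, with x0 = 0
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def dense_matvec(A):
    # exact O(n^2) product; a hierarchical (tree-code) approximation
    # of this single function is what yields the O(n log n) solver
    return lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A]
```

The solver never forms or stores the dense matrix, which is also where the O(n) memory bound comes from.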
In this paper we present a parallel formulation of a multilevel k-way graph partitioning algorithm.
The multilevel k-way partitioning algorithm reduces the size of the graph by collapsing vertices
and edges (coarsening phase), finds a k-way partition of the smaller graph, and then
constructs a k-way partition for the original graph by projecting and refining the partition through
successively finer graphs (uncoarsening phase). A key innovative feature of our parallel
formulation is that it utilizes graph coloring to effectively parallelize both the coarsening and the
refinement during the uncoarsening phase. Our algorithm is able to achieve a high degree of
concurrency while maintaining the high-quality partitions produced by the serial algorithm. We
test our scheme on a large number of graphs from finite element and transportation
domains. Our parallel formulation on the Cray T3D produces high-quality 128-way partitions on 128
processors in a little over two seconds for graphs with a million vertices. Thus our parallel
algorithm makes it possible to perform dynamic graph partitioning in adaptive computations without
compromising quality.
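One coarsening step of the kind described above can be sketched with a greedy heavy-edge matching: pair each vertex with its heaviest unmatched neighbor and collapse each pair into one coarse vertex. The data layout and names below are illustrative assumptions, not the paper's implementation.

```python
def heavy_edge_coarsen(adj):
    """adj: {u: {v: weight}} for an undirected weighted graph.
    Returns (coarse_adj, vertex_map) after one coarsening step."""
    matched = {}
    for u in adj:
        if u in matched:
            continue
        # match u with its heaviest currently-unmatched neighbor
        candidates = [(w, v) for v, w in adj[u].items() if v not in matched]
        if candidates:
            _, v = max(candidates)
            matched[u] = matched[v] = min(u, v)  # the pair collapses to one id
        else:
            matched[u] = u                       # u stays as a singleton
    # build the coarser graph, summing weights of parallel edges
    coarse = {}
    for u, nbrs in adj.items():
        cu = matched[u]
        coarse.setdefault(cu, {})
        for v, w in nbrs.items():
            cv = matched[v]
            if cu != cv:
                coarse[cu][cv] = coarse[cu].get(cv, 0) + w
    return coarse, matched
```

Repeating this step yields the sequence of successively smaller graphs; the k-way partition found on the smallest graph is then projected back up through the same vertex maps. The sequential greedy scan here is exactly the dependence that the paper's graph-coloring technique breaks for parallel execution.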
High Performance Fortran (HPF) does not allow efficient expression of mixed task/data-parallel
computations or the coupling of separately compiled data-parallel modules. In this paper, we
show how a coordination library implementing the Message Passing Interface (MPI) can be used
to represent these common parallel program structures. This library allows data-parallel tasks
to exchange distributed data structures using calls to simple communication functions. We
present microbenchmark results that characterize the performance of this library and that
quantify the impact of optimizations that allow reuse of communication schedules in common
situations. In addition, results from two-dimensional FFT, convolution, and multiblock programs
demonstrate that the HPF/MPI library can provide performance superior to that of pure HPF.
We conclude that this synergistic combination of two parallel programming standards represents
a useful approach to task parallelism in a data-parallel framework, increasing the range of
problems addressable in HPF without requiring complex compiler technology.
MPI is gaining acceptance as a standard for message-passing in high-performance computing,
due to its powerful and flexible support of various communication styles. However, the
complexity of its API imposes significant software overhead, and as a result, the applicability of
MPI has been restricted to rather regular, coarse-grained computations. Our OMPI (Optimizing
MPI) system removes much of the excess overhead by employing partial evaluation techniques,
which exploit static information in MPI calls. Because partial evaluation alone is insufficient, we
also utilize template functions for further optimization. To validate the effectiveness of our
OMPI system, we performed baseline as well as more extensive benchmarks on a set of
application cores with different communication characteristics on the 64-node Fujitsu AP1000
MPP. Benchmarks show that OMPI improves execution efficiency by as much as a factor of two
for communication-intensive application cores with minimal code increase. It also performs
significantly better than previous dynamic optimization techniques.
Particle-in-Cell (PIC) plasma simulation codes model the interaction of charged particles with
surrounding electrostatic and magnetic fields. Their computational requirements have caused PIC
simulation to be classified as one of the grand-challenge problems facing the high-performance
computing community. In this paper we present the implementation of 1-D and 2-D electrostatic
PIC codes in High Performance Fortran (HPF) on an IBM SP-2. HPF extends Fortran 90 with data
distribution and alignment directives and data-parallel statements, making it a powerful language
for writing portable, high-performance programs across many platforms. We used one of the most
successful commercial HPF compilers currently available and augmented the compiler's
missing HPF functions with extrinsic routines when necessary. We obtained near-linear speedup
in all of our test cases. The performance of the HPF programs is comparable to that of the native
message-passing implementations of the same codes on the SP-2.
Heterogeneous networks are increasingly being used as platforms for resource-intensive
distributed parallel applications. A critical contributor to the performance of such applications is
the scheduling of constituent application tasks on the network. Since often the distributed
resources cannot be brought under the control of a single global scheduler, the application must
be scheduled by the user. To obtain the best performance, the user must take into account both
application-specific and dynamic system information in developing a schedule which meets his or
her performance criteria. In this paper, we define a set of principles underlying application-level
scheduling and describe our work-in-progress building AppLeS (application-level scheduling)
agents. We illustrate the application-level scheduling approach with a detailed description and
results for a distributed 2D Jacobi application on two production heterogeneous platforms.
This paper presents a new system, called NetSolve, that allows users to access computational
resources, such as hardware and software, distributed across the network. The development of
NetSolve was motivated by the need for an easy-to-use, efficient mechanism for using
computational resources remotely. Ease of use is obtained as a result of different interfaces,
some of which require no programming effort from the user. Good performance
is ensured by a
load-balancing policy that enables NetSolve to use the computational resources available as
efficiently as possible. NetSolve offers the ability to look for computational resources on a
network, choose the best one available, solve a problem (with retry for fault tolerance), and
return the answer to the user.
Metacomputing systems use high-speed networks to connect supercomputers, mass storage
systems, scientific instruments, and display devices with the objective of enabling parallel
applications to utilize geographically distributed computing resources. However, experience
shows that high performance can often be achieved only if applications can integrate diverse
communication substrates, transport mechanisms, and protocols, chosen according to where
communication is directed, what is communicated, or when communication is performed. In this
paper, we describe a software architecture that addresses this requirement. This architecture
allows multiple communication methods to be supported transparently in a single application, with
either automatic or user-specified selection criteria guiding the methods used for each
communication. We describe an implementation of this architecture, based on the Nexus
communication library, and use this implementation to evaluate performance issues. This
implementation was used to support a wide variety of applications in the I-WAY metacomputing
experiment at Supercomputing~95; we use one of these applications to provide a quantitative
demonstration of the advantages of multimethod communication in a heterogeneous networked
environment.
In today's high performance computing arena, there is a strong trend toward
building virtual
computers from heterogeneous resources on a network. In this paper we describe our
experiences in building a world-wide virtual machine (WWVM) based on emerging Web and
existing HPCC technologies. We have constructed a Web-based parallel/distributed
programming environment on top of this machine demonstrating MPI and PVM message-passing
programs and High Performance Fortran programs. Alternatively, the WWVM can
be
configured as a metacomputer for the solution of metaproblems.
Data mining is an emerging research area, whose goal is to extract significant patterns or
interesting rules from large databases. High-level inference from large volumes of routine
business data can provide valuable information to businesses, such as customer buying patterns,
shelving criteria in supermarkets, and stock trends. Many algorithms have been proposed for
data mining of association rules. However, research so far has mainly focused on sequential
algorithms. In this paper we present parallel algorithms for data mining of association rules, and
study the degree of parallelism, synchronization, and data locality issues on the SGI Power
Challenge shared-memory multiprocessor. We further present a set of optimizations for the
sequential and parallel algorithms. Experiments show that a significant improvement of
performance is achieved using our proposed optimizations. We also achieved good speedup
for the parallel algorithm, but we observe a need for parallel I/O techniques for further
performance gains.
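The core counting step in association-rule mining can be illustrated with a small sketch (not the paper's algorithm): find every item pair whose support in the transaction database meets a minimum threshold. A parallel version would partition the transactions, or the candidate pairs, across processors and merge the counts.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    """Count co-occurring item pairs and keep those whose
    support (absolute count) reaches min_support."""
    counts = Counter()
    for t in transactions:
        # deduplicate and sort so each pair has one canonical form
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= min_support}
```

With transactions `[["bread","milk"], ["bread","butter","milk"], ["beer","bread"], ["milk","butter"]]` and `min_support=2`, only the pairs (bread, milk) and (butter, milk) survive. The counting loop is data-parallel over transactions, but the shared counter is where the synchronization and data-locality issues studied in the paper arise.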
We describe dynamic computation migration, the runtime choice between computation and data
migration. Dynamic computation migration is useful for concurrent data structures with
unpredictable read/write patterns. We implemented it in MCRL, a multithreaded DSM system
that runs on the MIT Alewife machine and Thinking Machines' CM-5. We evaluate two dynamic
migration heuristics relative to data migration. On a concurrent, distributed B-tree with 50%
lookups and 50% inserts, the STATIC heuristic improves performance by about 17%, on both
Alewife and the CM-5. The REPEAT heuristic generally performs better than the STATIC
heuristic. On Alewife, with 80% lookups and 20% inserts, the REPEAT heuristic improves
performance by 23%; on the CM-5, it improves performance by 46%. Our results apply to
concurrent, dynamic data structures whose access patterns are only known at runtime. For
regularly accessed data structures, static methods will always be applicable, but we expect
future applications to be more dynamic.
We present an analytical performance model for Panda, a library for synchronized I/O of large
multidimensional arrays on parallel and sequential platforms, and show how the Panda
developers use this model to evaluate Panda's parallel I/O performance and guide future Panda
development. The model validation shows that, with the model, system developers can perform
performance analysis, identify potential performance bottlenecks, and study the design trade-offs
for Panda on massively parallel platforms more easily than by conducting empirical experiments.
More importantly, we show that the outputs of the performance model can be used to help make
optimal plans for handling application I/O requests, the first step toward our long-term goal of
automatically optimizing I/O request handling in Panda.
There is a growing demand for reliability beyond what current RAID can provide, and there
are various levels of user demand for data reliability. An efficient data placement scheme called
RM2 has been proposed in \cite{Park95}, which makes a disk array system tolerant of
double disk failures. In this paper, we consider how to choose an optimal striping unit for RM2,
particularly when no workload information is available except the read/write ratio. A disk array
simulator for RM2 has been developed for experimental work. It is shown that RM2 has an
optimal striping unit of two and a half tracks in the case of disk read operations, and one third of
a single track if any disk write operations are involved.
The NWT was upgraded to a theoretical peak performance of 280 GFLOPS
with the addition of 26 PEs to the original 140 PEs. On a CFD
simulation of a jet engine compressor, we attained a sustained
performance of 111 GFLOPS using 160 PEs.
We report on recent N-body simulations of galaxy formation performed on the
GRAPE-4 (GRAvity PipE 4) system, a special-purpose computer for astrophysical N-body
simulations. We review the astrophysical motivation, the algorithm, the actual performance,
and the price per performance. The performance obtained is 332 Gflops averaged over 185
hours for a simulation of a galaxy formation with 786,400 particles. The price per
performance obtained is 4,600 dollars per Gflops. The configuration used for the simulation
consists of 1,269 pipeline processors and has a peak speed of 663 Gflops.
We have developed a highly efficient electronic structure code for parallel
computers using message passing. The algorithm takes advantage of the natural parallelism
in quantum chemistry problems to obtain very high performance even on a large number of
processors. Most of the terms which scale cubically with respect to the number of atoms
have been eliminated allowing the treatment of very large systems. It uses one of the most
precise versions of Density Functional Theory, namely Self-Interaction Corrected Density
Functional Theory. On a 6 processor Silicon Graphics Symmetric Multiprocessor based on
the MIPS R8000 microprocessor, we obtain a performance of 6.3 Gflops per million dollar.
We present a computational steering approach for controlling, analyzing, and visualizing very
large scale molecular dynamics simulations involving tens to hundreds of millions of atoms. Our
approach relies on extensible scripting languages and an easy to use tool for building extensions
and modules. The system is extremely easy to modify, works with existing C code, is memory
efficient, and can be used from inexpensive workstations and networks. We demonstrate how
we have used this system to manipulate data from production MD simulations involving as many
as 104 million atoms running on the CM-5 and Cray T3D. We also show how this approach can
be used to build systems that integrate common scripting languages (including Tcl/Tk, Perl, and
Python), simulation code, user extensions, and commercial data analysis packages.
The Discrete Element Model (DEM) is an alternative to the classical continuum mechanics
approach for problems with flow-like deformations of solids. DEM is especially well-suited for
problems such as soil plowing where the media is subject to large discontinuous deformations.
The goal of this project was to develop a large-scale three dimensional DEM soil modeling
system capable of handling problems with potentially several million particles. This
paper
discusses the development of that model and the factors driving it toward a High Performance
Computing (HPC) solution. It also discusses how work in progress is making use of
heterogeneous HPC environments and high speed networks to produce a real-time user
interaction and visualization capability.
Advances in parallel supercomputing now make possible molecular-based engineering and
science calculations that will soon revolutionize many technologies, such as those involving
polymers and those involving aqueous electrolytes. We have developed a suite of
message-passing codes for classical molecular simulation of such complex fluids and amorphous
materials and have completed a number of demonstration calculations of problems of scientific
and technological importance with each (described at the World Wide Web site
http://flory.engr.utk.edu/ldrd). In this paper, we will focus on the molecular simulation of
rheological properties, particularly viscosity, of simple and complex fluids using parallel
implementations of non-equilibrium molecular dynamics. Such calculations represent significant
challenges computationally because, in order to reduce the thermal noise in the calculated
properties within acceptable limits, large systems and/or long simulated times are required.
Scientific applications often require some strategy for temporary data storage to do the largest
possible simulations. The use of virtual memory for temporary data storage has received
criticism because of performance problems. However, the modern virtual memory systems found
in recent operating systems such as that of the Cenju-3/DE give application writers control over
virtual memory policies. We demonstrate that custom virtual memory policies can dramatically
reduce virtual memory overhead and allow applications to run out-of-core efficiently. We also
demonstrate
that the main advantage of virtual memory, namely programming simplicity, is not lost.
This paper addresses job scheduling for parallel supercomputers. Modern parallel systems with
n nodes can be used by jobs requesting up to n nodes. If fewer than n nodes are requested,
multiple jobs can be run at the same time, allowing several users to share the system. One of the
challenges for the operating system is to give reasonable service to a diverse group of
requests. A single 1-node job that runs for a long time may effectively block the whole
machine if the next job requests all n nodes. To date, various policies have been proposed for
scheduling highly parallel computers, but as the users of current supercomputers know,
these policies work far from perfectly. This paper reports on measurements of the usage of a
96-node Intel Paragon at ETH Zurich, a 512 node IBM SP2 at Cornell Theory Center, and a
512 node Cray T3D at Pittsburgh Supercomputing Center. We discuss the common
characteristics of the different workloads and identify their impact on job scheduling techniques
for such parallel systems. The metrics used for evaluating scheduling are based on turnaround
time and fairness among jobs. We specifically show how two simple scheduling optimizations
based on reordering the waiting queue can be used to effectively improve scheduling
performance on real workloads. An important contribution of this paper is to establish that
supercomputer workloads do exhibit some common characteristics but they also differ in
important ways, and the knowledge of workloads is important for design of effective scheduling
algorithms. Given the current ad-hoc approach applied to tuning the scheduler systems, these
results are of interest to scheduling researchers, supercomputer installations, and developers of
scheduling software.
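One of the simple queue-reordering optimizations of the kind studied above can be sketched as follows: instead of strict first-come-first-served dispatch, scan the waiting queue and start every job that fits in the currently free nodes, so a wide job at the head cannot idle the whole machine. The job representation and policy details here are illustrative assumptions, not the paper's exact scheme.

```python
def schedule_first_fit(waiting, free_nodes):
    """waiting: list of (job_id, nodes_requested) in arrival order.
    Returns (started_job_ids, remaining_queue, free_nodes_left)."""
    started, remaining = [], []
    for job_id, requested in waiting:
        if requested <= free_nodes:
            started.append(job_id)       # job fits: dispatch it now
            free_nodes -= requested
        else:
            remaining.append((job_id, requested))  # keep waiting in order
    return started, remaining, free_nodes
```

For example, with 32 free nodes and a queue of a 64-node job followed by an 8-node and a 16-node job, strict FCFS starts nothing, while this reordering starts the two small jobs. Fairness then hinges on ensuring the bypassed wide job is not starved indefinitely, which is the kind of trade-off the turnaround-time and fairness metrics above are meant to expose.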
Biology Applications
Parallel Hierarchical Molecular Structure Estimation
Cheng Che Chen
Jaswinder Pal Singh
Russ B. Altman
A Data-Parallel Implementation of O(N) Hierarchical N-body Methods
Yu Hu
S. Lennart Johnsson
The Design of a Portable Scientific Tool: A Case Study Using SnB
Steven M. Gallo
Russ Miller
Charles M. Weeks
Performance I
Runtime Performance of Parallel Array Assignment: An Empirical Study
Lei Wang
James M. Stichnoth
Siddhartha Chatterjee
ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance
Laura Susan Blackford
J. Choi
A. Cleary
A. Petitet
R. C. Whaley
J. Demmel
I. Dhillon
K. Stanley
J. Dongarra
S. Hammarling
G. Henry
D. Walker
Network Performance Modeling for PVM Clusters
Mark J. Clement
Michael R. Steed
Phyllis E. Crandall
Visualization & Education
Scalable Parallel Algorithms For Interactive Visualization of Curved Surfaces
Subodh Kumar
Chun-Fa Chang
Dinesh Manocha
STREN: A Highly Scalable Parallel Stereo Terrain Renderer for Planetary Mission Simulations
Ansel Teng
Scott Whitman
Meemong Lee
Education in High Performance Computing via the WWW: Designing and Using Technical Materials Effectively
Susan Mehringer
Compiler Analysis
Compiler-directed Shared-memory Communication for Iterative Parallel Applications
Guhan Viswanathan
James R. Larus
Dynamic Data Distribution with Control Flow Analysis
Jordi Garcia
Eduard Ayguade
Jesus Labarta
Transformations for Imperfectly Nested Loops
Induprakas Kodukula
Keshav Pingali
Geophysical Applications
Earthquake Ground Motion Modeling on Parallel Computers
Hesheng Bao
Jacobo Bielak
Omar Ghattas
Loukas F. Kallivokas
David R. O'Hallaron
Jonathan R. Shewchuk
Jifeng Xu
Performance Analysis and Optimization on the UCLA Parallel Atmospheric General Circulation Model Code
John Lou
John Farrara
Climate Data Assimilation on a Massively Parallel Supercomputer
Hong Q. Ding
Robert D. Ferraro
Tools
Performance Analysis Using the MIPS R10000 Performance Counters
Marco Zagha
Brond Larson
Steve Turner
Marty Itzkowitz
Profiling a Parallel Language Based on Fine-Grained Communication
Bjoern Haake
Klaus E. Schauser
Chris Scheiman
Modeling, Evaluation, and Testing of Paradyn Instrumentation System
Abdul Waheed
Diane T. Rover
Jeffrey K. Hollingsworth
Performance II
An Analytical Model of the HINT Performance Metric
Quinn O. Snell
John L. Gustafson
Communication Patterns and Models in Prism: A Spectral Element-Fourier Parallel Navier-Stokes Solver
Constantinos Evangelinos
George Em Karniadakis
The C3I Parallel Benchmark Suite - Introduction and Preliminary Results
Rakesh Jha
Richard C. Metzger
Brian VanVoorst
Luiz S. Pires
Wing Au
Minesh Amin
David A. Castanon
Vipin Kumar
Architecture and Application: The Performance of the NEC SX-4 on the NCAR Benchmark Suite
Steven W. Hammond
Richard D. Loft
Philip D. Tannenbaum
Networking & Architecture
Minimal Adaptive Routing with Limited Injection on Toroidal k-ary n-cubes
Fabrizio Petrini
Marco Vanneschi
Low-Latency Communication on the IBM RISC System/6000 SP
Chi-Chao Chang
Grzegorz Czajkowski
Chris Hawblitzel
Thorsten von Eicken
Compiled Communication for All-optical TDM Networks
Xin Yuan
R. Melhem
R. Gupta
Increasing the Effective Bandwidth of Complex Memory Systems in Multivector Processors
Anna M. del Corral
Jose M. Llaberia
Hydrodynamics Applications
A Parallel Cosmological Hydrodynamics Code
Paul W. Bode
Guohong Xu
Renyue Cen
Transient Dynamics Simulations: Parallel Algorithms for Contact Detection and Smoothed Particle Hydrodynamics
Steve Plimpton
Bruce Hendrickson
Steve Attaway
Jeff Swegle
Courtenay Vaughan
Dave Gardner
Performance of a Computational Fluid Dynamics Code on NEC and Cray Supercomputers: Beyond 10 Gigaflops
Ferhat F. Hatay
Algorithms I
Parallel Preconditioners for Elliptic PDEs
Vivek Sarin
Ahmed Sameh
Sparse LU Factorization with Partial Pivoting on Distributed Memory Machines
Cong Fu
Tao Yang
Implementation of Strassen's Algorithm for Matrix Multiplication
Steven Huss-Lederman
Elaine M. Jacobson
Jeremy R. Johnson
Anna Tsao
Thomas Turnbull
Algorithms II
Global Load Balancing with Parallel Mesh Adaption on Distributed-Memory Systems
Rupak Biswas
Leonid Oliker
Andrew Sohn
Parallel Hierarchical Solvers and Preconditioners for Boundary Element Methods
Ananth Grama
Vipin Kumar
Ahmed Sameh
Parallel Multilevel k-way Partitioning Scheme for Irregular Graphs
George Karypis
Vipin Kumar
Parallel Programming Support
Double Standards: Bringing Task Parallelism to HPF Via the Message Passing Interface
Ian Foster
David R. Kohr, Jr.
Rakesh Krishnaiyer
Alok Choudhary
OMPI: Optimizing MPI Programs Using Partial Evaluation
Hirotaka Ogawa
Satoshi Matsuoka
Particle-in-Cell Simulation Codes in High Performance Fortran
Erol Akarsu
Kivanc Dincer
Tomasz Haupt
Geoffrey C. Fox
Scheduling
Application-Level Scheduling on Distributed Heterogeneous Networks
Francine D. Berman
Rich Wolski
Silvia Figueira
Jennifer Schopf
Gary Shao
NetSolve: A Network Server for Solving Computational Science Problems
Henri Casanova
Jack Dongarra
Multimethod Communication for High-performance Metacomputing Applications
Ian Foster
Jonathan Geisler
Carl Kesselman
Steven Tuecke
Building a World-Wide Virtual Machine Based on Web and HPCC Technologies
Kivanc Dincer
Geoffrey C. Fox
Data Mining & Modeling
Parallel Data Mining for Association Rules on Shared-memory Multi-processors
M. J. Zaki
M. Ogihara
S. Parthasarathy
W. Li
Dynamic Computation Migration in DSM Systems
Wilson C. Hsieh
M. Frans Kaashoek
William E. Weihl
Performance Modeling for the Panda Array I/O Library
Ying Chen
Marianne Winslett
Szu-wen Kuo
Yong Cho
Mahesh Subramaniam
Kent Seamons
Striping in Disk Array RM2 Enabling the Tolerance of Double Disk Failures
Chan-Ik Park
Tae-Young Choe
Gordon Bell Finalists
Simulation of the 3 Dimensional Cascade Flow with Numerical Wind Tunnel (NWT)
Takashi Nakamura
Toshiyuki Iwamiya
Masahiro Yoshida
Yuichi Matsuo
Masahiro Fukuda
N-body Simulation of Galaxy Formation on GRAPE-4 Special-Purpose Computer
Toshiyuki Fukushige
Junichiro Makino
Electronic Structure of Materials Using Self-Interaction Corrected Density Functional Theory
Adolfy Hoisie
Stefan Goedecker
Jurg Hutter
Particle Dynamics
Lightweight Computational Steering of Very Large Scale Molecular Dynamics Simulations
David M. Beazley
Peter S. Lomdahl
Design of a Large Scale Discrete Element Soil Model for High Performance Computing Systems
Alex R. Carrillo
David A. Horner
John F. Peters
John E. West
Molecular Simulation of Rheological Properties Using Massively Parallel Supercomputers
R. K. Bhupathiraju
S. T. Cui
S. Gupta
H. D. Cochran
P. T. Cummings
Data & Scheduling
Virtual Memory Versus File Interfaces for Large, Memory-intensive Scientific Applications
Yoonho Park
Ridgway Scott
Stuart Sechrest
Impact of Job Mix on Optimizations for Space Sharing Schedulers
Jaspal Subhlok
Thomas Gross
Takashi Suzuoka