Abstracts

 

Biology Applications

Parallel Hierarchical Molecular Structure Estimation

Cheng Che Chen
Jaswinder Pal Singh
Russ B. Altman

Determining the structure of biological macromolecules such as proteins and nucleic acids is an important element of molecular biology because of the intimate relation between form and function of these molecules. Individual sources of data about molecular structure are subject to varying degrees of uncertainty. Previously we have examined the parallelization of a probabilistic algorithm for combining multiple sources of uncertain data to estimate the three-dimensional structure of molecules and also predict a measure of the uncertainty in the estimated structure. In this paper we extend our work on two major fronts. First we present a hierarchical decomposition of the original algorithm which reduces the sequential computational complexity tremendously. The hierarchical decomposition in turn reveals a new axis of parallelism not present in the "flat" organization of the problem, as well as new parallelization problems. We demonstrate good speedups on two cache-coherent shared-memory multiprocessors, the Stanford DASH and the SGI Challenge, with distributed and centralized memory organization, respectively. Our results point to several areas of further study to make both the hierarchical and the parallel aspects more flexible for general problems: automatic structure decomposition, processor load balancing across the hierarchy, and data locality management in conjunction with load balancing. Finally we outline the directions we are investigating to incorporate these extensions.

A Data-Parallel Implementation of O(N) Hierarchical N-body Methods

Yu Hu
S. Lennart Johnsson

The O(N) hierarchical N-body algorithms and Massively Parallel Processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We present a data-parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10-25%, and the overall efficiency is about 35%. The evaluation of the potential field of a system of 100 million particles takes 3 minutes and 15 minutes on a 256 node CM-5E, giving expected four and seven digits of accuracy, respectively. The speed of the code scales linearly with the number of processors and number of particles.

The Design of a Portable Scientific Tool: A Case Study Using SnB

Steven M. Gallo
Russ Miller
Charles M. Weeks

Developing and maintaining a large software package is a complex task. Decisions are made early in the design process that affect i) the ability of a user to effectively exploit the package and ii) the ability of a software engineer to maintain it. This case study discusses issues in software development and maintainability of a scientific package called SnB, which is used to determine molecular crystal structures. The design of the user interface is discussed along with important software engineering concepts, including modular programming, data encapsulation, and internal code documentation. Fortran is a language that is still widely used in the scientific community. Issues concerning the integration of Fortran into a modern scientific application with a C-based user interface are also discussed. Scientific applications benefit from being available on a wide variety of platforms. Due to demand, SnB is available on a variety of sequential and parallel platforms. Methods used in the design of SnB for such portability are presented including POSIX compliance, automatic configuration scripts, and parallel programming techniques.

Performance I

Runtime Performance of Parallel Array Assignment: An Empirical Study

Lei Wang
James M. Stichnoth
Siddhartha Chatterjee

Compiling the array assignment statement of High Performance Fortran in the presence of block-cyclic distributions of data arrays is considered difficult, and several algorithms have been published to solve this problem. We present a comprehensive study of the performance of these algorithms. We classify the algorithms into several families, identify the issues of interest in the compilation process, and present experimental performance data for the various algorithms. We demonstrate that block-cyclic distributions can be compiled almost as efficiently as block and cyclic distributions.
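
For readers unfamiliar with the index arithmetic that makes this compilation problem hard, the following sketch (ours, not from the paper; the names and the brute-force inner test are purely illustrative) enumerates the elements of a CYCLIC(b)-distributed array that one processor owns and that fall inside an assignment's lo:hi:stride triplet. A real compiler generates closed-form local loop bounds instead of the inner test, which is exactly what the algorithms studied here compute.

    /* Enumerate, for processor `me` of p, the global and local indices of an
     * n-element array distributed CYCLIC(b) that lie in lo:hi:stride. */
    #include <stdio.h>

    void visit_owned(int n, int b, int p, int me, int lo, int hi, int stride)
    {
        /* blocks owned by `me` start at me*b, (me+p)*b, (me+2p)*b, ... */
        for (int blk = me * b; blk < n; blk += p * b) {
            int blk_end = blk + b < n ? blk + b : n;
            for (int g = blk; g < blk_end; g++) {            /* global index */
                if (g >= lo && g <= hi && (g - lo) % stride == 0) {
                    int local = (blk / (p * b)) * b + (g - blk);  /* local offset */
                    printf("proc %d owns global %d at local %d\n", me, g, local);
                }
            }
        }
    }

    int main(void)
    {
        /* e.g. A(3:90:4) over 4 processors with block size 5, seen by processor 2 */
        visit_owned(100, 5, 4, 2, 3, 90, 4);
        return 0;
    }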

ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance

Laura Susan Blackford
J. Choi
A. Cleary
A. Petitet
R. C. Whaley
J. Demmel
I. Dhillon
K. Stanley
J. Dongarra
S. Hammarling
G. Henry
D. Walker

This paper outlines the content and performance of ScaLAPACK, a collection of mathematical software for linear algebra computations on distributed memory computers. The importance of developing standards for computational and message passing interfaces is discussed. We present the different components and building blocks of ScaLAPACK, and indicate the difficulties inherent in producing correct codes for networks of heterogeneous processors. Finally, this paper briefly describes future directions for the ScaLAPACK library and concludes by suggesting alternative approaches to mathematical libraries, explaining how ScaLAPACK could be integrated into efficient and user-friendly distributed systems.

Network Performance Modeling for PVM Clusters

Mark J. Clement
Michael R. Steed
Phyllis E. Crandall

The advantages of workstation clusters as a parallel computing platform include a superior price-performance ratio, availability, scalability, and ease of incremental growth. However, the performance of traditional LAN technologies such as Ethernet and FDDI rings is insufficient for many parallel applications. This paper describes APACHE (Automated Pvm Application CHaracterization Environment), an automated analysis system that uses an application-independent model for predicting the impact of ATM on the execution time of iterative parallel applications. APACHE has been used to predict the performance of several core applications that form the basis for many real scientific and engineering problems. We present a comparison of the performance predicted by APACHE with observed execution times to demonstrate the accuracy of our model. Finally, we present a method for a simple cost-benefit analysis that can be used to determine whether an investment in ATM equipment is justified for a particular workstation cluster environment.

Visualization & Education

Scalable Parallel Algorithms For Interactive Visualization of Curved Surfaces

Subodh Kumar
Chun-Fa Chang
Dinesh Manocha

We present efficient parallel algorithms for interactive display of higher order surfaces on current graphics systems. At each frame, these algorithms approximate the surface by polygons and rasterize them over the graphics pipeline. The time for polygon generation for each surface primitive varies between successive frames, and we address issues in distributing the load across processors for different environments. This includes algorithms to statically distribute the primitives to reduce dynamic load imbalance, as well as a distributed wait-free algorithm for machines on which re-distribution is efficient, e.g. shared-memory machines. These algorithms have been implemented on different graphics systems and applied to interactive display of trimmed spline models. In practice, we are able to obtain almost linear speed-ups (as a function of the number of processors). Moreover, the distributed wait-free algorithm is 25-30% faster than the static and dynamic schemes.

STREN: A Highly Scalable Parallel Stereo Terrain Renderer for Planetary Mission Simulations

Ansel Teng
Scott Whitman
Meemong Lee

In this paper, we describe STREN, a parallel stereo renderer for fixed-location terrain rendering tasks required for the simulation of planetary exploration missions. The renderer is based on a novel spatial data representation, called the TANPO map. This data representation stores terrain data using a simple and compact structure and provides excellent locality for such rendering applications. Experimental results show that the renderer not only performs very well, but also scales perfectly to different numbers of processors.

Education in High Performance Computing via the WWW: Designing and Using Technical Materials Effectively

Susan Mehringer

Cornell Theory Center, a national center for high performance computing, has been designing and delivering education programs on parallel processing in traditional workshops for years. With the advent and growth of the World Wide Web, we have been able to expand our training efforts to a distance education format, including online lectures and exercises, communication with CTC consultants and other participants, and logins on CTC's world-class IBM RS/6000 SP. This description includes workshop design, technical content covered, design of the modules, and participants' response.

Compiler Analysis

Compiler-directed Shared-memory Communication for Iterative Parallel Applications

Guhan Viswanathan
James R. Larus

Many scientific applications are iterative and specify repetitive communication patterns. This paper shows how a parallel-language compiler and custom cache-coherence protocols in a distributed shared memory system together can implement shared-memory communication efficiently for applications with unpredictable but repetitive communication patterns. The compiler uses data-flow analysis to identify program points where repetitive communication occurs. At runtime, the custom protocol builds a communication schedule in one iteration and uses it to pre-send data in following iterations. This paper contains measurements on three iterative applications (including adaptive programs with unstructured data accesses) to show that custom protocols increase the number of shared-data requests satisfied locally, thus reducing the amount of time spent waiting for remote data.
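
The record-and-replay idea can be pictured with a small sketch. The code below is ours, with hypothetical names (push_block, on_remote_request) and a stub transport; it is only a toy stand-in for the paper's custom protocol. During the schedule-building iteration, the home node of a shared block records which nodes requested it; at later iteration boundaries it pushes the updated block to those nodes before they ask, so their reads hit locally.

    #include <stdio.h>

    #define MAX_REQ 64

    typedef struct {
        int requesters[MAX_REQ];    /* nodes that fetched this block last time */
        int nreq;
        int recording;              /* nonzero during the schedule-building pass */
    } block_sched_t;

    static void push_block(int node, const void *data, int nbytes)  /* stub send */
    {
        printf("pre-sending %d bytes to node %d\n", nbytes, node);
        (void)data;
    }

    /* called when a remote node misses on this block */
    void on_remote_request(block_sched_t *s, int node)
    {
        if (s->recording && s->nreq < MAX_REQ)
            s->requesters[s->nreq++] = node;        /* remember the consumer */
        /* ...serve the request as the base protocol normally would... */
    }

    /* called at the iteration boundary identified by the compiler */
    void on_iteration_end(block_sched_t *s, const void *data, int nbytes)
    {
        s->recording = 0;
        for (int i = 0; i < s->nreq; i++)           /* replay the schedule */
            push_block(s->requesters[i], data, nbytes);
    }

    int main(void)
    {
        block_sched_t s = { .nreq = 0, .recording = 1 };
        double block[8] = { 0 };
        on_remote_request(&s, 3);                   /* iteration 0: record misses */
        on_remote_request(&s, 5);
        on_iteration_end(&s, block, sizeof block);  /* later: pre-send to 3 and 5 */
        return 0;
    }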

Dynamic Data Distribution with Control Flow Analysis

Jordi Garcia
Eduard Ayguade
Jesus Labarta

This paper describes the design of a data distribution tool which automatically derives the data mapping for the arrays and the parallelization strategy for the loops in a Fortran 77 program. The layout generated can be static or dynamic, and the distribution is one-dimensional BLOCK or CYCLIC. The tool takes into account the control flow statements in the code in order to better estimate the behavior of the program. All the information regarding data movement and parallelism is contained in a single data structure named the Communication-Parallelism Graph (CPG). The CPG is used to model a minimal path problem in which time is the objective function to minimize. It is solved using a general purpose linear programming solver, which finds the optimal solution for the whole problem. The experimental results illustrate the quality of the solutions generated and the feasibility of the approach in terms of compilation time.

Transformations for Imperfectly Nested Loops

Induprakas Kodukula
Keshav Pingali

Loop transformations are critical for compiling high-performance code for modern computers. Existing work has focused on transformations for perfectly nested loops (that is, loops in which all assignment statements are contained within the innermost loop of a loop nest). In practice, most loop nests, such as those in matrix factorization codes, are imperfectly nested. In some programs, imperfectly nested loops can be transformed into perfectly nested loops by loop distribution, but this is not always legal. In this paper, we present an approach to transforming imperfectly nested loops directly. Our approach is an extension of the linear loop transformation framework for perfectly nested loops, and it models permutation, reversal, skewing, scaling, alignment, distribution and jamming. We also give a completion procedure which generates a complete transformation from a partial transformation.
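
A minimal example of what "imperfectly nested" means in a factorization code (our illustration, not taken from the paper): in right-looking LU without pivoting, the column-scaling statement sits between the outer loop and the inner update nest, and distributing the outer loop to form perfect nests is illegal because of a cross-iteration dependence between the two statements.

    /* right-looking LU factorization without pivoting (C99) */
    void lu_nopivot(int n, double a[n][n])
    {
        for (int k = 0; k < n; k++) {
            for (int i = k + 1; i < n; i++)
                a[i][k] /= a[k][k];                  /* S1: scale the pivot column */
            for (int i = k + 1; i < n; i++)          /* S2: rank-1 update of the   */
                for (int j = k + 1; j < n; j++)      /*     trailing submatrix     */
                    a[i][j] -= a[i][k] * a[k][j];
        }
        /* Distributing the k loop (all S1 iterations, then all S2 iterations)
         * would yield two perfect nests, but is illegal here: S1 at step k+1
         * reads entries that S2 writes at step k. Transforming the imperfect
         * nest directly, as the paper proposes, sidesteps this limitation. */
    }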

Geophysical Applications

Earthquake Ground Motion Modeling on Parallel Computers

Hesheng Bao
Jacobo Bielak
Omar Ghattas
Loukas F. Kallivokas
David R. O'Hallaron
Jonathan R. Shewchuk
Jifeng Xu

We describe the design and discuss the performance of a parallel elastic wave propagation simulator that is being used to model and study earthquake-induced ground motion in large sedimentary basins. The components of the system include mesh generators, a mesh partitioner, a parceler, and a parallel code generator, as well as parallel numerical methods for applying seismic forces, incorporating absorbing boundaries, and solving the discretized wave propagation problem. We discuss performance on the Cray T3D for unstructured mesh wave propagation problems of up to 14 million tetrahedra. By paying careful attention to each step of the process, we obtain excellent performance despite the highly irregular structure of the coefficient matrices of the problem. The mesh generator, partitioner, parceler, and code generator have been incorporated into an integrated toolset/compiler. This system, called Archimedes, automates the solution of unstructured mesh PDE problems on parallel computers, and is being used for other unstructured mesh applications beyond ground motion modeling.

Performance Analysis and Optimization on the UCLA Parallel Atmospheric General Circulation Model Code

John Lou
John Farrara

An analysis is presented of several factors influencing the performance of a parallel implementation of the UCLA atmospheric general circulation model (AGCM) on massively parallel computer systems. Several modifications to the parallel AGCM code aimed at improving its numerical efficiency, interprocessor communication cost, load balance and cache efficiency are discussed. The impact of some of the optimization strategies on the performance of the AGCM code as implemented on several state-of-the-art parallel computers, including the Intel Paragon, Cray T3D and IBM SP2, is presented and analyzed.

Climate Data Assimilation on a Massively Parallel Supercomputer

Hong Q. Ding
Robert D. Ferraro

We have designed and implemented a set of highly efficient and highly scalable algorithms for an unstructured computational package, the PSAS data assimilation package, as demonstrated by detailed performance analysis of systematic runs on up to 512 nodes of an Intel Paragon. The preconditioned Conjugate Gradient solver achieves a sustained 18 Gflops performance. Consequently, we achieve an unprecedented 100-fold reduction in time to solution on the Intel Paragon over a single head of a Cray C90. This not only exceeds the daily performance requirement of the Data Assimilation Office at NASA's Goddard Space Flight Center, but also makes it possible to explore much larger and more challenging data assimilation problems which are unthinkable on a traditional computer platform such as the Cray C90.

Tools

Performance Analysis Using the MIPS R10000 Performance Counters

Marco Zagha
Brond Larson
Steve Turner
Marty Itzkowitz

Tuning supercomputer application performance often requires analyzing the interaction of the application and the underlying architecture. In this paper, we describe support in the MIPS R10000 for non-intrusively monitoring a variety of processor events -- support that is particularly useful for characterizing the dynamic behavior of multi-level memory hierarchies, hardware-based cache coherence, and speculative execution. We first explain how performance data is collected using an integrated set of hardware mechanisms, operating system abstractions, and performance tools. We then describe several examples drawn from scientific applications that illustrate how the counters and profiling tools provide information that helps developers analyze and tune applications.

Profiling a Parallel Language Based on Fine-Grained Communication

Bjoern Haake
Klaus E. Schauser
Chris Scheiman

Fine tuning the performance of large parallel programs is a very difficult task. A profiling tool can provide detailed insight into the utilization and communication of the different processors, which helps identify performance bottlenecks. In this paper we present a profiler for the fine-grained parallel programming language Split-C, which provides a simple global address space memory model. As our experience shows, it is much more challenging to profile programs that make use of efficient, low-overhead communication. We incorporated techniques which minimize profiling effects on the running program. We quantify the profiling overhead and present several Split-C applications which show that the profiler is useful in determining performance bottlenecks.

Modeling, Evaluation, and Testing of Paradyn Instrumentation System

Abdul Waheed
Diane T. Rover
Jeffrey K. Hollingsworth

This paper presents a case study of modeling, evaluating, and testing the data collection services (called an instrumentation system) of the Paradyn parallel performance measurement tool using well-known performance evaluation and experiment design techniques. The overall objective of the study is to use modeling- and simulation-based evaluation to provide feedback to the tool developers to help them choose system configurations and task scheduling policies that can significantly reduce the data collection overheads. We develop and parameterize a resource occupancy (ROCC) model for the Paradyn instrumentation system (IS) on an IBM SP-2 platform. This model is parameterized with a measurement-based workload characterization and subsequently used to answer several "what if" questions regarding configuration options and two policies to schedule instrumentation system tasks: collect-and-forward (CF) and batch-and-forward (BF) policies. Simulation results indicate that the BF policy can significantly reduce the overheads. Based on this feedback, the BF policy was implemented in the Paradyn IS as an option to manage the data collection. Measurement-based testing results obtained from this enhanced version of the Paradyn IS are reported in this paper and indicate a more than 60% reduction in the direct IS overheads when the BF policy is used.
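
The difference between the two policies can be sketched in a few lines of C (hypothetical names and a stub transport; this is not Paradyn code): CF sends one message per sample, while BF accumulates samples and sends one message per batch, trading a little delivery delay for far fewer per-message overheads.

    #include <stdio.h>

    #define BATCH 64

    typedef struct { double timestamp, value; int metric; } sample_t;

    static long messages_sent = 0;

    static void forward_to_frontend(const sample_t *buf, int n)   /* stub transport */
    {
        (void)buf; (void)n;
        messages_sent++;
    }

    /* collect-and-forward: one message per sample */
    void cf_deliver(const sample_t *s)
    {
        forward_to_frontend(s, 1);
    }

    /* batch-and-forward: one message per BATCH samples */
    static sample_t batch[BATCH];
    static int nbatch = 0;

    void bf_deliver(const sample_t *s)
    {
        batch[nbatch++] = *s;
        if (nbatch == BATCH) {
            forward_to_frontend(batch, nbatch);
            nbatch = 0;
        }
    }

    int main(void)
    {
        sample_t s = { 0.0, 1.0, 0 };
        for (int i = 0; i < 1000; i++) cf_deliver(&s);
        printf("CF sent %ld messages for 1000 samples\n", messages_sent);
        messages_sent = 0;
        for (int i = 0; i < 1000; i++) bf_deliver(&s);
        printf("BF sent %ld messages for 1000 samples\n", messages_sent);
        return 0;
    }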

Performance II

An Analytical Model of the HINT Performance Metric

Quinn O. Snell
John L. Gustafson

The HINT Benchmark was developed to provide a broad-spectrum metric for computers, and to measure performance over the full range of memory sizes and time scales. We have extended our understanding of why HINT performance curves look the way they do, and can now predict the curves using an analytical model based on simple hardware specifications as input parameters.

Communication Patterns and Models in Prism: A Spectral Element-Fourier Parallel Navier-Stokes Solver

Constantinos Evangelinos
George Em Karniadakis

In this paper we analyze communication patterns in the parallel three-dimensional Navier-Stokes solver Prism, and present performance results on the IBM SP2, the Cray T3D and the SGI Power Challenge XL. Prism is used for direct numerical simulation of turbulence in non-separable and multiply-connected domains. The numerical method used in the solver is based on mixed spectral element-Fourier expansions in (x,y) planes and the z-direction, respectively. Each Fourier mode (or group of modes) is computed on a separate processor, as the linear contributions (Helmholtz solves) are completely uncoupled in the incompressible Navier-Stokes equations; coupling is obtained via the nonlinear contributions (convective terms). The transfer of data between physical and Fourier space requires a series of complete exchange operations, which dominate the communication cost for small numbers of processors. As the number of processors increases, global reduction and gather operations become important while complete exchange becomes more latency dominated. Predictive models for these communication operations are proposed and tested against measurements. A relatively large variation in communication timings per iteration is observed in simulations and quantified in terms of specific operations. A number of improvements are proposed that could significantly reduce the communication overhead with increasing numbers of processors, and generic predictive maps are developed for the complete exchange operation, which remains the fundamental communication in Prism. Results presented in this paper are representative of a wider class of parallel spectral and finite element codes for computational mechanics which require similar communication operations.
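
The complete exchange mentioned above is the all-to-all personalized communication pattern; a minimal MPI sketch (ours, assuming one block of nloc doubles per peer) is given below. Whether Prism itself is layered on MPI is not stated here; the sketch only illustrates the pattern being modeled.

    #include <mpi.h>
    #include <stdlib.h>

    /* each process contributes nloc doubles destined for every other process */
    void complete_exchange(double *sendbuf, double *recvbuf, int nloc, MPI_Comm comm)
    {
        MPI_Alltoall(sendbuf, nloc, MPI_DOUBLE,
                     recvbuf, nloc, MPI_DOUBLE, comm);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int p, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int nloc = 1024;                                   /* block size per peer */
        double *send = malloc((size_t)p * nloc * sizeof(double));
        double *recv = malloc((size_t)p * nloc * sizeof(double));
        for (int i = 0; i < p * nloc; i++) send[i] = rank; /* dummy payload */

        complete_exchange(send, recv, nloc, MPI_COMM_WORLD);

        free(send); free(recv);
        MPI_Finalize();
        return 0;
    }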

The C3I Parallel Benchmark Suite - Introduction and Preliminary Results

Rakesh Jha
Richard C. Metzger
Brian VanVoorst
Luiz S. Pires
Wing Au
Minesh Amin
David A. Castanon
Vipin Kumar

Current parallel benchmarks, while appropriate for scientific applications, lack the defense relevance and representativeness for developers who are considering parallel computers for their Command, Control, Communication, and Intelligence (C3I) systems. We present a new set of compact application benchmarks which are specific to the C3I application domain. The C3I Parallel Benchmark Suite (C3IPBS) program is addressing the evaluation of not only machine performance, but also the software implementation effort. Our methodology currently draws heavily from the PARKBENCH[2] and NAS Parallel Benchmarks[1]. The paper presents the benchmarking methodology, introduces the benchmarks, and reports initial results and analysis. Finally, we describe the lessons that we have learned so far from formulating and implementing the C3I benchmarks.

Architecture and Application: The Performance of the NEC SX-4 on the NCAR Benchmark Suite

Steven W. Hammond
Richard D. Loft
Philip D. Tannenbaum

In November 1994, the NEC Corporation announced the SX-4 supercomputer. It is the third in the SX series of supercomputers and is upward compatible with the SX-3R vector processor, with enhancements for scalar processing, short vector processing, and parallel processing. In this paper we describe the architecture of the SX-4, which has an 8.0 ns clock cycle and a peak performance of 2 Gflops per processor. We also describe the composition of the NCAR Benchmark Suite, designed to evaluate computers for use on climate modeling applications. Additionally, we contrast this benchmark suite with other benchmarks. Finally, we detail the scalability and performance of the SX-4/32 on the NCAR Benchmark Suite.

Networking & Architecture

Minimal Adaptive Routing with Limited Injection on Toroidal k-ary n-cubes

Fabrizio Petrini
Marco Vanneschi

Virtual channels can be used to implement deadlock-free adaptive routing algorithms and increase network throughput. Unfortunately, they introduce asymmetries in the use of buffers of symmetric networks such as toroidal k-ary n-cubes. In this paper we present a minimal adaptive routing algorithm that tries to balance the use of the virtual channels by limiting the injection of new packets into the network. The experimental results, conducted on a 256-node torus, show that it is possible to increase the saturation point and to keep the network throughput stable at high traffic rates. The comparison with the Chaos router, a non-minimal cut-through adaptive router, shows that our algorithm obtains similar performance results using only a small fraction of the buffers and a simpler router model.

Low-Latency Communication on the IBM RISC System/6000 SP

Chi-Chao Chang
Grzegorz Czajkowski
Chris Hawblitzel
Thorsten von Eicken

The IBM SP is one of the most powerful commercial MPPs, yet, in spite of its fast processors and high network bandwidth, the SP's communication latency is inferior to that of older machines such as the TMC CM-5 or the Meiko CS-2. This paper investigates the use of Active Messages (AM) communication primitives as an alternative to the standard message passing in order to reduce communication overheads and to offer a good building block for higher layers of software. The first part of this paper describes an implementation of Active Messages (SP AM) which is layered directly on top of the SP's network adapter (TB2). With comparable bandwidth, SP AM's low overhead yields a round-trip latency that is 40% lower than IBM MPL's. The second part of the paper demonstrates the power of AM as a communication substrate by layering Split-C as well as MPI over it. Split-C benchmarks are used to compare the SP to other MPPs and show that low message overhead and high throughput compensate for the SP's high network latency. The MPI implementation is based on the freely available MPICH version and achieves performance equivalent to IBM's MPI-F on the NAS benchmarks.
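
The essence of Active Messages is that a message names a handler to run at the receiver rather than data to be matched and buffered. The toy, single-process sketch below is our illustration of that handler-table dispatch idea; the names are invented and are not the SP AM library's API.

    #include <stdio.h>

    typedef void (*am_handler_t)(int src, int arg0, int arg1);

    static am_handler_t handler_table[16];          /* registered handlers */

    /* "deliver" a message: in a real system this runs on the destination node
       as the packet is pulled off the network adapter */
    static void am_deliver(int src, int h, int arg0, int arg1)
    {
        handler_table[h](src, arg0, arg1);
    }

    /* example handler: remote increment */
    static int counter;
    static void incr_handler(int src, int amount, int unused)
    {
        (void)unused;
        counter += amount;
        printf("handler ran on behalf of node %d, counter=%d\n", src, counter);
    }

    int main(void)
    {
        handler_table[0] = incr_handler;
        am_deliver(3, 0, 5, 0);   /* node 3 "sends" an AM invoking handler 0 */
        am_deliver(7, 0, 2, 0);
        return 0;
    }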

Compiled Communication for All-optical TDM Networks

Xin Yuan
R. Melhem
R. Gupta

While all-optical networks offer large bandwidth for transferring data, the control mechanisms to dynamically establish all-optical paths incur large overhead. In this paper, we consider the problem of adapting all-optical multiplexed networks to multiprocessor or multicomputer environments by using compiled communication as an alternative to dynamic network control. In compiled communication, the network resources are managed statically, and therefore run-time control overhead is eliminated. In addition, complex offline algorithms can be incorporated to manage the network resources more efficiently. We studied several off-line connection scheduling algorithms for optimizing the multiplexing degree required to satisfy communication requests. The performance of compiled communication for communication patterns that can be determined at compile time in application programs is evaluated and compared with dynamically controlled communication assuming a two-dimensional torus topology. Our results show that compiled communication outperforms dynamic communication by a large margin for these communication patterns. Since most of the communication patterns in parallel applications can be determined at compile time, we conclude that compiled communication is an effective mechanism for all-optical networks in multiprocessor environments.

Increasing the Effective Bandwidth of Complex Memory Systems in Multivector Processors

Anna M. del Corral
Jose M. Llaberia

In multivector processors, the cycles lost due to conflicts between concurrent vector streams make the effective throughput lower than the peak throughput. When the request rate of all the concurrent vector streams to every memory module is less than or equal to the service rate, conflicts appear because concurrent vector streams reference memory modules in different orders. In addition, in a memory system where several memory modules are mapped onto every bus (complex memory system), bus conflicts are added to memory module conflicts. This paper proposes an access order for the vector stream elements that reduces the average memory access time in vector processors with complex memory systems. When the request rate is greater than the service rate, the proposed order reduces the number of lost cycles, and the effective throughput increases. In other cases, the effective throughput reaches the peak throughput. The proposed order generates the memory references in such a way that the memory modules shared by the concurrent self-conflict-free vector streams, and the sections where memory modules are mapped, are referenced using the same order.

Hydrodynamics Applications

A Parallel Cosmological Hydrodynamics Code

Paul W. Bode
Guohong Xu
Renyue Cen

The formation of galaxies and the large-scale structure of the universe by gravitational collapse is a nonlinear, multi-scale, multi-component problem. This complex process involves the dynamics of the gaseous baryons as well as of the gravitationally dominant dark matter. We discuss an implementation of a parallel, distributed-memory, cosmological hydrodynamics code using MPI message passing; it is based on the Total-Variation-Diminishing (Harten 1983) serial code of Ryu et al. (1993). This parallel code follows the motion of gas and dark matter simultaneously, combining a mesh-based Eulerian hydrodynamics code and a Particle-Mesh N-body code. A new, flexible matrix transpose algorithm is used to interchange distributed and local dimensions of the mesh. Timing results from runs on an IBM SP2 supercomputer are given.

Transient Dynamics Simulations: Parallel Algorithms for Contact Detection and Smoothed Particle Hydrodynamics

Steve Plimpton
Bruce Hendrickson
Steve Attaway
Jeff Swegle
Courtenay Vaughan
Dave Gardner

Transient dynamics simulations are commonly used to model phenomena such as car crashes, underwater explosions, and the response of shipping containers to high-speed impacts. Physical objects in such a simulation are typically represented by Lagrangian meshes because the meshes can move and deform with the objects as they undergo stress. Fluids (gasoline, water) or fluid-like materials (earth) in the simulation can be modeled using the techniques of smoothed particle hydrodynamics. Implementing a hybrid mesh/particle model on a massively parallel computer poses several difficult challenges. One challenge is to simultaneously parallelize and load-balance both the mesh and particle portions of the computation. A second challenge is to efficiently detect the contacts that occur within the deforming mesh and between mesh elements and particles as the simulation proceeds. These contacts impart forces to the mesh elements and particles which must be computed at each timestep to accurately capture the physics of interest. In this paper we describe new parallel algorithms for smoothed particle hydrodynamics and contact detection which turn out to have several key features in common. Additionally, we describe how to join the new algorithms with traditional parallel finite element techniques to create an integrated particle/mesh transient dynamics simulation. Our approach to this problem differs from previous work in that we use three different parallel decompositions, a static one for the finite element analysis and dynamic ones for particles and for contact detection. We have implemented our ideas in a parallel version of the transient dynamics code PRONTO-3D and present results for the code running on a large Intel Paragon.

Performance of a Computational Fluid Dynamics Code on NEC and Cray Supercomputers: Beyond 10 Gigaflops

Ferhat F. Hatay

The implementation and optimization of production-mode Computational Fluid Dynamics (CFD) software on NEC and Cray supercomputing platforms are discussed. The intent is to assess the impact of different computer architectures and High Performance Computing approaches while studying a problem of engineering and scientific importance. The computational model solves the viscous compressible flow equations for high-speed wall-shear layer flows. The code is very versatile and has performed extremely well on vector machines as well as on massively parallel platforms. It is written in the FORTRAN programming language using recent extensions such as HPF and FORTRAN 90. With this approach, an almost perfect level of portability is achieved while the performance on a particular supercomputing platform does not degrade.

Algorithms I

Parallel Preconditioners for Elliptic PDEs

Vivek Sarin
Ahmed Sameh

Iterative schemes for solving sparse linear systems arising from elliptic PDEs are very suitable for efficient implementation on large-scale multiprocessors. However, these methods rely heavily on effective preconditioners which must also be amenable to parallelization. In this paper, we present a novel method to obtain a preconditioned linear system which is solved using an iterative method. Each iteration comprises a matrix-vector product with k sparse matrices (k < log n), and can be computed in O(n) operations, where n is the number of unknowns. The numerical convergence properties of our preconditioner are superior to the commonly used incomplete factorization preconditioners. Moreover, unlike the incomplete factorization preconditioners, our algorithm affords a higher degree of concurrency and does not require triangular system solves, thereby achieving the dual objective of good preconditioning and efficient parallel implementation. We describe our scheme for certain linear systems with symmetric positive definite or symmetric indefinite matrices and present an efficient parallel implementation along with an analysis of the parallel complexity. Results of the parallel implementation of our algorithm will also be presented in the final version of this paper.
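
For context, the outer iteration described above fits the standard preconditioned conjugate gradient template, in which the preconditioner application is a black box (here a callback; in the authors' scheme it amounts to k sparse matrix-vector products costing O(n) work). The code below is the generic textbook loop, ours rather than the paper's implementation.

    #include <stdlib.h>
    #include <math.h>

    typedef void (*apply_t)(int n, const double *x, double *y, void *ctx);

    /* solve A x = b with preconditioner solve z = M^{-1} r; returns iterations
       used, or -1 if not converged within maxit */
    int pcg(int n, apply_t A, apply_t Minv, void *ctx,
            const double *b, double *x, double tol, int maxit)
    {
        double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
        double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);
        int iters = -1;

        for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; }   /* r = b - A*0 */
        Minv(n, r, z, ctx);
        for (int i = 0; i < n; i++) p[i] = z[i];
        double rz = 0.0;
        for (int i = 0; i < n; i++) rz += r[i] * z[i];

        for (int it = 0; it < maxit; it++) {
            A(n, p, q, ctx);                          /* q = A p */
            double pq = 0.0;
            for (int i = 0; i < n; i++) pq += p[i] * q[i];
            double alpha = rz / pq, rnorm = 0.0;
            for (int i = 0; i < n; i++) {
                x[i] += alpha * p[i];
                r[i] -= alpha * q[i];
                rnorm += r[i] * r[i];
            }
            if (sqrt(rnorm) < tol) { iters = it + 1; break; }
            Minv(n, r, z, ctx);                       /* z = M^{-1} r */
            double rznew = 0.0;
            for (int i = 0; i < n; i++) rznew += r[i] * z[i];
            double beta = rznew / rz;
            rz = rznew;
            for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
        }
        free(r); free(z); free(p); free(q);
        return iters;
    }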

Sparse LU Factorization with Partial Pivoting on Distributed Memory Machines

Cong Fu
Tao Yang

Sparse LU factorization with partial pivoting is important to many scientific applications, but the effective parallelization of this algorithm is still an open problem. The main difficulty is that partial pivoting operations make structures of L and U factors unpredictable beforehand. This paper presents a novel approach called S* for parallelizing this problem on distributed memory machines. S* incorporates static symbolic factorization to avoid run-time control overhead and uses nonsymmetric L/U supernode partitioning and amalgamation strategies to maximize the use of BLAS-3 routines. The irregular task parallelism embedded in sparse LU is exploited using graph scheduling and efficient run-time support techniques which optimize communication, overlap computation with communication and balance processor loads. The experimental results on the Cray-T3D with a set of Harwell-Boeing nonsymmetric matrices are very encouraging and good scalability has been achieved. Even compared to a highly optimized sequential code, the parallel speedups are still impressive considering the current status of sparse LU research.

Implementation of Strassen's Algorithm for Matrix Multiplication

Steven Huss-Lederman
Elaine M. Jacobson
Jeremy R. Johnson
Anna Tsao
Thomas Turnbull

In this paper we report on the development of an efficient and portable implementation of Strassen's matrix multiplication algorithm for matrices of arbitrary size. Our technique for defining the criterion which stops the recursions is more detailed than those generally used, thus allowing enhanced performance for a larger set of input sizes. In addition, we deal with odd matrix dimensions using a method whose usefulness had previously been in question and had not so far been demonstrated. Our memory requirements have also been reduced, in certain cases by 40 to more than 70 percent over other similar implementations. We measure performance of our code on the IBM RS/6000, CRAY YMP C90, and CRAY T3D single processor, and offer comparisons to other codes. Finally, we demonstrate the usefulness of our implementation by using it to perform the matrix multiplications in a large application code.
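
For reference, the recursion being tuned is the classical Strassen scheme: seven half-size multiplications combined with matrix additions, cut off below some size where a conventional multiply is faster. The sketch below is ours and is restricted to sizes that halve evenly; the CUTOFF value is a placeholder, and the paper's stopping criterion, odd-dimension handling and reduced-memory variants are not reproduced.

    #include <stdlib.h>
    #include <string.h>

    #define CUTOFF 64   /* placeholder recursion cutoff */

    static void mat_add(int n, const double *a, const double *b, double *c)
    { for (int i = 0; i < n * n; i++) c[i] = a[i] + b[i]; }

    static void mat_sub(int n, const double *a, const double *b, double *c)
    { for (int i = 0; i < n * n; i++) c[i] = a[i] - b[i]; }

    /* conventional triple loop, used below the recursion cutoff */
    static void mat_mul(int n, const double *a, const double *b, double *c)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++) s += a[i * n + k] * b[k * n + j];
                c[i * n + j] = s;
            }
    }

    /* copy quadrant (qi,qj) of the n-by-n matrix src into a contiguous h-by-h block */
    static void get_quad(int n, const double *src, int qi, int qj, double *dst)
    {
        int h = n / 2;
        for (int i = 0; i < h; i++)
            memcpy(dst + i * h, src + (qi * h + i) * n + qj * h, h * sizeof(double));
    }

    static void put_quad(int n, double *dst, int qi, int qj, const double *src)
    {
        int h = n / 2;
        for (int i = 0; i < h; i++)
            memcpy(dst + (qi * h + i) * n + qj * h, src + i * h, h * sizeof(double));
    }

    void strassen(int n, const double *A, const double *B, double *C)
    {
        if (n <= CUTOFF || n % 2 != 0) { mat_mul(n, A, B, C); return; }

        int h = n / 2;
        size_t sz = (size_t)h * h;
        double *w = malloc(19 * sz * sizeof(double));
        double *a11 = w,           *a12 = w + sz,      *a21 = w + 2 * sz,  *a22 = w + 3 * sz;
        double *b11 = w + 4 * sz,  *b12 = w + 5 * sz,  *b21 = w + 6 * sz,  *b22 = w + 7 * sz;
        double *m1  = w + 8 * sz,  *m2  = w + 9 * sz,  *m3  = w + 10 * sz, *m4  = w + 11 * sz;
        double *m5  = w + 12 * sz, *m6  = w + 13 * sz, *m7  = w + 14 * sz;
        double *t1  = w + 15 * sz, *t2  = w + 16 * sz, *c1  = w + 17 * sz, *c2  = w + 18 * sz;

        get_quad(n, A, 0, 0, a11); get_quad(n, A, 0, 1, a12);
        get_quad(n, A, 1, 0, a21); get_quad(n, A, 1, 1, a22);
        get_quad(n, B, 0, 0, b11); get_quad(n, B, 0, 1, b12);
        get_quad(n, B, 1, 0, b21); get_quad(n, B, 1, 1, b22);

        mat_add(h, a11, a22, t1); mat_add(h, b11, b22, t2); strassen(h, t1, t2, m1);
        mat_add(h, a21, a22, t1);                           strassen(h, t1, b11, m2);
        mat_sub(h, b12, b22, t2);                           strassen(h, a11, t2, m3);
        mat_sub(h, b21, b11, t2);                           strassen(h, a22, t2, m4);
        mat_add(h, a11, a12, t1);                           strassen(h, t1, b22, m5);
        mat_sub(h, a21, a11, t1); mat_add(h, b11, b12, t2); strassen(h, t1, t2, m6);
        mat_sub(h, a12, a22, t1); mat_add(h, b21, b22, t2); strassen(h, t1, t2, m7);

        mat_add(h, m1, m4, c1); mat_sub(h, c1, m5, c2); mat_add(h, c2, m7, c1);
        put_quad(n, C, 0, 0, c1);                           /* C11 = M1+M4-M5+M7 */
        mat_add(h, m3, m5, c1); put_quad(n, C, 0, 1, c1);   /* C12 = M3+M5       */
        mat_add(h, m2, m4, c1); put_quad(n, C, 1, 0, c1);   /* C21 = M2+M4       */
        mat_sub(h, m1, m2, c1); mat_add(h, c1, m3, c2); mat_add(h, c2, m6, c1);
        put_quad(n, C, 1, 1, c1);                           /* C22 = M1-M2+M3+M6 */

        free(w);
    }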

Algorithms II

Global Load Balancing with Parallel Mesh Adaption on Distributed-Memory Systems

Rupak Biswas
Leonid Oliker
Andrew Sohn

Dynamic mesh adaption on unstructured grids is a powerful tool for efficiently computing unsteady problems to resolve solution features of interest. Unfortunately, this causes load imbalance among processors on a parallel machine. This paper describes the parallel implementation of a tetrahedral mesh adaption scheme and a new global load balancing method. A heuristic remapping algorithm is presented that assigns partitions to processors such that the redistribution cost is minimized. Results indicate that the parallel performance of the mesh adaption code depends on the nature of the adaption region and show a 35.5X speedup on 64 processors when about 35% of the mesh is randomly adapted. For large-scale scientific computations, our load balancing strategy gives almost a sixfold reduction in solver execution times over non-balanced loads. Furthermore, our heuristic remapper yields processor assignments that are less than 3% off the optimal solutions but requires only 1% of the computational time.
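
The remapping problem can be illustrated with a simple greedy stand-in (ours, not the paper's heuristic): given the amount of data each new partition already has resident on each processor, repeatedly assign the unassigned pair with the largest overlap, so that the data that must move stays small.

    #include <stdio.h>

    #define P 4   /* partitions == processors, for illustration */

    /* sim[i][j] = data of new partition i already resident on processor j */
    void greedy_remap(long sim[P][P], int assign[P])
    {
        int part_done[P] = {0}, proc_done[P] = {0};
        for (int step = 0; step < P; step++) {
            long best = -1; int bi = -1, bj = -1;
            for (int i = 0; i < P; i++) {
                if (part_done[i]) continue;
                for (int j = 0; j < P; j++) {
                    if (proc_done[j]) continue;
                    if (sim[i][j] > best) { best = sim[i][j]; bi = i; bj = j; }
                }
            }
            assign[bi] = bj;                 /* partition bi stays on processor bj */
            part_done[bi] = proc_done[bj] = 1;
        }
    }

    int main(void)
    {
        long sim[P][P] = { {90, 5, 3, 2}, {10, 70, 15, 5},
                           {4, 20, 60, 16}, {1, 5, 22, 72} };
        int assign[P];
        greedy_remap(sim, assign);
        for (int i = 0; i < P; i++)
            printf("partition %d -> processor %d\n", i, assign[i]);
        return 0;
    }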

Parallel Hierarchical Solvers and Preconditioners for Boundary Element Methods

Ananth Grama
Vipin Kumar
Ahmed Sameh

The method of moments is an important tool for solving boundary integral equations arising in a variety of applications. It transforms the physical problem into a dense linear system. Due to the large number of variables and the associated computational requirements, these systems are solved iteratively using methods such as GMRES, CG and its variants. The core operation of thes itertive solvers is the application of the system matrix to a vector. This requres O(n2) operations and memory using accurate dense methods. The computational complexity can be reduced to O(n log n) and the memory requirement to O(n) using hierarchical approximation techniques. The algorithmic speedup from approximation can be combined with parallelism to yield very fast dense solvers. In this paper, we present efficient parallel formulations of dense iterative solvers based on hierarchical approximations for solving the integral form of Laplace equation. We study the impact of various parameters on the accuracy and performance of the parallel solver. We present two preconditioning techniques for accelerating the convergence of the iterative solver. Thes techniques are based on an inner-outer scheme and a block diagonal scheme based on a truncated Green's function. We present detailed experimental results on up to 256 processors of a Cray T3D.

Parallel Multilevel k-way Partitioning Scheme for Irregular Graphs

George Karypis
Vipin Kumar

In this paper we present a parallel formulation of a multilevel k-way graph partitioning algorithm. The multilevel k-way partitioning algorithm reduces the size of the graph by collapsing vertices and edges (coarsening phase), finds a k-way partition of the smaller graph, and then constructs a k-way partition for the original graph by projecting and refining the partition to successively finer graphs (uncoarsening phase). A key innovative feature of our parallel formulation is that it utilizes graph coloring to effectively parallelize both the coarsening and the refinement during the uncoarsening phase. Our algorithm is able to achieve a high degree of concurrency, while maintaining the high quality partitions produced by the serial algorithm. We test our scheme on a large number of graphs from finite element methods and transportation domains. Our parallel formulation on the Cray T3D produces high quality 128-way partitions on 128 processors in a little over two seconds, for graphs with a million vertices. Thus our parallel algorithm makes it possible to perform dynamic graph partitioning in adaptive computations without compromising quality.

Parallel Programming Support

Double Standards: Bringing Task Parallelism to HPF Via the Message Passing Interface

Ian Foster
David R. Kohr, Jr.
Rakesh Krishnaiyer
Alok Choudhary

High Performance Fortran (HPF) does not allow efficient expression of mixed task/data-parallel computations or the coupling of separately compiled data-parallel modules. In this paper, we show how a coordination library implementing the Message Passing Interface (MPI) can be used to represent these common parallel program structures. This library allows data-parallel tasks to exchange distributed data structures using calls to simple communication functions. We present microbenchmark results that characterize the performance of this library and that quantify the impact of optimizations that allow reuse of communication schedules in common situations. In addition, results from two-dimensional FFT, convolution, and multiblock programs demonstrate that the HPF/MPI library can provide performance superior to that of pure HPF. We conclude that this synergistic combination of two parallel programming standards represents a useful approach to task parallelism in a data-parallel framework, increasing the range of problems addressable in HPF without requiring complex compiler technology.
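
Underneath such a coordination library, coupling two data-parallel tasks comes down to splitting the processes into groups and exchanging distributed data between corresponding ranks. The sketch below is ours, in plain MPI rather than the HPF/MPI library itself; it assumes an even number of processes and identical array distributions in both tasks.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int world_rank, world_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        int task = world_rank < world_size / 2 ? 0 : 1;   /* two equal task groups */
        MPI_Comm task_comm;                               /* each task computes    */
        MPI_Comm_split(MPI_COMM_WORLD, task, world_rank, &task_comm);

        /* partner = the rank holding the matching piece of the array in the
           other task (same distribution assumed on both sides) */
        int partner = task == 0 ? world_rank + world_size / 2
                                : world_rank - world_size / 2;

        double local[256];
        for (int i = 0; i < 256; i++) local[i] = world_rank;

        /* producer task 0 sends its section; consumer task 1 receives it */
        MPI_Status status;
        if (task == 0)
            MPI_Send(local, 256, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(local, 256, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &status);

        MPI_Comm_free(&task_comm);
        MPI_Finalize();
        return 0;
    }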

OMPI: Optimizing MPI Programs Using Partial Evaluation

Hirotaka Ogawa
Satoshi Matsuoka

MPI is gaining acceptance as a standard for message-passing in high-performance computing, due to its powerful and flexible support of various communication styles. However, the complexity of its API introduces significant software overhead, and as a result, the applicability of MPI has been restricted to rather regular, coarse-grained computations. Our OMPI (Optimizing MPI) system removes much of the excess overhead by employing partial evaluation techniques, which exploit static information about MPI calls. Because partial evaluation alone is insufficient, we also utilize template functions for further optimization. To validate the effectiveness of our OMPI system, we performed baseline as well as more extensive benchmarks on a set of application cores with different communication characteristics, on the 64-node Fujitsu AP1000 MPP. Benchmarks show that OMPI improves execution efficiency by as much as a factor of two for communication-intensive application cores with minimal code increase. It also performs significantly better than previous dynamic optimization techniques.

Particle-in-Cell Simulation Codes in High Performance Fortran

Erol Akarsu
Kivanc Dincer
Tomasz Haupt
Geoffrey C. Fox

Particle-in-Cell (PIC) plasma simulation codes model the interaction of charged particles with the surrounding electrostatic and magnetic fields. Their computational requirements have placed them among the grand-challenge problems facing the high-performance computing community. In this paper we present the implementation of 1-D and 2-D electrostatic PIC codes in High Performance Fortran (HPF) on an IBM SP-2. HPF extends Fortran 90 with data distribution and alignment directives and data-parallel statements. It is a powerful language for writing portable and high-performance programs across many platforms. We used one of the most successful commercial HPF compilers currently available in the market and augmented the compiler's missing HPF functions with extrinsic routines when necessary. We obtained near linear speed-up in all of our test cases. The performance of the HPF programs is comparable to the native message passing implementations of the same codes on the SP-2.

Scheduling

Application-Level Scheduling on Distributed Heterogeneous Networks

Francine D. Berman
Rich Wolski
Silvia Figueira
Jennifer Schopf
Gary Shao

Heterogeneous networks are increasingly being used as platforms for resource-intensive distributed parallel applications. A critical contributor to the performance of such applications is the scheduling of constituent application tasks on the network. Since often the distributed resources cannot be brought under the control of a single global scheduler, the application must be scheduled by the user. To obtain the best performance, the user must take into account both application-specific and dynamic system information in developing a schedule which meets his or her performance criteria. In this paper, we define a set of principles underlying application-level scheduling and describe our work-in-progress building AppLeS (application-level scheduling) agents. We illustrate the application-level scheduling approach with a detailed description and results for a distributed 2D Jacobi application on two production heterogeneous platforms.

NetSolve: A Network Server for Solving Computational Science Problems

Henri Casanova
Jack Dongarra

This paper presents a new system, called NetSolve, that allows users to access computational resources, such as hardware and software, distributed across the network. The development of NetSolve was motivated by the need for an easy-to-use, efficient mechanism for using computational resources remotely. Ease of use is obtained as a result of different interfaces, some of which require no programming effort from the user. Good performance is ensured by a load-balancing policy that enables NetSolve to use the computational resources available as efficiently as possible. NetSolve offers the ability to look for computational resources on a network, choose the best one available, solve a problem (with retry for fault-tolerance), and return the answer to the user.

Multimethod Communication for High-performance Metacomputing Applications

Ian Foster
Jonathan Geisler
Carl Kesselman
Steven Tuecke

Metacomputing systems use high-speed networks to connect supercomputers, mass storage systems, scientific instruments, and display devices with the objective of enabling parallel applications to utilize geographically distributed computing resources. However, experience shows that high performance can often be achieved only if applications can integrate diverse communication substrates, transport mechanisms, and protocols, chosen according to where communication is directed, what is communicated, or when communication is performed. In this paper, we describe a software architecture that addresses this requirement. This architecture allows multiple communication methods to be supported transparently in a single application, with either automatic or user-specified selection criteria guiding the methods used for each communication. We describe an implementation of this architecture, based on the Nexus communication library, and use this implementation to evaluate performance issues. This implementation was used to support a wide variety of applications in the I-WAY metacomputing experiment at Supercomputing '95; we use one of these applications to provide a quantitative demonstration of the advantages of multimethod communication in a heterogeneous networked environment.

Building a World-Wide Virtual Machine Based on Web and HPCC Technologies

Kivanc Dincer
Geoffrey C. Fox

In today's high performance computing arena, there is a strong trend toward building virtual computers from heterogeneous resources on a network. In this paper we describe our experiences in building a world-wide virtual machine (WWVM) based on emerging Web and existing HPCC technologies. We have constructed a Web-based parallel/distributed programming environment on top of this machine demonstrating MPI and PVM message-passing programs and High Performance Fortran programs. Alternatively, the WWVM can be configured as a metacomputer for the solution of metaproblems.

Data Mining & Modeling

Parallel Data Mining for Association Rules on Shared-memory Multi-processors

M. J. Zaki
M. Ogihara
S. Parthasarathy
W. Li

Data mining is an emerging research area, whose goal is to extract significant patterns or interesting rules from large databases. High-level inference from large volumes of routine business data can provide valuable information to businesses, such as customer buying patterns, shelving criteria in supermarkets and stock trends. Many algorithms have been proposed for data mining of association rules. However, research so far has mainly focused on sequential algorithms. In this paper we present parallel algorithms for data mining of association rules, and study the degree of parallelism, synchronization, and data locality issues on the SGI Power Challenge shared-memory multi-processor. We further present a set of optimizations for the sequential and parallel algorithms. Experiments show that a significant improvement of performance is achieved using our proposed optimizations. We also achieved good speed-up for the parallel algorithm, but we observe a need for parallel I/O techniques for further performance gains.
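
The kernel inside association-rule mining is counting the support of candidate itemsets over the transaction database. The sketch below is ours (not one of the paper's algorithms) and shows a shared-memory version of that kernel for candidate pairs: transactions are partitioned among threads and per-thread counters are merged at the end, avoiding locks on shared counters.

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define NITEMS   8
    #define NTRANS   6
    #define NTHREADS 2

    /* toy transaction database: bitmaps of purchased items */
    static const unsigned trans[NTRANS] = { 0x03, 0x07, 0x0c, 0x03, 0x0b, 0x06 };

    typedef struct { int lo, hi; long cnt[NITEMS][NITEMS]; } work_t;

    static void *count_pairs(void *arg)
    {
        work_t *w = arg;
        for (int t = w->lo; t < w->hi; t++)          /* local transaction range */
            for (int i = 0; i < NITEMS; i++)
                if (trans[t] & (1u << i))
                    for (int j = i + 1; j < NITEMS; j++)
                        if (trans[t] & (1u << j))
                            w->cnt[i][j]++;          /* private counter, no locks */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        work_t w[NTHREADS];
        memset(w, 0, sizeof w);
        for (int k = 0; k < NTHREADS; k++) {
            w[k].lo = k * NTRANS / NTHREADS;
            w[k].hi = (k + 1) * NTRANS / NTHREADS;
            pthread_create(&tid[k], NULL, count_pairs, &w[k]);
        }
        long total[NITEMS][NITEMS] = {{0}};
        for (int k = 0; k < NTHREADS; k++) {
            pthread_join(tid[k], NULL);
            for (int i = 0; i < NITEMS; i++)
                for (int j = 0; j < NITEMS; j++)
                    total[i][j] += w[k].cnt[i][j];   /* merge local counts */
        }
        printf("support of {0,1} = %ld\n", total[0][1]);
        return 0;
    }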

Dynamic Computation Migration in DSM Systems

Wilson C. Hsieh
M. Frans Kaashoek
William E. Weihl

We describe dynamic computation migration, the runtime choice between computation and data migration. Dynamic computation migration is useful for concurrent data structures with unpredictable read/write patterns. We implemented it in MCRL, a multithreaded DSM system that runs on the MIT Alewife machine and Thinking Machines' CM-5. We evaluate two dynamic migration heuristics relative to data migration. On a concurrent, distributed B-tree with 50% lookups and 50% inserts, the STATIC heuristic improves performance by about 17%, on both Alewife and the CM-5. The REPEAT heuristic generally performs better than the STATIC heuristic. On Alewife, with 80% lookups and 20% inserts, the REPEAT heuristic improves performance by 23%; on the CM-5, it improves performance by 46%. Our results apply to concurrent, dynamic data structures whose access patterns are only known at runtime. For regularly accessed data structures, static methods will always be applicable, but we expect future applications to be more dynamic.

Performance Modeling for the Panda Array I/O Library

Ying Chen
Marianne Winslett
Szu-wen Kuo
Yong Cho
Mahesh Subramaniam
Kent Seamons

We present an analytical performance model for Panda, a library for synchronized I/O of large multidimensional arrays on parallel and sequential platforms, and show how the Panda developers use this model to evaluate Panda's parallel I/O performance and guide future Panda development. The model validation shows that system developers can simplify performance analysis, identify potential performance bottlenecks, and study the design trade-offs for Panda on massively parallel platforms more easily than by conducting empirical experiments. More importantly, we show that the outputs of the performance model can be used to help make optimal plans for handling application I/O requests, the first step toward our long-term goal of automatically optimizing I/O request handling in Panda.

Striping in Disk Array RM2 Enabling the Tolerance of Double Disk Failures

Chan-Ik Park
Tae-Young Choe

There is a growing demand for reliability beyond what current RAID can provide, and there are various levels of user demand for data reliability. An efficient data placement scheme called RM2 has been proposed in [Park95], which makes a disk array system tolerant of double disk failures. In this paper, we consider how to choose an optimal striping unit for RM2, particularly when no workload information is available except the read/write ratio. A disk array simulator for RM2 has been developed for experimental work. It is shown that RM2 has an optimal striping unit of two and a half tracks in the case of disk read operations, and one third of a single track if any disk write operations are involved.

Gordon Bell Finalists

Simulation of the 3 Dimensional Cascade Flow with Numerical Wind Tunnel (NWT)

Takashi Nakamura
Toshiyuki Iwamiya
Masahiro Yoshida
Yuichi Matsuo
Masahiro Fukuda

The NWT was upgraded to a theoretical peak performance of 280 GFLOPS with the addition of 26 PEs to the original 140 PEs. On a CFD simulation of a jet engine compressor, we attained an effective performance of 111 GFLOPS using 160 PEs.

N-body Simulation of Galaxy Formation on GRAPE-4 Special-Purpose Computer

Toshiyuki Fukushige
Junichiro Makino

We report on recent N-body simulations of galaxy formation performed on the GRAPE-4 (GRAvity PipE 4) system, a special-purpose computer for astrophysical N-body simulations. We review the astrophysical motivation, the algorithm, the actual performance, and the price per performance. The performance obtained is 332 Gflops averaged over 185 hours for a simulation of a galaxy formation with 786,400 particles. The price per performance obtained is 4,600 dollars per Gflops. The configuration used for the simulation consists of 1,269 pipeline processors and has a peak speed of 663 Gflops.

Electronic Structure of Materials Using Self-Interaction Corrected Density Functional Theory

Adolfy Hoisie
Stefan Goedecker
Jurg Hutter

We have developed a highly efficient electronic structure code for parallel computers using message passing. The algorithm takes advantage of the natural parallelism in quantum chemistry problems to obtain very high performance even on a large number of processors. Most of the terms which scale cubically with respect to the number of atoms have been eliminated, allowing the treatment of very large systems. It uses one of the most precise versions of Density Functional Theory, namely Self-Interaction Corrected Density Functional Theory. On a 6-processor Silicon Graphics Symmetric Multiprocessor based on the MIPS R8000 microprocessor, we obtain a performance of 6.3 Gflops per million dollars.

Particle Dynamics

Lightweight Computational Steering of Very Large Scale Molecular Dynamics Simulations

David M. Beazley
Peter S. Lomdahl

We present a computational steering approach for controlling, analyzing, and visualizing very large scale molecular dynamics simulations involving tens to hundreds of millions of atoms. Our approach relies on extensible scripting languages and an easy to use tool for building extensions and modules. The system is extremely easy to modify, works with existing C code, is memory efficient, and can be used from inexpensive workstations and networks. We demonstrate how we have used this system to manipulate data from production MD simulations involving as many as 104 million atoms running on the CM-5 and Cray T3D. We also show how this approach can be used to build systems that integrate common scripting languages (including Tcl/Tk, Perl, and Python), simulation code, user extensions, and commercial data analysis packages.

Design of a Large Scale Discrete Element Soil Model for High Performance Computing Systems

Alex R. Carrillo
David A. Horner
John F. Peters
John E. West

The Discrete Element Model (DEM) is an alternative to the classical continuum mechanics approach for problems with flow-like deformations of solids. DEM is especially well-suited for problems such as soil plowing where the medium is subject to large discontinuous deformations. The goal of this project was to develop a large-scale three-dimensional DEM soil modeling system capable of handling problems with potentially several million particles. This paper discusses the development of that model and the factors driving it toward a High Performance Computing (HPC) solution. It also discusses how work in progress is making use of heterogeneous HPC environments and high-speed networks to produce a real-time user interaction and visualization capability.

Molecular Simulation of Rheological Properties Using Massively Parallel Supercomputers

R. K. Bhupathiraju
S. T. Cui
S. Gupta
H. D. Cochran
P. T. Cummings

Advances in parallel supercomputing now make possible molecular-based engineering and science calculations that will soon revolutionize many technologies, such as those involving polymers and those involving aqueous electrolytes. We have developed a suite of message-passing codes for classical molecular simulation of such complex fluids and amorphous materials and have completed a number of demonstration calculations of problems of scientific and technological importance with each (described at the World Wide Web site http://flory.engr.utk.edu/ldrd). In this paper, we will focus on the molecular simulation of rheological properties, particularly viscosity, of simple and complex fluids using parallel implementations of non-equilibrium molecular dynamics. Such calculations represent significant challenges computationally because, in order to reduce the thermal noise in the calculated properties within acceptable limits, large systems and/or long simulated times are required.

Data & Scheduling

Virtual Memory Versus File Interfaces for Large, Memory-intensive Scientific Applications

Yoonho Park
Ridgway Scott
Stuart Sechrest

Scientific applications often require some strategy for temporary data storage to do the largest possible simulations. The use of virtual memory for temporary data storage has received criticism because of performance problems. However, modern virtual memory implementations found in recent operating systems such as the Cenju-3/DE give application writers control over virtual memory policies. We demonstrate that custom virtual memory policies can dramatically reduce virtual memory overhead and allow applications to run out-of-core efficiently. We also demonstrate that the main advantage of virtual memory, namely programming simplicity, is not lost.

Impact of Job Mix on Optimizations for Space Sharing Schedulers

Jaspal Subhlok
Thomas Gross
Takashi Suzuoka

This paper addresses job scheduling for parallel supercomputers. Modern parallel systems with n nodes can be used by jobs requesting up to n nodes. If fewer than n nodes are requested, multiple jobs can be run at the same time, allowing several users to use the system. One of the challenges for the operating system is to give reasonable service to a diverse group of requests. A single 1-node job that is running for a long time may effectively block the whole machine if the next job requests all n nodes. To date, various policies have been proposed for scheduling highly parallel computers. But as the users of current supercomputers know, these policies work far from perfectly. This paper reports on the measurement of the usage of a 96-node Intel Paragon at ETH Zurich, a 512-node IBM SP2 at the Cornell Theory Center, and a 512-node Cray T3D at the Pittsburgh Supercomputing Center. We discuss the common characteristics of the different workloads and identify their impact on job scheduling techniques for such parallel systems. The metrics used for evaluating scheduling are based on turnaround time and fairness among jobs. We specifically show how two simple scheduling optimizations based on reordering the waiting queue can be used to effectively improve scheduling performance on real workloads. An important contribution of this paper is to establish that supercomputer workloads do exhibit some common characteristics but they also differ in important ways, and that knowledge of workloads is important for the design of effective scheduling algorithms. Given the current ad-hoc approach applied to tuning scheduling systems, these results are of interest to scheduling researchers, supercomputer installations, and developers of scheduling software.
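
As a concrete (and deliberately simplistic) example of a queue-reordering optimization, ours rather than necessarily one of the two evaluated in the paper, a scheduler can scan past a blocked head-of-queue job and start the first waiting job that fits in the currently idle nodes:

    #include <stdio.h>

    typedef struct { int id, nodes; } job_t;

    /* return the index of the first queued job that fits, or -1 if none fits */
    int pick_job(const job_t *queue, int qlen, int free_nodes)
    {
        for (int i = 0; i < qlen; i++)
            if (queue[i].nodes <= free_nodes)
                return i;                       /* may run ahead of earlier jobs */
        return -1;
    }

    int main(void)
    {
        job_t queue[] = { {1, 96}, {2, 8}, {3, 32} };
        int pick = pick_job(queue, 3, 40);      /* 40 idle nodes on the machine */
        if (pick >= 0)
            printf("start job %d on %d nodes\n", queue[pick].id, queue[pick].nodes);
        return 0;
    }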