The full text of the technical papers is available in the SC96 proceedings in HTML and PostScript formats.
Parallel Hierarchical Molecular Structure Estimation
Cheng Che Chen, Stanford University; Jaswinder Pal Singh, Princeton
University; Russ B. Altman, Stanford University
Determining the three-dimensional structure of biological molecules such as proteins and nucleic acids is an important element of molecular biology because of the intimate relation between form and function of these molecules. Individual sources of data about molecular structure are subject to varying degrees of uncertainty. Previously we have examined the parallelization of a probabilistic algorithm for combining multiple sources of uncertain data to estimate the structure of molecules and also predict a measure of the uncertainty in the estimated structure. In this paper we extend our work on two fronts. First, we present a hierarchical decomposition of the original algorithm which dramatically reduces the sequential computational complexity. The hierarchical decomposition in turn reveals a new axis of parallelism not present in the "flat" organization of the problem, as well as new parallelization issues. We demonstrate good speedups on two cache-coherent shared-memory multiprocessors, the Stanford DASH and the SGI Challenge, with distributed and centralized memory organization, respectively. Our results point to several areas of further study to make both the hierarchical and the parallel aspects more flexible for general problems: automatic structure decomposition, processor load balancing across the hierarchy, and data locality management in conjunction with load balancing. Finally, we outline the directions we are investigating to incorporate these extensions.
A Data-Parallel Implementation of O(N) Hierarchical N-Body Methods
Yu Hu, Harvard University; S. Lennart Johnsson, University of Houston and Harvard University
The O(N) hierarchical N-body algorithms and massively parallel processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We present a data-parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10-25%, and the overall efficiency is about 35%. The evaluation of the potential field of a system of 100 million particles takes 3 minutes and 15 minutes on a 256 node CM-5E, giving expected four and seven digits of accuracy, respectively. The speed of the code scales linearly with the number of processors and number of particles.
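The general idea behind hierarchical N-body methods can be illustrated with a toy sketch. The paper implements Anderson's O(N) method on the CM-5; the 1D Barnes-Hut-style evaluation below is not that method, only a minimal illustration (all names hypothetical) of the shared principle: approximating a distant group of particles by a single aggregate, here its center of mass.

```python
# Toy sketch of hierarchical far-field approximation (NOT Anderson's method):
# distant cells are replaced by their total mass at the center of mass.

def potential(targets, sources, theta=0.5):
    """Approximate 1/r potential at each target from a sorted list of 1D sources."""
    def build(lo, hi):
        pts = sources[lo:hi]
        node = {"lo": pts[0], "hi": pts[-1],
                "m": len(pts), "com": sum(pts) / len(pts)}
        if len(pts) > 1:
            mid = (lo + hi) // 2
            node["kids"] = [build(lo, mid), build(mid, hi)]
        return node

    root = build(0, len(sources))

    def accumulate(x, node):
        size = node["hi"] - node["lo"]
        dist = abs(x - node["com"])
        # Far enough away (or a single particle): use one aggregate term.
        if dist > 0 and ("kids" not in node or size / dist < theta):
            return node["m"] / dist
        if "kids" not in node:
            return 0.0  # skip self-interaction
        return sum(accumulate(x, k) for k in node["kids"])

    return [accumulate(x, root) for x in targets]
```

For well-separated targets this visits O(log N) cells per target instead of all N particles, which is the source of the asymptotic savings.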
The Design of a Portable Scientific Tool: A Case Study Using SnB
Steven M. Gallo, Russ Miller, State University of New York at Buffalo; Charles M. Weeks, Hauptman Woodward Medical Research Institute
Developing and maintaining a large software package is a complex task. Decisions are made early in the design process that affect 1) the ability of a user to effectively exploit the package and 2) the ability of a software engineer to maintain it. This case study discusses issues in software development and maintainability of a scientific package called SnB, which is used to determine molecular crystal structures. The design of the user interface is discussed along with software engineering concepts, including modular programming, data encapsulation and internal code documentation. Issues concerning the integration of Fortran, a language that is still widely used in the scientific community, into a modern scientific application with a C-based user interface are also discussed. Scientific applications benefit from being available on a wide variety of platforms. Due to demand, SnB is available on a variety of sequential and parallel platforms. Methods used in the design of SnB for such portability are presented, including POSIX compliance, automatic configuration scripts and parallel programming techniques.
Runtime Performance of Parallel Array Assignment: An Empirical Study
Lei Wang, The University of North Carolina at Chapel Hill; James M. Stichnoth, Carnegie Mellon University;
Siddhartha Chatterjee, The University of North Carolina at Chapel Hill
Generating code for the array assignment statement of High Performance Fortran (HPF) in the presence of block-cyclic distributions of data arrays is considered difficult, and several algorithms have been published to solve this problem. We present a comprehensive study of the run-time performance of the code these algorithms generate. We classify these algorithms into several families, identify several issues of interest in the generated code and present experimental performance data for the various algorithms. We demonstrate that the code generated for block-cyclic distributions runs almost as efficiently as that generated for block or cyclic distributions.
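The ownership rule that makes block-cyclic code generation hard can be stated compactly. The sketch below is illustrative (function names are ours, not the paper's): under a cyclic(b) distribution of an array over P processors, consecutive blocks of b elements are dealt out round-robin.

```python
# Block-cyclic (cyclic(b)) ownership over P processors -- the distribution
# whose generated code the paper benchmarks. Names are illustrative.

def owner(i, b, P):
    """Processor owning global index i under a cyclic(b) distribution."""
    return (i // b) % P

def local_indices(p, n, b, P):
    """Global indices of an n-element array that processor p stores."""
    return [i for i in range(n) if owner(i, b, P) == p]
```

Block distribution is the special case b = ceil(n/P) (each processor gets one contiguous block), and cyclic is b = 1; the difficulty the paper studies arises for general b, where a regular array section maps to a non-trivially strided set of local elements.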
ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance
Laura Susan Blackford, University of Tennessee, Knoxville; Jaeyoung Choi, Soongsil University; A. Cleary, A. Petitet, R. C. Whaley, University of Tennessee, Knoxville; J. Demmel, I. Dhillon, K. Stanley, University of California, Berkeley; Jack Dongarra, University of California, Berkeley; S. Hammarling, University of Tennessee, Knoxville; G. Henry, Intel SSPD; D. Walker, Oak Ridge National Laboratory
This paper outlines the content and performance of ScaLAPACK, a collection of mathematical software for linear algebra computations on distributed memory computers. The importance of developing standards for computational and message passing interfaces is discussed. We present the different components and building blocks of ScaLAPACK, and indicate the difficulties inherent in producing correct codes for networks of heterogeneous processors. Finally, this paper briefly describes future directions for the ScaLAPACK library and concludes by suggesting alternative approaches to mathematical libraries, explaining how ScaLAPACK could be integrated into efficient and user-friendly distributed systems.
Network Performance Modeling for PVM Clusters
Mark J. Clement, Michael R. Steed, Brigham Young University; Phyllis E. Crandall, University of Connecticut
The advantages of workstation clusters as a parallel computing platform include a superior price-performance ratio, availability, scalability and ease of incremental growth. However, the performance of traditional LAN technologies such as Ethernet and FDDI rings is insufficient for many parallel applications. This paper describes APACHE (Automated Pvm Application CHaracterization Environment), an automated analysis system that uses an application-independent model for predicting the impact of ATM on the execution time of iterative parallel applications. APACHE has been used to predict the performance of several core applications that form the basis for many real scientific and engineering problems. We present a comparison of the performance predicted by APACHE with observed execution times to demonstrate the accuracy of our model. Finally, we present a method for a simple cost-benefit analysis that can be used to determine whether an investment in ATM equipment is justified for a particular workstation cluster environment.
VISUALIZATION & EDUCATION SESSION
Chair: Barbara Horner-Miller, Jet Propulsion Lab/Caltech
Scalable Parallel Algorithms for Interactive Visualization of Curved Surfaces
Subodh Kumar, Chun-Fa Chang, Dinesh Manocha, University of North Carolina
We present efficient parallel algorithms for interactive display of higher order surfaces on current graphics systems. At each frame, these algorithms approximate the surface by polygons and rasterize them over the graphics pipeline. The time for polygon generation for each surface primitive varies between successive frames, and we address issues in distributing the load across processors for different environments. This includes algorithms to statically distribute the primitives to reduce dynamic load imbalance, as well as a distributed wait-free algorithm for machines on which redistribution is efficient, e.g., shared-memory machines. These algorithms have been implemented on different graphics systems and applied to interactive display of trimmed spline models. In practice, we are able to obtain almost linear speed-ups (as a function of number of processors). Moreover, the distributed wait-free algorithm is faster by 25-30% as compared to static and dynamic schemes.
STREN: A Highly Scalable Parallel Stereo Terrain Renderer for Planetary Mission Simulations
Ansel Teng, Jet Propulsion Lab; Scott Whitman, Cray Research Inc.; Meemong Lee, Jet Propulsion Lab
In this paper, we describe STREN, a parallel stereo renderer for fixed-location terrain rendering tasks required for the simulation of planetary exploration missions. The renderer is based on a novel spatial data representation, called the TANPO map. This data representation stores terrain data using a simple and compact structure and provides excellent locality for such rendering applications. Experimental results show that the renderer not only performs very well, but also scales perfectly to different numbers of processors.
Education In High Performance Computing Via The WWW: Designing And Using Technical Materials Effectively
Susan Mehringer, Cornell Theory Center
Cornell Theory Center, a national center for high performance computing, has been designing and delivering education programs on high-performance computing in traditional workshops for over ten years. With the advent and growth of the World Wide Web, we have been able to expand our training efforts to a distance education format, including online lectures and exercises, communication with CTC consultants and other participants and logins on CTC's world-class IBM RS/6000 SP. This description includes workshop design, technical content covered, design of the modules and participants' response.
Compiler-Directed Shared-Memory Communication For Iterative Parallel Applications
Many scientific applications are iterative and specify repetitive communication patterns. This paper shows how a parallel-language compiler and custom cache-coherence protocols in a distributed shared memory system together can implement shared-memory communication efficiently for applications with unpredictable but repetitive communication patterns. The compiler uses data-flow analysis to identify program points where repetitive communication occurs. At runtime, the custom protocol builds communication schedules in one iteration and uses them to pre-send data in following iterations. This paper contains measurements on three iterative applications (including adaptive programs with unstructured data accesses) to show that custom protocols increase the number of shared-data requests satisfied locally, thus reducing the amount of time spent waiting for remote data.
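The record-and-replay idea can be made concrete with a small simulation. This is our own illustrative sketch, not the paper's protocol code: the first iteration records which remote blocks each processor fetched; before each later iteration, those blocks are pushed to their consumers so the requests are satisfied locally.

```python
# Toy model of schedule-based pre-sending: remote fetches in iteration 0
# are recorded, then replayed as pre-sends in every subsequent iteration.

class RepetitiveCommSim:
    def __init__(self):
        self.schedule = []       # (consumer, block) pairs recorded in iter 0
        self.recording = True
        self.inbox = {}          # consumer -> blocks already delivered
        self.remote_fetches = 0  # requests that had to go over the network

    def read(self, consumer, block):
        if block in self.inbox.get(consumer, set()):
            return               # satisfied locally by an earlier pre-send
        self.remote_fetches += 1
        if self.recording:
            self.schedule.append((consumer, block))

    def end_iteration(self):
        self.recording = False
        # Pre-send every recorded block before the next iteration starts.
        for consumer, block in self.schedule:
            self.inbox.setdefault(consumer, set()).add(block)
```

Because the pattern is repetitive, every fetch after the first iteration hits the local inbox, which is exactly the effect the paper measures as an increase in locally satisfied shared-data requests.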
Dynamic Data Distribution With Control Flow Analysis
Jordi Garcia, Eduard Ayguade, Jesus Labarta, Universitat Politecnica de Catalunya
Data distribution is one of the key aspects in the design of a parallelizing compiler for a distributed-memory architecture to get efficiency from the system. This paper describes the design of a data distribution tool which automatically derives the data mapping for the arrays and the parallelization strategy for the loops in a Fortran 77 program. The layout generated can be static or dynamic, and the distribution is one-dimensional BLOCK or CYCLIC. The tool takes into account the control flow statements in the code in order to better estimate the behavior of the program. All the information regarding data movement and parallelism is contained in a single data structure named the Communication-Parallelism Graph (CPG). The CPG is used to model a minimal path problem in which time is the objective function to minimize. It is solved using a general purpose linear programming solver, which finds the optimal solution for the whole problem. Experimental results illustrate the quality of the solutions generated.
Transformations For Imperfectly Nested Loops
Induprakas Kodukula, Keshav Pingali, Cornell University
Loop transformations are critical for compiling high-performance code for modern computers. Existing work has focused on transformations for perfectly nested loops (that is, loops in which all assignment statements are contained within the innermost loop of a loop nest). In practice, most loop nests, such as those in matrix factorization codes, are imperfectly nested. In some programs, imperfectly nested loops can be converted into perfectly nested loops by loop distribution, but this is not always legal. In this paper, we present an approach to transforming imperfectly nested loops directly. Our approach is an extension of the linear loop transformation framework for perfectly nested loops, and it models permutation, reversal, skewing, scaling, alignment, distribution and jamming.
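The distinction between perfect and imperfect nests, and the loop-distribution escape hatch mentioned above, can be shown in executable form. This is a generic illustration (our own example, not from the paper): `imperfect` has a statement between the two loop headers, as in many factorization codes; `distributed` splits it into two perfectly nested loops, which is legal here because the hoisted statement carries no dependence with the inner loop across iterations.

```python
# An imperfectly nested loop and its legal distribution into perfect nests.

def imperfect(a):
    n = len(a)
    for i in range(n):
        a[i][i] += 1.0              # statement outside the innermost loop
        for j in range(n):
            a[i][j] *= 2.0
    return a

def distributed(a):
    n = len(a)
    for i in range(n):              # perfectly nested loop 1
        a[i][i] += 1.0
    for i in range(n):              # perfectly nested loop 2
        for j in range(n):
            a[i][j] *= 2.0
    return a
```

Both versions update each diagonal element before doubling its row, so they compute the same result. The paper's point is that such distribution is not always legal, motivating transformations that handle imperfect nests directly.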
Earthquake Ground Motion Modeling On Parallel Computers
Hesheng Bao, Jacobo Bielak, Omar Ghattas, Loukas F. Kallivokas, David R. O'Hallaron, Jonathan R. Shewchuk, Jifeng Xu, Carnegie Mellon University
We describe the design and discuss the performance of a parallel elastic wave propagation simulator that is being used to model earthquake-induced ground motion in large sedimentary basins. The components of the system include mesh generators, a mesh partitioner, a parceler and a parallel code generator, as well as parallel numerical methods for applying seismic forces, incorporating absorbing boundaries, and solving the discretized wave propagation problem. We discuss performance on the Cray T3D for unstructured mesh wave propagation problems of up to 77 million tetrahedra. By paying careful attention to each step of the process, we obtain excellent performance despite the highly irregular structure of the problem. The mesh generator, partitioner, parceler and code generator collectively form an integrated toolset called Archimedes, which automates the solution of unstructured mesh PDE problems on parallel computers, and is being used for other unstructured mesh applications beyond ground motion modeling.
Performance Analysis And Optimization On The UCLA Parallel Atmospheric General Circulation Model Code
John Lou, California Institute of Technology; John Farrara, University of California
An analysis is presented of several factors influencing the performance of a parallel implementation of the UCLA atmospheric general circulation model (AGCM) on massively parallel computer systems. Several modifications to the parallel AGCM code aimed at improving its numerical efficiency, interprocessor communication cost, load balance and cache efficiency are discussed. The impact of some of the optimization strategies on the performance of the AGCM code as implemented on several state-of-the-art parallel computers, including the Intel Paragon, Cray T3D and IBM SP2, is presented and analyzed.
Climate Data Assimilation On A Massively Parallel Supercomputer
Hong Q. Ding, Robert D. Ferraro, Jet Propulsion Laboratory
A climate data assimilation package, originally developed at Goddard Space Flight Center for the Cray C90, is implemented on a massively parallel distributed memory computer. The production code solves a Kalman-filter-like problem involving a large sparse matrix (1.7x10^9 nonzeros) on an unstructured domain. The highly efficient and scalable implementation reduces the entire problem solution time by a factor of more than 100 from a single-CPU C90 to a 512-node Intel Paragon. The time-critical preconditioned Conjugate Gradient solver achieves a sustained 18 GFLOPS performance. A detailed performance analysis of systematic runs on up to 512 nodes on the Paragon will be discussed.
Performance Analysis Using The MIPS R10000 Performance Counters
Marco Zagha, Brond Larson, Steve Turner, Marty Itzkowitz, Silicon Graphics Inc.
Tuning supercomputer application performance often requires analyzing the interaction of the application and the underlying architecture. In this paper, we describe support in the MIPS R10000 for non-intrusively monitoring a variety of processor events - support that is particularly useful for characterizing the dynamic behavior of multi-level memory hierarchies, hardware-based cache coherence and speculative execution. We first explain how performance data is collected using an integrated set of hardware mechanisms, operating system abstractions and performance tools. We then describe several examples drawn from scientific applications which illustrate how the counters and profiling tools provide information that helps developers analyze and tune applications.
Profiling A Parallel Language Based On Fine-Grained Communication
Bjoern Haake, Klaus E. Schauser, Chris Scheiman, University of California at Santa Barbara
Fine tuning the performance of large parallel programs is a very difficult task. A profiling tool can provide detailed insight into the utilization and communication of the different processors, which helps identify performance bottlenecks. In this paper we present a profiler for the fine-grained parallel programming language Split-C, which provides a simple global address space memory model. As our experience shows, it is much more challenging to profile programs that make use of efficient, low-overhead communication. We incorporated techniques which minimize profiling effects on the running program. We quantify the profiling overhead and present several Split-C applications which show that the profiler is useful in determining performance bottlenecks.
Modeling, Evaluation And Testing Of The Paradyn Instrumentation System
Abdul Waheed, Diane Rover, Michigan State University; Jeffrey K. Hollingsworth, University of Maryland
This paper presents a case study of modeling, evaluating, and testing the data collection services (called an instrumentation system) of the Paradyn parallel performance measurement tool using well-known performance evaluation and experiment design techniques. The overall objective of the study is to use modeling- and simulation-based evaluation to provide feedback to the tool developers to help them choose system configurations and task scheduling policies that can significantly reduce the data collection overheads. We develop and parameterize a resource occupancy model for the Paradyn instrumentation system (IS) for an IBM SP-2 platform. This model is parameterized with a measurement-based workload characterization and subsequently used to answer several "what if" questions regarding configuration options and two policies to schedule instrumentation system tasks: collect-and-forward (CF) and batch-and-forward (BF) policies. Simulation results indicate that the BF policy can significantly reduce the overheads. Based on this feedback, the BF policy was implemented in the Paradyn IS as an option to manage the data collection. Measurement-based testing results obtained from this enhanced version of the Paradyn IS are reported in this paper and indicate more than 60% reduction in the direct IS overheads when the BF policy is used.
An Analytical Model Of The HINT Performance Metric
Quinn O. Snell, John L. Gustafson, Ames Laboratory
The HINT Benchmark was developed to provide a broad-spectrum metric for computers and to measure performance over the full range of memory sizes and time scales. We have extended our understanding of why HINT performance curves look the way they do and can now predict the curves using an analytical model based on simple hardware specifications as input parameters.
Communication Performance Models In Prism: A Spectral Element-Fourier Parallel Navier-Stokes Solver
Constantinos Evangelinos, George Em Karniadakis, Brown University
In this paper we analyze communication patterns in the parallel three-dimensional Navier-Stokes solver Prism, and present performance results on the IBM SP2, the Cray T3D and the SGI Power Challenge XL. Prism is used for direct numerical simulation of turbulence in non-separable and multiply-connected domains. The numerical method used in the solver is based on mixed spectral element-Fourier expansions in (x-y) planes and z-direction, respectively. Each (or a group) of Fourier modes is computed on a separate processor as the linear contributions (Helmholtz solves) are completely uncoupled in the incompressible Navier-Stokes equations; coupling is obtained via the nonlinear contributions (convective terms). The transfer of data between physical and Fourier space requires a series of complete exchange operations which dominate the communication cost for a small number of processors. As the number of processors increases, global reduction and gather operations become important while complete exchange becomes more latency dominated. Predictive models for these communication operations are proposed and tested against measurements. A relatively large variation in communication timings per iteration is observed in simulations and quantified in terms of specific operations. A number of improvements are proposed that could significantly reduce the communications overhead with increasing numbers of processors, and generic predictive maps are developed for the complete exchange operation, which remains the fundamental communication in Prism. Results presented in this paper are representative of a wider class of parallel spectral and finite element codes for computational mechanics which require similar communication operations.
The C3I Parallel Benchmark Suite - Introduction And Preliminary Results
Richard C. Metzger, USAF Rome Laboratory; Brian VanVoorst, Luiz S. Pires, Rakesh Jha, Wing Au, Minesh Amin, Honeywell Technology Center; David A. Castanon, ALPHATECH, Inc.; Vipin Kumar, University of Minnesota
Current parallel benchmarks, while being appropriate for scientific applications, lack the defense relevance and representativeness for developers who are considering parallel computers for their Command, Control, Communication and Intelligence (C3I) systems. We present a new set of compact application benchmarks which are specific to the C3I application domain. The C3I Parallel Benchmark Suite (C3IPBS) program is addressing the evaluation of not only machine performance, but also the software implementation effort. Our methodology currently draws heavily from the PARKBENCH and NAS Parallel Benchmarks. The paper presents the benchmarking methodology, introduces the benchmarks and reports initial results and analysis. Finally, we describe the lessons that we have learned so far from formulating and implementing the C3I benchmarks.
Architecture And Application: Performance Of The NEC SX-4 On The NCAR Benchmark Suite
Steven W. Hammond, Richard D. Loft, National Center for Atmospheric Research; Philip D.Tannenbaum, HNSX Supercomputers, Inc.
In November 1994, the NEC Corporation announced the SX-4 supercomputer. It is the third in the SX series of supercomputers and is upward compatible with the SX-3R vector processor, with enhancements for scalar processing, short vector processing and parallel processing. In this paper we describe the architecture of the SX-4, which has an 8.0 ns clock cycle and a peak performance of 2 Gflops per processor. We also describe the composition of the NCAR Benchmark Suite, designed to evaluate computers for use on climate modeling applications. Additionally, we contrast this benchmark suite with other benchmarks. Finally, we detail the scalability and performance of the SX-4/32 relative to the NCAR Benchmark Suite.
Minimal Adaptive Routing With Limited Injection On Toroidal K-Ary N-Cubes
Fabrizio Petrini, Marco Vanneschi, Dipartimento di Informatica, Universita di Pisa
Virtual channels can be used to implement deadlock-free adaptive routing algorithms and increase network throughput. Unfortunately, they introduce asymmetries in the use of buffers of symmetric networks such as toroidal k-ary n-cubes. In this paper we present a minimal adaptive routing algorithm that tries to balance the use of the virtual channels by limiting the injection of new packets into the network. The experimental results, conducted on a 256-node torus, show that it is possible to increase the saturation point and to keep the network throughput stable at high traffic rates. The comparison with the Chaos router, a non-minimal cut-through adaptive routing algorithm, shows that our algorithm obtains similar performance results using only a small fraction of the buffers and a simpler router model.
Low-Latency Communication On The IBM RISC System/6000 SP
Chi-Chao Chang, Grzegorz Czajkowski, Chris Hawblitzel, Thorsten von Eicken, Cornell University
The IBM SP is one of the most powerful commercial MPPs. Yet, in spite of its fast processors and high network bandwidth, the SP's communication latency is inferior to older machines such as the TMC CM-5 or Meiko CS-2. This paper investigates the use of Active Messages (AM) communication primitives as an alternative to the standard message passing in order to reduce communication overheads and offer a good building block for higher layers of software. The first part of this paper describes an implementation of Active Messages (SP AM) which is layered directly on top of the SP's network adapter (TB2). With comparable bandwidth, SP AM's low overhead yields a round-trip latency that is 40% lower than IBM MPL's. The second part of the paper demonstrates the power of AM as a communication substrate by layering Split-C, as well as MPI, over it. Split-C benchmarks are used to compare the SP to other MPPs and show that low message overhead and high throughput compensate for the SP's high network latency. The MPI implementation is based on the freely available MPICH version and achieves performance equivalent to IBM's MPI-F on the NAS benchmarks.
Compiled Communication For All-Optical TDM Networks
Xin Yuan, Rami Melhem, Rajiv Gupta, The University of Pittsburgh
While all-optical networks offer large bandwidth for transferring data, the control mechanisms to dynamically establish all-optical paths incur large overhead. In this paper, we consider adapting all-optical multiplexed networks to multiprocessor or multicomputer environments by using compiled communication as an alternative to dynamic network control. Compiled communication eliminates the runtime overhead by managing network resources statically. Thus, it can employ complex off-line algorithms to improve resource utilization. We studied several off-line connection scheduling algorithms for minimizing the multiplexing degree required to satisfy communication requests. The performance of compiled communication is evaluated and compared with that of dynamically controlled communication for static communications in a number of application programs. Our results show that compiled communication outperforms dynamic communication by a large margin. Since most communication patterns in scientific applications are static, we conclude that compiled communication is an effective mechanism for all-optical networks in multiprocessor environments.
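The flavor of off-line connection scheduling can be sketched with a greedy heuristic. This is our own illustration, not one of the paper's algorithms: two requests that share a source or destination node cannot occupy the same time slot, so assigning each request the lowest slot free at both endpoints yields an upper bound on the multiplexing degree needed.

```python
# Greedy slot assignment for static (src, dst) communication requests:
# requests sharing an endpoint conflict, so they get different slots.

def schedule(requests):
    """Assign each (src, dst) request the lowest slot free at both endpoints."""
    slots = []
    busy = {}                       # node -> set of slots already used there
    for src, dst in requests:
        s = 0
        while s in busy.get(src, set()) or s in busy.get(dst, set()):
            s += 1
        busy.setdefault(src, set()).add(s)
        busy.setdefault(dst, set()).add(s)
        slots.append(s)
    return slots
```

The multiplexing degree for the schedule is `max(slots) + 1`; because the request set is known at compile time, more expensive optimal or near-optimal coloring algorithms can also be applied, which is the freedom compiled communication exploits.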
Increasing The Effective Bandwidth Of Complex Memory Systems In Multivector Processors
Anna M. del Corral, Jose M. Llaberia, Universitat Politecnica de Catalunya
In multivector processors, the cycles lost due to memory interferences between concurrent vector streams make the effective throughput lower than the peak throughput. Using the classical order, the vector stream references the memory modules using a temporal distribution that depends on the access patterns. In general, different access patterns determine different temporal distributions. These different temporal distributions could imply the presence of memory module conflicts even if the request rate of all the concurrent vector streams to every memory module is less than or equal to their service rate. In addition, in a memory system where several memory modules are connected to each bus (complex memory system), bus conflicts are added to the memory module conflicts. This paper proposes an access order, different from the classical order, to reference the vector stream elements. The proposed order imposes a temporal distribution to reference the memory modules that reduces the average memory access time in vector processors with complex memory systems. When the request rate of all the vector streams to every memory module is greater than the service rate, the proposed order reduces the number of lost cycles, and the effective throughput increases. Under other conditions, the effective throughput reaches the peak throughput.
Global Load Balancing With Parallel Mesh Adaption On Distributed-Memory Systems
Rupak Biswas, NASA Ames Research Center; Leonid Oliker, Research Institute for Advanced Computer Science; Andrew Sohn, New Jersey Institute of Technology
Dynamic mesh adaption on unstructured grids is a powerful tool for efficiently computing unsteady problems to resolve solution features of interest. Unfortunately, this causes load imbalance among processors on a parallel machine. This paper describes the parallel implementation of a tetrahedral mesh adaption scheme and a new global load balancing method. A heuristic remapping algorithm is presented that assigns partitions to processors such that the redistribution cost is minimized. Results indicate that the parallel performance of the mesh adaption code depends on the nature of the adaption region and show a 35.5X speedup on 64 processors when 35% of the mesh is randomly adapted. For large-scale scientific computations, our load balancing strategy gives almost a sixfold reduction in solver execution times over non-balanced loads. Furthermore, our heuristic remapper yields processor assignments that are less than 3% off the optimal solutions but requires only 1% of the computational time.
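A greedy remapper in the spirit of the approach described above can be sketched in a few lines. This is illustrative only; the paper's own heuristic may differ. Given the amount of each new partition's data already resident on each processor, taking the largest overlaps first and assigning greedily keeps most data in place, minimizing redistribution cost.

```python
# Greedy partition-to-processor remapping: keep the biggest overlaps in place.

def remap(overlap):
    """overlap[p][q] = amount of partition p's data already on processor q.

    Returns a one-to-one partition -> processor assignment (square matrix).
    """
    n = len(overlap)
    pairs = sorted(((overlap[p][q], p, q)
                    for p in range(n) for q in range(n)), reverse=True)
    assign, used = {}, set()
    for amount, p, q in pairs:
        if p not in assign and q not in used:
            assign[p] = q
            used.add(q)
    return assign
```

Exact minimization of redistribution cost is an assignment problem solvable optimally (e.g., by the Hungarian algorithm) in O(n^3); the appeal of a heuristic like this is the drastically lower cost at a small loss in quality, matching the paper's reported trade-off.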
Parallel Hierarchical Solvers And Preconditioners For Boundary Element Methods
Ananth Grama, Vipin Kumar, Ahmed Sameh, University of Minnesota
The method of moments is an important tool for solving boundary integral equations arising in a variety of applications. It transforms the physical problem into a dense linear system. Due to the large number of variables and the associated computational requirements, these systems are solved iteratively using methods such as GMRES, CG and its variants. The core operation of these iterative solvers is the application of the system matrix to a vector. This requires O(n^2) operations and memory using accurate dense methods. The computational complexity can be reduced to O(n log n) and the memory requirement to O(n) using hierarchical approximation techniques. The algorithmic speedup from approximation can be combined with parallelism to yield very fast dense solvers. In this paper, we present efficient parallel formulations of dense iterative solvers based on hierarchical approximations for solving the integral form of the Laplace equation. We study the impact of various parameters on the accuracy and performance of the parallel solver. We present two preconditioning techniques for accelerating the convergence of the iterative solver. These techniques are based on an inner-outer scheme and a block diagonal scheme based on a truncated Green's function. We present detailed experimental results on up to 256 processors of a Cray T3D.
Parallel Multilevel K-Way Partitioning Scheme For Irregular Graphs
George Karypis, Vipin Kumar, University of Minnesota
In this paper we present a parallel formulation of a multilevel k-way graph partitioning algorithm. The multilevel k-way partitioning algorithm reduces the size of the graph by collapsing vertices and edges (coarsening phase), finds a k-way partition of the smaller graph, and then constructs a k-way partition for the original graph by projecting and refining the partition to successively finer graphs (uncoarsening phase). A key innovative feature of our parallel formulation is that it utilizes graph coloring to effectively parallelize both the coarsening and the refinement during the uncoarsening phase. Our algorithm is able to achieve a high degree of concurrency, while maintaining the high quality partitions produced by the serial algorithm. We test our scheme on a large number of graphs from finite element methods and transportation domains. Our parallel formulation on the Cray T3D produces high quality 128-way partitions on 128 processors in a little over two seconds, for graphs with a million vertices. Thus, our parallel algorithm makes it possible to perform dynamic graph partitioning in adaptive computations without compromising quality.
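The coarsening phase described above can be illustrated with generic multilevel machinery (this sketch is serial and is not the paper's parallel, coloring-based formulation): a heavy-edge matching pairs up vertices joined by heavy edges, and each matched pair collapses into one coarse vertex, roughly halving the graph per level.

```python
# One level of heavy-edge-matching coarsening for a weighted graph.
# edges: {(u, v): weight} with u < v; n: number of vertices.

def coarsen(edges, n):
    matched = {}
    # Visit edges heaviest first; match both endpoints if both are free.
    for (u, v), w in sorted(edges.items(), key=lambda e: -e[1]):
        if u not in matched and v not in matched:
            matched[u] = v
            matched[v] = u
    # Map each vertex (or matched pair) to a coarse vertex id.
    coarse_id, next_id = {}, 0
    for u in range(n):
        if u not in coarse_id:
            coarse_id[u] = next_id
            if u in matched:
                coarse_id[matched[u]] = next_id
            next_id += 1
    # Accumulate edge weights between distinct coarse vertices.
    coarse_edges = {}
    for (u, v), w in edges.items():
        cu, cv = coarse_id[u], coarse_id[v]
        if cu != cv:
            key = (min(cu, cv), max(cu, cv))
            coarse_edges[key] = coarse_edges.get(key, 0) + w
    return coarse_edges, next_id
```

Preferring heavy edges hides as much edge weight as possible inside coarse vertices, so a good partition of the coarse graph projects back to a low-cut partition of the original. The matching step is exactly where the paper's graph-coloring technique comes in: vertices of the same color can be matched concurrently without conflicts.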
Double Standards: Bringing Task Parallelism To HPF Via The Message Passing Interface
Ian Foster, David R. Kohr, Jr., Argonne National Laboratory; Rakesh Krishnaiyer, Alok Choudhary, Syracuse University
High Performance Fortran (HPF) does not allow efficient expression of mixed task/data-parallel computations or the coupling of separately compiled data-parallel modules. In this paper, we show how a coordination library implementing the Message Passing Interface (MPI) can be used to represent these common parallel program structures. This library allows data-parallel tasks to exchange distributed data structures using calls to simple communication functions. We present microbenchmark results that characterize the performance of this library and that quantify the impact of optimizations that allow reuse of communication schedules in common situations. In addition, results from two-dimensional FFT, convolution and multiblock programs demonstrate that the HPF/MPI library can provide performance superior to that of pure HPF. We conclude that this synergistic combination of two parallel programming standards represents a useful approach to task parallelism in a data-parallel framework, increasing the range of problems addressable in HPF without requiring complex compiler technology.
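The communication-schedule reuse that the microbenchmarks quantify can be sketched abstractly: the index ranges that must move between a source and a destination block distribution are computed once and then replayed for every transfer. A hypothetical Python rendering (not the HPF/MPI library's API):

```python
def block_owner(i, n, p):
    """Owner rank of global index i when n items are block-distributed
    over p ranks (ceiling-sized chunks)."""
    chunk = -(-n // p)
    return i // chunk

def build_schedule(n, p_src, p_dst):
    """Compute once which contiguous index ranges move between which
    ranks; each entry is (src_rank, dst_rank, start, length)."""
    sched, start = [], 0
    while start < n:
        s, d = block_owner(start, n, p_src), block_owner(start, n, p_dst)
        end = start
        while (end < n and block_owner(end, n, p_src) == s
               and block_owner(end, n, p_dst) == d):
            end += 1
        sched.append((s, d, start, end - start))
        start = end
    return sched
```

A coupled pair of data-parallel tasks would build this schedule at first contact and reuse it on every subsequent exchange, which is the optimization whose impact the paper measures.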
OMPI: Optimizing MPI Programs Using Partial Evaluation
Hirotaka Ogawa, Satoshi Matsuoka, The University of Tokyo
MPI is gaining acceptance as a standard for message-passing in high-performance computing, due to its powerful and flexible support of various communication styles. However, the complexity of its API imposes significant software overhead, and as a result, the applicability of MPI has been restricted to rather regular, coarse-grained computations. Our OMPI (Optimizing MPI) system removes much of the excess overhead by employing partial evaluation techniques, which exploit static information about MPI calls. Because partial evaluation alone is insufficient, we also utilize template functions for further optimization. To validate the effectiveness of our OMPI system, we performed baseline as well as more extensive benchmarks on a set of application cores with different communication characteristics, on the 64-node Fujitsu AP1000 MPP. Benchmarks show that OMPI improves execution efficiency by as much as a factor of two for communication-intensive application cores with minimal code increase. It also performs significantly better than previous dynamic optimization techniques.
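The flavor of the partial-evaluation optimization can be conveyed with a toy Python sketch: when a call site's destination and tag are statically known, the per-call header construction can be hoisted out of the critical path. The names and the message format here are invented for illustration and are not OMPI's.

```python
def make_send(transport):
    """Generic path: build the full message header on every call."""
    def send(dest, tag, payload):
        header = ("MSG", dest, tag, len(payload))
        transport.append((header, payload))
    return send

def specialize_send(transport, dest, tag):
    """'Partially evaluated' path: dest and tag are static at this call
    site, so the header prefix is computed once, outside the hot loop."""
    prefix = ("MSG", dest, tag)
    def send(payload):
        transport.append((prefix + (len(payload),), payload))
    return send
```

The specialized function produces byte-for-byte the same messages while doing strictly less work per call, which is the essence of exploiting static information at MPI call sites.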
Particle-In-Cell Simulation Codes In High Performance Fortran
Erol Akarsu, Department of Electrical Engineering and Computer Science; Kivanc Dincer, Tomasz Haupt, Geoffrey C. Fox, Northeast Parallel Architectures Center
Particle-in-Cell (PIC) plasma simulation codes model the interaction of charged particles with the surrounding electrostatic and magnetic fields. The computational requirements of PIC are classified among the grand-challenge problems facing the high-performance computing community. In this paper we present the implementation of 1D and 2D electrostatic PIC codes in High Performance Fortran (HPF) on an IBM SP-2. We used one of the most successful commercial HPF compilers currently available and augmented the compiler's missing HPF functions with extrinsic routines when necessary. We obtained near-linear speedup in execution time and performance comparable to that of native message-passing implementations on the same platform.
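For readers unfamiliar with PIC, the charge-deposition step at its core can be sketched in a few lines of Python, here with cloud-in-cell (linear) weighting on a periodic 1D grid. This is a generic textbook fragment, not the paper's HPF code.

```python
def deposit_charge(positions, q, ngrid, dx):
    """Cloud-in-cell deposition: each particle of charge q splits its
    charge between the two nearest cells of a periodic 1D grid."""
    rho = [0.0] * ngrid
    for x in positions:
        s = x / dx                    # position in grid units
        i = int(s) % ngrid
        frac = s - int(s)             # fractional offset within the cell
        rho[i] += q * (1.0 - frac) / dx
        rho[(i + 1) % ngrid] += q * frac / dx
    return rho
```

A full PIC step follows deposition with a field solve on the grid and a particle push; the scatter above is the part whose irregular memory access makes HPF expression non-trivial.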
Application-Level Scheduling On Distributed Heterogeneous Networks
Francine D. Berman, Rich Wolski, Silvia Figueira, Jennifer Schopf, Gary Shao, University of California, San Diego
Heterogeneous networks are increasingly being used as platforms for resource-intensive distributed parallel applications. A critical contributor to the performance of such applications is the scheduling of constituent application tasks on the network. Since often the distributed resources cannot be brought under the control of a single global scheduler, the application must be scheduled by the user. To obtain the best performance, the user must take into account both application-specific and dynamic system information in developing a schedule which meets his or her performance criteria. In this paper, we define a set of principles underlying application-level scheduling and describe our work-in-progress building AppLeS (application-level scheduling) agents. We illustrate the AppLeS approach with a detailed description and results for a distributed 2D Jacobi application on two production heterogeneous platforms.
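One simple instance of folding dynamic system information into an application-level schedule is dividing work in proportion to each machine's effective speed, so that all machines finish together. A hypothetical Python sketch (the real agents are far more elaborate; the resource table here is an invented stand-in for dynamically measured capacities):

```python
def partition_work(total_units, resources):
    """Split work so every machine finishes at (roughly) the same time.
    resources: {name: effective_speed}, where effective speed already
    folds in dynamic load information from monitoring."""
    speed_sum = sum(resources.values())
    return {name: total_units * s / speed_sum
            for name, s in resources.items()}
```

With this split, each machine's finish time is `total_units / speed_sum`, independent of which machine it is, which is the equal-finish-time criterion a user-level scheduler typically aims for.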
NetSolve: A Network Server For Solving Computational Science Problems
Henri Casanova, Jack Dongarra, University of Tennessee, Knoxville
This paper presents a new system, called NetSolve, that allows users to access computational resources, such as hardware and software, distributed across the network. The development of NetSolve was motivated by the need for an easy-to-use, efficient mechanism for using computational resources remotely. Ease of use is obtained as a result of different interfaces, some of which require no programming effort from the user. Good performance is ensured by a load-balancing policy that enables NetSolve to use the computational resources available as efficiently as possible. NetSolve offers the ability to look for computational resources on a network, choose the best one available, solve a problem (with retry for fault-tolerance) and return the answer to the user.
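The selection-with-retry behavior described above can be sketched in Python. The server records with a `load` field and a `call` function are hypothetical; NetSolve's actual agent protocol is more involved.

```python
def solve(servers, problem):
    """Try servers from most to least lightly loaded; on failure,
    fall through to the next candidate (retry for fault tolerance)."""
    for _name, srv in sorted(servers.items(), key=lambda kv: kv[1]["load"]):
        try:
            return srv["call"](problem)
        except RuntimeError:
            continue          # this server failed; try the next best
    raise RuntimeError("no server could solve the problem")
```

The load-ordered iteration stands in for the load-balancing policy, and the exception handler stands in for the retry-on-failure path mentioned in the abstract.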
Multimethod Communication For High-Performance Metacomputing
Ian Foster, Jonathan Geisler, Steven Tuecke, Argonne National Laboratory; Carl Kesselman, California Institute of Technology
Metacomputing systems use high-speed networks to connect supercomputers, mass storage systems, scientific instruments and display devices with the objective of enabling parallel applications to utilize geographically distributed computing resources. However, experience shows that high performance can often be achieved only if applications can integrate diverse communication substrates, transport mechanisms and protocols, chosen according to where communication is directed, what is communicated or when communication is performed. In this paper, we describe a software architecture that addresses this requirement. This architecture allows multiple communication methods to be supported transparently in a single application, with either automatic or user-specified selection criteria guiding the methods used for each communication. We describe an implementation of this architecture, based on the Nexus communication library, and use this implementation to evaluate performance issues. This implementation was used to support a wide variety of applications in the I-WAY metacomputing experiment at Supercomputing'95; we use one of these applications to provide a quantitative demonstration of the advantages of multimethod communication in a heterogeneous networked environment.
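Per-operation method selection might look like the following toy Python dispatch, where the rules consider where the message goes and how large it is. The rules and names are illustrative assumptions, not the Nexus API.

```python
def select_method(dst_host, local_host, nbytes, methods):
    """Choose a communication method from simple selection rules:
    same host -> shared memory; large message -> a bulk transport;
    otherwise the default transport."""
    if dst_host == local_host and "shmem" in methods:
        return "shmem"
    if nbytes >= 1 << 20 and "bulk" in methods:
        return "bulk"         # e.g., a striped transfer for >= 1 MiB
    return "default"
```

In the architecture the paper describes, such rules can be supplied by the user or applied automatically, so a single application transparently mixes methods per destination and per message.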
Building A World-Wide Virtual Machine Based On Web And HPCC Technologies
Kivanc Dincer, Northeast Parallel Architectures Center; Geoffrey C. Fox, Northeast Parallel Architectures Center
In this paper we report on a parallel/distributed virtual machine prototype called the World-Wide Virtual Machine (WWVM) that is designed to attack grand-challenge problems beyond the capabilities of a single supercomputer. The prototype is based on emerging Web and existing HPCC technologies and utilizes the pool of Web servers on the Internet as a flexible, convenient and inexpensive metacomputing resource. The World Wide Web supplies a standard, open interface to the world regardless of machine type. Through this interface, Web servers can be used either as computation nodes or as coordinators managing other connected nodes of the virtual machine. We have constructed a Web-based parallel/distributed programming environment for solving metaproblems consisting of MPI and PVM message-passing programs and High Performance Fortran programs.
Parallel Data Mining For Association Rules On Shared-Memory Multiprocessors
M. J. Zaki, M. Ogihara, S. Parthasarathy, W. Li, University of Rochester
Data mining is an emerging research area whose goal is to extract significant patterns or interesting rules from large databases. High-level inference from large volumes of routine business data can provide valuable information to businesses, such as customer buying patterns, shelving criteria in supermarkets and stock trends. Many algorithms have been proposed for data mining of association rules. However, research so far has mainly focused on sequential algorithms. In this paper we present parallel algorithms for data mining of association rules and study the degree of parallelism, synchronization and data locality issues on the SGI Power Challenge shared-memory multiprocessor. We further present a set of optimizations for the sequential and parallel algorithms. Experiments show that a significant improvement in performance is achieved using our proposed optimizations. We also achieved good speedup for the parallel algorithm, but we observe a need for parallel I/O techniques for further performance gains.
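For context, the level-wise (Apriori-style) search that association-rule miners typically parallelize can be sketched in sequential Python. This is a generic rendering of the classic algorithm, not the paper's parallel variants.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: count candidates of size k, keep those meeting
    min_support, and join survivors to form size-k+1 candidates."""
    items = sorted({i for t in transactions for i in t})
    freq, k = {}, 1
    current = [frozenset([i]) for i in items]
    while current:
        counts = {c: 0 for c in current}
        for t in transactions:
            ts = set(t)
            for c in current:
                if c <= ts:
                    counts[c] += 1
        survivors = [c for c in current if counts[c] >= min_support]
        for c in survivors:
            freq[c] = counts[c]
        k += 1
        # join step, then prune candidates with an infrequent subset
        cand = {a | b for a in survivors for b in survivors if len(a | b) == k}
        current = [c for c in cand
                   if all(frozenset(s) in freq for s in combinations(c, k - 1))]
    return freq
```

The repeated pass over the transactions in the counting step is the I/O- and compute-intensive part, which is why the paper emphasizes parallelism, data locality and, ultimately, parallel I/O.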
Dynamic Computation Migration In DSM Systems
Wilson C. Hsieh, University of Washington; M. Frans Kaashoek, MIT Laboratory for Computer Science; William E. Weihl, DEC Systems Research Center
Dynamic computation migration is the runtime choice between computation and data migration. Dynamic computation migration speeds up access to concurrent data structures with unpredictable read/write patterns. This paper describes the design, implementation and evaluation of dynamic computation migration in a multithreaded distributed shared-memory system, MCRL. Two policies are studied, STATIC and REPEAT. Both migrate computation for writes. STATIC migrates data for reads, while REPEAT maintains a limited history of accesses and sometimes migrates computation for reads. On a concurrent, distributed B-tree with 50% lookups and 50% inserts, STATIC improves performance by about 17% on both Alewife and the CM-5. REPEAT generally performs better than STATIC. With 80% lookups and 20% inserts, REPEAT improves performance by 23% on Alewife and by 46% on the CM-5.
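A hypothetical rendering of a history-based policy in the spirit of REPEAT follows; the actual MCRL policy details differ, and the two-entry history threshold here is an assumption made for illustration.

```python
class RepeatPolicy:
    """Writes always migrate computation (as both policies do). Reads
    migrate data by default, but switch to migrating computation once a
    region's recent reads have repeatedly needed remote fetches."""

    def __init__(self, history=2):
        self.history = history
        self.misses = {}                 # region -> consecutive remote fetches

    def decide(self, region, op):
        if op == "write":
            return "migrate_computation"
        if self.misses.get(region, 0) >= self.history:
            return "migrate_computation"  # reads keep missing: go to the data
        return "migrate_data"

    def record_remote_fetch(self, region):
        self.misses[region] = self.misses.get(region, 0) + 1

    def record_local_hit(self, region):
        self.misses[region] = 0
```

The point of such a policy is exactly the trade-off the measurements explore: for unpredictable read/write mixes, a little per-region history lets reads avoid repeatedly shipping data that is about to be invalidated.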
Performance Modeling For The Panda Array I/O Library
Ying Chen, Marianne Winslett, Szu-wen Kuo, Yong Cho, University of Illinois; Mahesh Subramaniam, Oracle Corporation; Kent Seamons, Transarc Corporation
We present an analytical performance model for Panda, a library for synchronized I/O of large multidimensional arrays on parallel and sequential platforms, and show how the Panda developers use this model to evaluate Panda's parallel I/O performance and guide future Panda development. The model validation shows that system developers can simplify performance analysis, identify potential performance bottlenecks, and study the design trade-offs for Panda on massively parallel platforms more easily than by conducting empirical experiments. More importantly, we show that the outputs of the performance model can be used to help make optimal plans for handling application I/O requests, the first step toward our long-term goal of automatically optimizing I/O request handling in Panda.
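In its simplest form, an analytical I/O model of this kind reduces to counting requests and bytes per I/O node. A toy Python sketch follows; the parameters are illustrative, and Panda's model is considerably more detailed.

```python
def predicted_io_time(array_bytes, io_nodes, chunk_bytes,
                      t_startup, bandwidth):
    """Toy collective-I/O model: the array is split evenly over the I/O
    nodes, each node issues ceil(share / chunk) requests, and each
    request pays a fixed startup cost plus transfer time."""
    share = array_bytes / io_nodes          # bytes handled per I/O node
    n_requests = -(-share // chunk_bytes)   # ceiling division
    return n_requests * t_startup + share / bandwidth
```

Even this crude model exposes the trade-off a planner must weigh: larger chunks amortize startup cost, while more I/O nodes shrink each node's share, which is the kind of question the real model answers when planning how to handle an application's I/O request.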
Striping In Disk Array RM2 Enabling The Tolerance Of Double Disk Failures
Chan-Ik Park, Tae-Young Choe, POSTECH
There is a growing demand for reliability beyond what current RAID can provide, and there are various levels of user demand for data reliability. An efficient data placement scheme called RM2 has been proposed in Park95, which makes a disk array system tolerant of double disk failures. In this paper, we consider how to choose an optimal striping unit for RM2, particularly when no workload information is available except the read/write ratio. A disk array simulator for RM2 has been developed for experimental work. It is shown that RM2 has an optimal striping unit of two and a half tracks in the case of disk read operations, and one third of a single track if any disk write operations are involved.
Lightweight Computational Steering Of Very Large Scale Molecular Dynamics Simulations
David M. Beazley, University of Utah; Peter S. Lomdahl, Los Alamos National Laboratory
We present a computational steering approach for controlling, analyzing and visualizing very large scale molecular dynamics simulations involving tens to hundreds of millions of atoms. Our approach relies on extensible scripting languages and an easy-to-use tool for building extensions and modules. The system is easy to modify, works with existing C code, is memory-efficient and can be used from inexpensive workstations over standard Internet connections. We demonstrate how we have been able to explore data from production MD simulations involving as many as 104 million atoms running on the CM-5 and Cray T3D. We also show how this approach can be used to integrate common scripting languages (including Python, Tcl/Tk and Perl), simulation code, user extensions and commercial data analysis packages.
Design Of A Large Scale Discrete Element Soil Model For High Performance Computing Systems
Alex R. Carrillo, David A. Horner, John F. Peters, John E. West, U.S. Army Engineer Waterways Experiment Station
The Discrete Element Model (DEM) is an alternative to the classical continuum mechanics approach for problems with flow-like deformations of solids. DEM is especially well-suited for problems such as soil plowing, where the medium is subject to large discontinuous deformations. The goal of this project was to develop a large-scale three-dimensional DEM soil modeling system capable of handling problems with potentially several million particles. This paper discusses the development of that model and the factors driving it toward a high performance computing (HPC) solution. It also discusses how work in progress is making use of heterogeneous HPC environments and high speed networks to produce a real-time user interaction and visualization capability.
Molecular Simulation Of Rheological Properties Using Massively Parallel Supercomputers
R. K. Bhupathiraju, S. T. Cui, S. Gupta, University of Tennessee; H. D. Cochran, Oak Ridge National Laboratory; P. T. Cummings, University of Tennessee
Advances in parallel supercomputing now make possible molecular-based engineering and science calculations that will soon revolutionize many technologies, such as those involving polymers and those involving aqueous electrolytes. We have developed a suite of message-passing codes for classical molecular simulation of such complex fluids and amorphous materials and have completed a number of demonstration calculations of problems of scientific and technological importance with each (described at the World Wide Web site http://flory.engr.utk.edu/ldrd). In this paper, we will focus on the molecular simulation of rheological properties, particularly viscosity, of simple and complex fluids using parallel implementations of non-equilibrium molecular dynamics. Such calculations represent significant challenges computationally because, in order to reduce the thermal noise in the calculated properties within acceptable limits, large systems and/or long simulated times are required.
Virtual Memory Versus File Interfaces For Large, Memory-Intensive Scientific Applications
Yoonho Park, Ridgway Scott, University of Houston; Stuart Sechrest, University of Michigan
Scientific applications often require some strategy for temporary data storage to do the largest possible simulations. The use of virtual memory for temporary data storage has been criticized because of performance problems. However, the virtual memory systems found in recent operating systems, such as that of the Cenju-3/DE, give application writers control over virtual memory policies. We demonstrate that custom virtual memory policies can dramatically reduce virtual memory overhead and allow applications to run out-of-core efficiently. We also demonstrate that the main advantage of virtual memory, namely programming simplicity, is not lost.
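Why an application-chosen policy can beat the default is easy to see with a toy paging simulator: on a cyclic scan larger than memory, LRU replacement faults on every reference, while an MRU policy retains most of the working set. The simulator, policies and workload below are illustrative, plain Python, not the paper's system.

```python
def simulate(refs, nframes, evict):
    """Count page faults for a reference string under a pluggable
    replacement policy. frames is kept in recency order: index 0 is the
    least recently used page, the last index the most recently used."""
    frames, faults = [], 0
    for page in refs:
        if page in frames:
            frames.remove(page)
            frames.append(page)        # refresh recency on a hit
            continue
        faults += 1
        if len(frames) == nframes:
            frames.pop(evict(frames))  # the policy chooses the victim
        frames.append(page)
    return faults

def lru(frames):
    return 0                   # default policy: evict least recently used

def mru(frames):
    return len(frames) - 1     # custom policy: evict most recently used
```

On `list(range(5)) * 3` with 3 frames, LRU faults on all 15 references while MRU does not, which is the kind of workload-specific win that motivates exposing policy control to the application.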
Impact Of Job Mix Optimizations For Space Sharing Schedulers
Jaspal Subhlok, Thomas Gross, Carnegie Mellon University; Takashi Suzuoka, Toshiba Corporation
Modern parallel systems with N nodes can concurrently service multiple jobs requesting up to N nodes. One of the challenges for the operating system is to give reasonable service to a diverse group of jobs. A sequence of large jobs, each requiring over half of the available nodes, can reduce the machine utilization by up to 50%, but scheduling a long running job on the idle nodes may block the stream of large jobs. Various policies have been proposed for scheduling parallel computers, but as the users of current supercomputers know, these policies are far from perfect. This paper reports on the measurement of the usage of a 512-node IBM SP2 at Cornell Theory Center, a 96-node Intel Paragon at ETH Zurich and a 512-node Cray T3D at Pittsburgh Supercomputing Center. We discuss the characteristics of the different workloads and examine their impact on job scheduling. We specifically show how two simple scheduling optimizations based on reordering the waiting queue can be used effectively to improve scheduling performance on real workloads. Supercomputer workloads from different installations exhibit some common characteristics, but they also differ in important ways. We demonstrate how this knowledge can be exploited in the design and tuning of schedulers.
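A reordering optimization in the spirit of those the paper evaluates can be sketched as first-fit selection from the waiting queue: when the head job does not fit on the free nodes, start the first queued job that does instead of leaving the nodes idle. A Python sketch (the job record fields are ours):

```python
def schedule_step(free_nodes, queue):
    """First-fit reordering of the waiting queue: return the first job
    that fits on the currently free nodes, removing it from the queue;
    return None if nothing fits."""
    for i, job in enumerate(queue):
        if job["nodes"] <= free_nodes:
            return queue.pop(i)
    return None                # nothing fits; nodes stay idle this step
```

This captures the tension described in the abstract: letting small jobs jump ahead raises utilization, but an unbounded version of this rule can starve the stream of large jobs, which is why the paper studies such reorderings against real workload traces.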