Roberto Ansaloni (1), Anne Molcard(2)
(1) Cray Research S.r.l., (2) Institute for the Study of Geophysical Environmental Methodologies (IMGA-CNR)
A new numerical model, SEOM (Spectral Element Ocean Model), is used to study the general circulation of the Mediterranean Sea. Spectral element methods combine the geometric flexibility of finite element techniques with the rapid convergence rate of spectral schemes. The current version solves the shallow water equations. The domain decomposition philosophy allows the power of parallel machines to be exploited, owing to the large inter-element computational complexity. The original MIMD master/slave version of SEOM, written in Fortran 90 and PVM, has been ported to the Cray T3D. Where critical for performance, Cray-specific high-performance one-sided communication routines (SHMEM) have been adopted to fully exploit the Cray T3D interprocessor network. Tests performed with highly unstructured and irregular grids, on up to 128 processors, show almost linear scalability even with unoptimized domain decomposition techniques. Results from one-year simulations of the Mediterranean Sea are shown for realistic bottom and coastline geometry.
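Where performance was critical, the port replaced two-sided message exchanges with Cray one-sided SHMEM puts. As a rough illustration of the idea only, the sketch below is written against the portable OpenSHMEM C interface rather than the original Cray T3D calls, and the buffer names and sizes are invented for the example.

    // Sketch of a one-sided halo exchange in the OpenSHMEM style; the original
    // code used the Cray T3D SHMEM calls, and all names/sizes here are invented.
    #include <shmem.h>
    #include <stdio.h>

    int main(void) {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        const int NHALO = 64;                    /* hypothetical halo size */
        /* Symmetric buffer: every PE allocates the same remotely accessible array. */
        double *halo = (double *) shmem_malloc(NHALO * sizeof(double));
        double local[64];
        for (int i = 0; i < NHALO; ++i) local[i] = me + 0.001 * i;

        /* Push boundary data directly into the right neighbour's memory,
           with no matching receive posted on the remote side. */
        int right = (me + 1) % npes;
        shmem_double_put(halo, local, NHALO, right);

        /* Make all puts globally visible before anyone reads its halo. */
        shmem_barrier_all();

        printf("PE %d got halo[0] = %f from PE %d\n",
               me, halo[0], (me - 1 + npes) % npes);

        shmem_free(halo);
        shmem_finalize();
        return 0;
    }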
Roberto Ansaloni (Cray Research)
The program solves the three-dimensional, time-dependent, thin-layer Navier-Stokes equations on a structured grid. A typical use is modeling an aircraft wing. The code includes a serial version and two message passing versions, one written in PVM and the other in MPI. An overview of the program developed to date by Dr. Vatsa is documented on the web for the IBM SP2 and for a heterogeneous cluster of workstations:
http://hpccp-www.larc.nasa.gov:80/~dana/t1.html
This poster describes a third message passing implementation using
SHMEM in a homogeneous environment on the Cray J932 and Cray T3E. Inter-processor
communication and its impact on load balance are presented for both machines.
Both the parallel-vector and message-passing implementations of the code
are compared on the J932 in dedicated and production environments. Plots
of performance vs machine size, speedup, and solution time are presented
for both the J932 and the T3E for various problem sizes.
Mitsuhisa Sato
Real World Computing Partnership
Hidemoto Nakada
Electrotechnical Laboratory
Satoshi Sekiguchi
Electrotechnical Laboratory
Satoshi Matsuoka
The University of Tokyo
Umpei Nagashima
Ochanomizu University
Hiromitsu Takagi
Nagoya Institute of Technology
Ninf is an ongoing global network-wide computing infrastructure project which allows users to access computational resources, including hardware, software, and scientific data distributed across a wide area network, with an easy-to-use interface. Ninf is intended not only to exploit high performance in network parallel computing, but also to provide high quality numerical computation services and access to scientific databases published by other researchers. Computational resources are shared as Ninf remote libraries executable on a remote Ninf server. Users can build an application by calling the libraries with the Ninf Remote Procedure Call, which is designed to provide a programming interface similar to conventional function calls in existing languages and is tailored for scientific computation. In order to facilitate location transparency and network-wide parallelism, the Ninf metaserver maintains global resource information regarding computational servers and databases, allocating and scheduling coarse-grained computations to achieve good global load balancing. Ninf also interfaces with existing network services such as the WWW for easy accessibility.
Mitsuhisa Sato
Robert R. Lipman, Judith E. Devaney
National Institute of Standards and Technology
Information Technology Laboratory
Gaithersburg, Maryland 20899
WebSubmit enables users to run applications via the Web. The initial goal of WebSubmit is to make it easier for users to run applications on supercomputers. This is accomplished by creating a web page interface to the application on the supercomputer. The first implementation of WebSubmit is for running Gaussian 94, a computational chemistry program, on an IBM SP2. The user enters input on a web page form to submit a Gaussian 94 job to the SP2. The status of the job may be monitored from the web page, and other utility functions are also provided. Additional implementations of WebSubmit will cover using LoadLeveler on the SP2 as well as other applications and hardware platforms. All of the web pages use CGI scripts written in Tcl. For more information, see: http://www.nist.gov/itl/div887/sasg/gauss/
Robert Lipman
Pete Dean
Sandia National Laboratories
This poster exhibit presents work being done to provide a wide/local area network that will enable uniform, transparent, and efficient distributed classified and unclassified computing among the three defense program laboratories. This network, which has received accreditation for transporting classified information, uses the Energy Sciences Network (ESnet) as the wide area network, together with end-to-end encryptors and Kerberos authentication, to provide the classified services. The technical challenges stem from the levels of performance, security, and services that will be required from the network to support the ASCI effort.
Pete Dean
Lyndon Pierson
Sandia National Laboratories
This poster exhibit presents work being done to assure that super-high speed encryption can be implemented to satisfy ASCI objectives, and to assure that these high speed implementations can be made to interoperate with slower speed, less expensive encryption implementations through the rate adaptation provided by ATM "Variable Bit Rate" (VBR) services. ATM end-to-end encryption devices must maintain separate encryption contexts and keys for each encrypted virtual circuit. The first few ATM encryption prototypes have demonstrated the feasibility of this concept, and have explored some of the difficulties of key management and crypto synchronization in this key-agile environment. Following these prototypes, the first few ATM encryption products are now beginning to be marketed. Integration of innovative methods of scaling encryption speed with techniques for key management, crypto synchronization, and key agility are required to make ATM encryption viable at OC-48 (2.4 Gb/s) and higher. Development of this technology will enable national defense applications requiring the secure exchange of massive amounts of data between widely separated sites.
Lyndon Pierson
Robert L. Clay and Alan B. Williams
Sandia National Laboratories
ISIS++ is an object-oriented framework for solving sparse linear systems of equations. Though it was developed to solve systems of equations originating from large-scale, 3-D finite element analyses, it has applications in many other fields. A key feature of ISIS++ is the simple interchangeability of components, both from within the ISIS++ system and from other packages. This framework facilitates integrating components from various libraries, in particular the matrix-vector functional units and data structures. The advantages of this approach include the ability to leverage existing work in the field: the library can be built using the matrix-vector implementation best suited to the task and compute system at hand, with no changes to the solver or preconditioner source. Thus, ISIS++ is transparently portable across a wide range of computer architectures, from desktop PCs to MPP supercomputers.
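The poster does not spell out the ISIS++ class hierarchy itself; the sketch below only illustrates the general pattern it describes, namely a solver written against an abstract matrix-vector interface so that concrete implementations can be swapped without touching solver or preconditioner source. All names here are hypothetical, not the actual ISIS++ API.

    // Hypothetical sketch (not the actual ISIS++ API) of the pattern described
    // above: an iterative solver written against an abstract matrix-vector
    // interface, so concrete implementations can be swapped freely.
    #include <vector>
    #include <cstddef>
    #include <cstdio>

    struct Vector {
        std::vector<double> v;
        explicit Vector(std::size_t n) : v(n, 0.0) {}
    };

    // Abstract operator: any matrix format just has to implement apply().
    class LinearOperator {
    public:
        virtual ~LinearOperator() {}
        virtual void apply(const Vector& x, Vector& y) const = 0;   // y = A*x
    };

    // One concrete implementation: compressed sparse row storage.
    class CsrMatrix : public LinearOperator {
    public:
        std::vector<int> rowptr, col;
        std::vector<double> val;
        void apply(const Vector& x, Vector& y) const override {
            for (std::size_t i = 0; i + 1 < rowptr.size(); ++i) {
                double s = 0.0;
                for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                    s += val[k] * x.v[col[k]];
                y.v[i] = s;
            }
        }
    };

    // A Richardson-iteration "solver" that sees only the abstract interface;
    // swapping CsrMatrix for another matrix format requires no change here.
    void richardson(const LinearOperator& A, const Vector& b, Vector& x,
                    double omega, int iters) {
        Vector Ax(b.v.size());
        for (int it = 0; it < iters; ++it) {
            A.apply(x, Ax);
            for (std::size_t i = 0; i < x.v.size(); ++i)
                x.v[i] += omega * (b.v[i] - Ax.v[i]);
        }
    }

    int main() {
        CsrMatrix A;                       // 3x3 matrix diag(2, 2, 2) in CSR form
        A.rowptr = {0, 1, 2, 3};
        A.col    = {0, 1, 2};
        A.val    = {2.0, 2.0, 2.0};
        Vector b(3), x(3);
        b.v = {2.0, 4.0, 6.0};
        richardson(A, b, x, 0.4, 50);      // converges to x = (1, 2, 3)
        std::printf("x = %g %g %g\n", x.v[0], x.v[1], x.v[2]);
        return 0;
    }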
Robert L. Clay
S. Ashby, C. Baldwin, W. Bosl, R. Falgout, R. Maxwell, J. Murphy,
N. Rosenberg, D. Shumaker, C. San Soucie, S. Smith, A. Tompson
Lawrence Livermore National Laboratory
International Technology Corporation
This poster and video presentation will describe a multidisciplinary effort to develop a sophisticated simulation code for modeling multiphase flow and multicomponent transport through three-dimensional heterogeneous porous media. The simulator includes scalable subsurface modeling capabilities, a fast flow solver, and accurate component transport schemes. In particular, we employ grid-independent conceptual models and use geostatistical techniques to reproduce fine-scale heterogeneities. Fluid flow velocities are calculated via a scalable and fast multigrid algorithm. We offer the user the choice of a highly accurate Godunov procedure for advective transport or a PIC code for advective-diffusive transport coupled with reactive effects. The simulator runs on a variety of computing platforms, ranging from a single workstation to massively parallel computers. We will show a video highlighting our efforts to model several complex real-world sites that present many computational challenges. These include site-scale modeling to analyze various pumping strategies for remediation efforts and regional-scale modeling to study water resource management issues. The sites to be modeled are large and/or need to be highly resolved, resulting in problems having 8M computational zones. The sites also have complex geometries and boundary conditions, varying degrees of subsurface heterogeneity, and need sophisticated pumping strategies. We will demonstrate the scalability of the ParFlow simulator on a 256-node CRAY T3D.
Steven Ashby
Eugene D. Brooks III and Karen H. Warren
Lawrence Livermore National Laboratory
Livermore, California 94550
In the past, practitioners of high performance computing faced the
relatively simple problem of effectively utilizing a vector processing
architecture for the solution of scientific problems. We currently
have a much more diverse set of architectures to exploit
simultaneously. The problem of efficiently targeting this dissimilar
set of computer architectures has had poor solutions, if any at all.
We have assembled ideas from several sources to create a parallel
extension of ANSI C that can be used efficiently on a wide range
of architectures. The design goal of the parallel programming model
is to achieve reducibility on simpler architectural targets as we
move up the evolutionary chain of architectural complexity. PCP is a
relatively simple programming
language that allows the user explicit control of both data placement
and communication in a shared address space. It offers both loop
parallelism and task parallelism via processor teams.
Rob Neely, Bob Corey, Evi Dube, Scott Futral, Juliana Hsu, Jim Maltby
Rose McCallen, Al Nichols, Ivan Otero, Tim Pierce, Richard Sharp
Lawrence Livermore National Laboratory
This poster describes our work on the parallelization of ALE3D, a general purpose 3D finite element code incorporating Arbitrary Lagrange-Eulerian (ALE) continuum mechanics, explicit and implicit time integration, slide surfaces, coupled heat transfer, and chemical transport. The parallel design is driven by the requirement of having a single portable code that runs efficiently on architectures ranging from workstations to SMPs to MPPs, and combinations thereof. We discuss various issues encountered during the parallelization of each of the major packages presented above, and describe a design that allows us to efficiently take advantage of architectures which combine distributed and shared memory environments. A scheme to parallelize slide surfaces is presented that provides good dynamic load balancing as the problem geometry continuously changes. Preliminary performance results showing both execution speedups and solvable problem sizes on several parallel architectures will also be presented.
Rob Neely
The computational capability required for production modeling of complex astrophysical systems in three spatial dimensions dictates the use of parallel computing methods. This paper describes the plan for parallelization of a three-dimensional, unstructured-mesh, multi-physics-package astrophysics code. The initial parallel implementation of this code will employ a spatial-domain-decomposition and message-passing programming paradigm for use on distributed-memory multiprocessors. Techniques for the minimization of parallel overheads, such as load imbalance and communication, will be discussed in the context of a multi-physics-package code. For instance, it is anticipated that different domain decompositions may be utilized for several of the physics packages in order to minimize both the parallel overheads and the memory requirements per processor. Plans for future hybrid parallel computations, which will use both message-passing and shared-memory programming techniques on clusters of symmetric multiprocessors, will also be discussed.
Dr. Richard Procassini
We describe the initial design and implementation of a dense out-of-core solver as an extension to the ScaLAPACK library. The current implementation includes LU factorization with partial pivoting, Cholesky factorization for symmetric positive definite matrices, and QR factorization for general rectangular matrices. Work on band solvers is also currently underway. We implemented a left-looking, column-panel-oriented algorithm with a panel size that varies during the factorization to fully utilize all available memory. ScaLAPACK and PBLAS routines are reused to achieve high performance for in-core computations.
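To make the left-looking, column-panel organization concrete, the toy sketch below factors an in-memory matrix with the same loop structure; the comments mark where the out-of-core version would read and write panels on disk. It omits pivoting and the 2D block-cyclic distribution, so it is only a structural illustration under those simplifying assumptions, not the ScaLAPACK extension itself.

    // Toy left-looking, column-panel LU factorization (no pivoting) with the
    // same loop structure as an out-of-core solver; the comments mark where the
    // out-of-core version would read and write panels on disk.
    #include <vector>
    #include <cstdio>

    typedef std::vector<double> Mat;   // column-major n x n
    inline double& at(Mat& A, int n, int i, int j) { return A[i + (std::size_t)j * n]; }

    void left_looking_lu(Mat& A, int n, int nb) {
        for (int j0 = 0; j0 < n; j0 += nb) {           // current panel: columns [j0, j1)
            int j1 = (j0 + nb < n) ? j0 + nb : n;
            // --- out-of-core: READ panel [j0, j1) from disk here ---
            for (int k0 = 0; k0 < j0; k0 += nb) {      // every previously factored panel
                int k1 = (k0 + nb < n) ? k0 + nb : n;
                // --- out-of-core: READ factored panel [k0, k1) from disk here ---
                for (int j = j0; j < j1; ++j)          // triangular solve, rows k0..k1
                    for (int k = k0; k < k1; ++k)
                        for (int i = k + 1; i < k1; ++i)
                            at(A, n, i, j) -= at(A, n, i, k) * at(A, n, k, j);
                for (int j = j0; j < j1; ++j)          // trailing update, rows k1..n
                    for (int k = k0; k < k1; ++k)
                        for (int i = k1; i < n; ++i)
                            at(A, n, i, j) -= at(A, n, i, k) * at(A, n, k, j);
            }
            for (int k = j0; k < j1; ++k) {            // unblocked LU of the panel itself
                for (int i = k + 1; i < n; ++i)
                    at(A, n, i, k) /= at(A, n, k, k);
                for (int j = k + 1; j < j1; ++j)
                    for (int i = k + 1; i < n; ++i)
                        at(A, n, i, j) -= at(A, n, i, k) * at(A, n, k, j);
            }
            // --- out-of-core: WRITE factored panel [j0, j1) back to disk here ---
        }
    }

    int main() {
        int n = 8, nb = 3;
        Mat A(n * (std::size_t)n), A0;
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                at(A, n, i, j) = (i == j) ? n : 1.0 / (1.0 + i + j);
        A0 = A;
        left_looking_lu(A, n, nb);
        double err = 0.0;                              // check: L*U should reproduce A
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double s = 0.0;
                for (int k = 0; k <= (i < j ? i : j); ++k)
                    s += ((k == i) ? 1.0 : at(A, n, i, k)) * at(A, n, k, j);
                double d = s - at(A0, n, i, j);
                if (d < 0) d = -d;
                if (d > err) err = d;
            }
        std::printf("max |LU - A| = %g\n", err);
        return 0;
    }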
I/O is performed in high-level routines that read or write general sub-sections of ScaLAPACK 2D block-cyclic distributed arrays to disk. These routines support a shared file on the Intel Paragon (all data reside in a single file) and distributed files on a PVM cluster (data distributed on local disks).
Preliminary results with double precision solvers on 64 nodes of the Intel Paragon XP/S 35 at the Center for Computational Sciences, Oak Ridge National Laboratory, show that the out-of-core factorization requires approximately 20% extra overhead compared to the in-core solvers.
Contact information:
Les Cottrell, Gary Haney, Terry Healy, Connie Logg, David Martin, Bill Wing,
Lois White
Stanford Linear Accelerator Center, Oak Ridge National Laboratory,
Brookhaven National Laboratory, Fermi National Accelerator Laboratory
As the explosive growth of the Internet continues, the use of the Internet as a vehicle for conducting scientific research is being questioned. Presented herein are the results of detailed work undertaken by a focal group of the U.S. Dept. of Energy's Energy Sciences Network (ESnet) chartered to address the impacts of Internet growth. This work will show the impact of Internet growth on the consistency and stability of Internet connections to some of ESnet's primary national and international research partners, and will also provide some recommendations for Internet monitoring.
Gary Haney, Oak Ridge National Laboratory, (423) 574-4629, hny@ornl.gov
William Saphir, Alex Woo, Maurice Yarrow
NASA Ames Research Center
We present performance results for version 2.1 of the NAS Parallel Benchmarks (NPB) on the following architectures: IBM SP2/66 MHz, SGI Power Challenge Array/90 MHz, Cray Research T3D, Cray Research T3E, and Intel Paragon. NPB 2 is an implementation, based on Fortran 77 and the MPI message passing standard, of the original NAS Parallel Benchmark specifications. The NPB 2 suite is intended to be run with little or no tuning, in contrast to NPB vendor implementations, which are highly optimized for specific architectures. NPB 2 results complement, rather than replace, previously reported NPB results. Because they have not been optimized by vendors, NPB 2 implementations approximate the performance a typical user can expect for a portable parallel program on a distributed memory parallel computer. Together these results provide a well-calibrated comparison of the real-world performance of several parallel computers. By comparing these results to NPB 1 results, we draw conclusions about what optimization must be done to obtain high performance on these systems.
William Saphir
Udaya A. Ranawake, University of Maryland Baltimore County
Bruce Fryxell, George Mason University
John E. Dorband, NASA Goddard Space Flight Center
We consider the parallel implementation of an Euler equation solver using the piecewise parabolic method (PPM) on an HP-Convex Exemplar SPP1000. The performance of a message passing implementation based on PVM is compared against several implementations based on the shared memory programming model. The different versions based on the shared memory paradigm utilize different memory class addressing schemes in order to determine the best memory layout for the shared data structures. A calculation on a 450 by 1800 grid using the shared memory version of the program delivers 56 Mflops per node on all 15 processors of an Exemplar. We also discuss the programming effort involved in optimizing this code.
Udaya A. Ranawake
Abdullah I. Meajil (1), Tarek El-Ghazawi (1), and Thomas Sterling (2)
(1)Department of Electrical Engineering and Computer Science
The George Washington University
(2)Center of Excellence in Space Data & Information Sciences
NASA/Goddard Space Flight Center
Experimental design of parallel computers calls for quantifiable methods to compare and evaluate the requirements of different workloads within an application domain. Such methods can establish the basis for scientific design of parallel computers driven by application needs, optimizing performance relative to cost. In this research, we introduce a new workload representation and workload similarity model that can contribute to important applications such as parallel benchmark design, parallel computer architecture design, and performance prediction on real parallel machines. This parallel workload characterization is based on our parallel instruction centroid and parallel workload similarity models. The centroid is a workload approximation which captures, on average, the type and amount of parallel work generated by the workload. When combined with abstracted information about communication requirements, the result is a powerful tool for understanding the architectural requirements of workloads and their potential performance on target parallel machines. Experimental results using the NASA/NAS Parallel Benchmark suite are used to demonstrate the use of our models.
Abdullah I. Meajil
Zahira S. Khan
Dept. of Mathematics and Computer Science, Bloomsburg University
This paper discusses the design and performance of a parallel multijoin algorithm for executing the multijoin operation of relational databases. The performance of the algorithm is improved by joining the relations in ascending order of join size. The multijoin algorithm consists of three phases. In the first phase, a sampling technique without replacement is used to determine the join selectivity ratios for each of the participating relations. These join selectivities determine the order in which the relations are to be joined. The second phase consists of hashing the relations to be joined. During the third phase the partitioned relations are actually joined. The performance of the algorithm depends on various factors including the size of the relations, join selectivities, data distribution, number of processors used, and the size of the sample.
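A minimal sketch of the second and third phases follows: each relation is hashed on the join attribute so that matching tuples land in the same partition, and co-hashed partitions are then joined independently. The tuple layout, hash function, and partition count are illustrative only, not taken from the paper.

    // Illustrative second and third phases of a multijoin: tuples that agree on
    // the join attribute hash to the same partition, so each partition pair can
    // be joined independently.  Attribute names, hash, and sizes are hypothetical.
    #include <vector>
    #include <utility>
    #include <cstdio>

    struct Tuple { int key; int payload; };

    // Phase 2: hash-partition a relation on the join attribute.
    std::vector<std::vector<Tuple> > hash_partition(const std::vector<Tuple>& rel, int nparts) {
        std::vector<std::vector<Tuple> > parts(nparts);
        for (std::size_t t = 0; t < rel.size(); ++t)
            parts[rel[t].key % nparts].push_back(rel[t]);    // simple modulo hash
        return parts;
    }

    // Phase 3: join one pair of co-hashed partitions on the key.
    std::vector<std::pair<Tuple, Tuple> > join_partition(const std::vector<Tuple>& r,
                                                         const std::vector<Tuple>& s) {
        std::vector<std::pair<Tuple, Tuple> > out;
        for (std::size_t i = 0; i < r.size(); ++i)
            for (std::size_t j = 0; j < s.size(); ++j)
                if (r[i].key == s[j].key) out.push_back(std::make_pair(r[i], s[j]));
        return out;
    }

    int main() {
        std::vector<Tuple> R, S;
        for (int i = 0; i < 20; ++i) {
            R.push_back(Tuple{i, 100 + i});
            S.push_back(Tuple{i % 7, 200 + i});
        }
        int nparts = 4;                                      // e.g. one partition per processor
        std::vector<std::vector<Tuple> > Rp = hash_partition(R, nparts);
        std::vector<std::vector<Tuple> > Sp = hash_partition(S, nparts);
        std::size_t total = 0;
        for (int p = 0; p < nparts; ++p) total += join_partition(Rp[p], Sp[p]).size();
        std::printf("join produced %zu result tuples\n", total);
        return 0;
    }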
Zahira S. Khan
Zahira S. Khan
Dept. of Mathematics and Computer Science, Bloomsburg University
An undergraduate course entitled "Introduction to Parallel Processing" was taught at Bloomsburg University in the Fall semester of 1995. The goals of this course included making the students proficient in parallel programming techniques, providing experience working on state-of-the-art parallel architectures, and motivating students to conduct and present their research work at departmental seminars. An academic grant from the Pittsburgh Supercomputing Center (PSC) was obtained to provide students with access to the Cray C90. The students attended a three-day workshop at PSC and received training on executing and debugging programs on the T3D. The textbook for the course was "The Art of Parallel Programming" by Bruce Lester. This text uses MultiPascal, a language that simulates shared memory and multicomputer architectures and provides performance statistics for the programs. The paper compares the advantages and disadvantages of using the Cray in the classroom environment with those of using MultiPascal.
Zahira S. Khan
E. Angeles
Acknowledgment: CRAY RESEARCH INC, CONACYT
C. Moreno
David F. Hegarty and M. Tahar Kechadi
Advanced Computational Research Group,
University College Dublin, Ireland.
In this poster we present a parallel simulation environment whose aim is to automatically parallelize computer simulations of complex polymers, DNA, and proteins. The environment aims to mask the heterogeneity of the available hardware and communication resources from the user. We used three criteria in the design: ease of use, exploiting hardware heterogeneity, and achieving high performance through parallelism. This leads to three linked system components: a graphical user interface, a virtual machine, and a runtime system. Together these achieve the specification of the problem, the placement of tasks onto the parallel machine, and the dynamic adaptation of the decomposition to maintain efficiency and react to a changing environment. We present an algorithm which balances the workload while maintaining the locality of the original decomposition. The algorithm is analysed using a theoretical model and experimental results obtained from implementations on a Cray T3D and a workstation cluster.
David Hegarty,
Shandya Bhat
Eastern Michigan University
The objective of this study is to determine the most stable conformation of two analogous molecules: trans-azobenzene and trans-stilbene. No conclusive structural information was available for these two widely known organic compounds before the present study; both the theoretical and experimental information available to date was inconclusive. The present study employs ab initio molecular orbital calculations with a 6-31G** basis set. Calculations were carried out using the GAUSSIAN package of programs on the PSC Supercluster (GAUSSIAN92) and a DEC Alpha AXP system at Eastern Michigan University (GAUSSIAN94). For both molecules the energy minimum was found to be very shallow. Electronic and steric factors determining the relative stability of the planar and nonplanar conformations are also analyzed.
Shandya Bhat
Cazier J.-B., Gaertner K., Fichtner W.
Integrated Systems Lab, ETH Zurich
In semiconductor device simulation, solving three-dimensional problems is not cheap. The reasons are the huge number of unknowns to be considered and the large condition numbers involved. The use of multigrid methods can dramatically reduce the size of the problem that must be solved by a direct method. The aim is to extend the algebraic multigrid methods for the continuity equations from regular to irregular grids, and from a sequential to a parallel vector machine. Results of a first implementation on a Cray J90 (within the framework of the cooperation between Cray Research and the ETHZ (Eidgenossische Technische Hochschule Zurich)) will be given and discussed with respect to the absolute performance limits and the parallel efficiency. A sketch of the algorithm will also be presented.
K. Gaertner, IIS, ETZ, ETH Zurich, Gloriastr. 35, CH-8092 Zurich, Switzerland, e-mail: gaertner@iis.ee.ethz.ch
Job statistics from a supercomputing center can be monitored for many different purposes. Such monitoring is obviously important for system tuning and problem detection. From the user's (and perhaps the funding agencies') point of view, the turnaround time obtained may be very important and can be described with various parameters. Unless monitored, turnaround times may approach equilibrium with workstations accessible locally to the users (IJSCA-HPC 9:4, 312-4, 1995). The poster will show different ways of displaying queueing data from the supercomputing centers supported by the Swedish Council for High-Performance Computing (HPDR). The relevance of different choices of parameters to be monitored will be discussed and related to analytical queueing theory. Possible implications of queueing theory for national computing policies will also be discussed.
Ann-Marie Pendrill
The poster will show the evolution of the APE supercomputer family over the last 10 years.
APE1, the first project, started in 1985 with the aim of developing a SIMD array processor capable of 1 GFlop of peak performance. It was concluded in 1989, after two systems had been produced.
APE100, the second generation of APE supercomputers, started in 1990 with the goal of gaining a factor of 10 in peak performance. To date, more than 20 systems of different sizes have been produced, for a total of more than 300 GFlops.
APEmille, the third generation of machines, is currently under development and is targeted at producing computer systems in the TeraFlops range.
The poster will illustrate the APE architecture and its evolution. It will also describe the status of the current APEmille project.
Bartoloni Alessandro
Anthony-Trung Nguyen, Univ. of Illinois, Champaign-Urbana
Maged Michael, Univ. of Rochester
Most publicly available multiprocessor simulation tools only simulate RISC architectures. Therefore, they cannot capture the instruction mix and memory reference patterns of popular architectures like Intel's x86. Augmint, an execution-driven simulation toolkit, fills this gap by supporting Intel's x86 architecture. Augmint takes a thread-based parallel application annotated with m4 macros, like the SPLASH and SPLASH-2 benchmark suites. Augmint runs on an x86-based uniprocessor PC under UNIX or Windows NT and can simulate multiple processors with very little overhead. It supports a thread-based programming model with a shared global address space and a private stack space per processor. Users can plug in their own architecture simulators. Augmint supports a simulator interface compatible with that of the MINT simulation toolkit for MIPS architectures, thus allowing the reuse of most architecture simulators written for MINT. The source code of Augmint is publicly available from http://www.csrd.uiuc.edu/iacoma/augmint.
Anthony-Trung Nguyen
Xudong Troy Wu and Edward F. Hayes
Department of Chemistry, Ohio State University
The reaction H + O2 --> OH + O is one of the key steps in the combustion of hydrocarbons (e.g., natural gas, gasoline, diesel fuel, coal, etc.). The H-O-O complex is a stable molecule that is energetically accessible from both the reactants and products of this overall reaction. In this study, the rotational and vibrational bound states of the H-O-O complex have been calculated on a Cray T3D. The algorithms that have been developed for this program show good scalability using up to 128 processors, and the maximum performance achieved is 3.2 GFlops.
The computational method involves three key elements: 1) use of the Implicitly Restarted Lanczos Method (IRLM) to obtain the bound state eigenvectors and eigenvalues; 2) transformation of the Hamiltonian for the problem with an efficient Sequential Diagonalization Truncation (SDT) algorithm; and 3) acceleration of the convergence of the IRLM using Chebyshev preconditioning.
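As a rough illustration of the Chebyshev acceleration step only, the sketch below applies a degree-m Chebyshev polynomial of a scaled operator to a vector via the three-term recurrence; the toy tridiagonal matvec and the spectral bounds are stand-ins for the actual SDT-transformed Hamiltonian used in this work.

    // Sketch: apply a degree-m Chebyshev polynomial T_m(Hs) to a vector with the
    // three-term recurrence t_{k+1} = 2*Hs*t_k - t_{k-1}, where Hs = (H - c*I)/e
    // maps the unwanted part of the spectrum into [-1, 1].  The toy tridiagonal
    // matvec and the bounds c, e stand in for the SDT-transformed Hamiltonian.
    #include <vector>
    #include <cstdio>

    typedef std::vector<double> Vec;

    void matvec(const Vec& v, Vec& w) {         // toy H: tridiag(-1, 2, -1)
        std::size_t n = v.size();
        for (std::size_t i = 0; i < n; ++i) {
            w[i] = 2.0 * v[i];
            if (i > 0)     w[i] -= v[i - 1];
            if (i + 1 < n) w[i] -= v[i + 1];
        }
    }

    void chebyshev_filter(const Vec& x, Vec& y, int m, double c, double e) {
        std::size_t n = x.size();
        Vec t0 = x, t1(n), Ht(n);
        if (m == 0) { y = t0; return; }
        matvec(t0, Ht);
        for (std::size_t i = 0; i < n; ++i) t1[i] = (Ht[i] - c * t0[i]) / e;  // T_1 x
        for (int k = 1; k < m; ++k) {
            matvec(t1, Ht);                     // Ht is computed from the current t1
            for (std::size_t i = 0; i < n; ++i) {
                double t2 = 2.0 * (Ht[i] - c * t1[i]) / e - t0[i];
                t0[i] = t1[i];
                t1[i] = t2;
            }
        }
        y = t1;
    }

    int main() {
        Vec x(50, 1.0), y(50);
        chebyshev_filter(x, y, 8, 2.0, 2.0);    // eigenvalues of the toy H lie in (0, 4)
        std::printf("y[0] = %g, y[25] = %g\n", y[0], y[25]);
        return 0;
    }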
Mark Newsome (1), Cherri Pancake (1), and Joe Hanus (2)
(1) Department of Computer Science
(2) Department of Botany and Plant Pathology
Oregon State University
QueryDesigner is a Web-based tool for constructing query interfaces directly in Netscape Web browsers. The tool enables users who are not computer experts to set up their own forms- and hypertext-based query interfaces to remote SQL databases. No experience in SQL or HTML programming is necessary. After choosing a target SQL database on the Internet, the user can build a personalized query interface by making menu selections and filling out forms; the tool automatically establishes network connections and composes the HTML and SQL code. The generated query form can be used immediately to issue a query, customized, or saved for later use. Results returned from the database are dynamically formatted into hypertext for navigating related information in the database. The tool has been used successfully to implement query interfaces for several biological databases.
Mark Newsome
T. P. Kelliher, R. M. Owens, M. J. Irwin
MicroSystems Research Laboratory
Department of Computer Science and Engineering
The Pennsylvania State University
University Park, PA 16802
(T. P. Kelliher is with the Department of Mathematics and Computer
Science,
Westminster College, New Wilmington, PA 16172.)
The Micro-Grain Array Processor (MGAP-2) is an array of 49,152 micro-grain processors, implemented as a planar mesh, operating at 50 MHz, and capable of 4.9 teraops. Each processor has 32 bits of local dual-port RAM, computes two three-input boolean functions per clock, and has a dynamically reconfigurable interconnect to each of its four neighbors. This communication flexibility allows algorithms to be mapped onto the array in an efficient manner and the processors to be dynamically grouped into larger computational units. The entire MGAP-2 system fits onto a single 9U x 400 mm VME board.
We have developed a high level language, *C++, for programming the MGAP-2 and have targeted efficient systolic, low-communication-complexity algorithms for applications such as basic arithmetic and image processing operations, motion estimation, speech recognition, computational molecular biology, simulation of physical phenomena using a cellular automaton model, the Hough Transform, the Discrete Wavelet Transform, the Discrete Cosine Transform, and Singular Value Decomposition.
Thomas P. Kelliher
M. B. Ignatiev, Y. E. Sheinin, D. E. Tatkov
The State Academy for Aerospace Instrumentation, St. Petersburg, Russia
Full-scale parallel programming for general-purpose massively parallel computers and distributed systems demands new paradigms and means.
A new interactive visual language for parallel programming, called VISA, was developed. VISA is not a WYSIWYG-style visual language, but a true programming language for specifying parallel algorithms. The graphical clauses of a parallel program are constructed of icons according to the syntactic and semantic rules of the language. The semantics of the control operators of the language is based on the Developing Asynchronous Processes model of parallel computations. A program in VISA is presented in graphical form as a dynamic network of operators and data objects. A full-scale parallel program is a complex multicomponent, multilinked structure. It is impossible either to construct ("write") or to understand ("read") such a program outside a CASE system. A practicable visual parallel programming language can only be an interactive language.
The prototype integrated programming tool set for VISA is presented. It is written in C++ and runs on a PC under Windows 3.11.
State Academy for Aerospace Instrumentation
Yongwha Chung and Viktor K. Prasanna
University of Southern California
The goal of this research is to develop scalable and portable parallel algorithms for intermediate- and high-level vision problems. Parallelizing intermediate- and high-level vision applications is challenging due to the irregular computation and communication features of these algorithms. In this work, we parallelize a system to detect and describe buildings from monocular views of aerial scenes. The computational tasks of this system include image feature extraction, perceptual grouping, shadow analysis, and hypothesis selection/verification. To our knowledge, our system is the first to provide interactive performance for intermediate- and high-level vision tasks on general-purpose HPC platforms. We first define a realistic model of distributed memory machines to estimate communication cost. Based on this, we design an algorithmic framework which enhances processing node utilization and overlaps communication with computation by maintaining algorithmic threads in each processing node. For example, given a 1024 x 1024 image, the image feature extraction and one of the perceptual grouping steps can be performed in 0.717 seconds on a 64-node T3D; a serial implementation takes 29.643 seconds on a single node of the T3D. We use C and MPI for our implementations to make them portable to other HPC platforms. By using our system, the execution time to produce a 3D description of buildings can be reduced from a few hours to a few seconds. This research was supported in part by NSF under grant CCR-9317301 and in part by DARPA under grant F49620-93-1-0620.
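The overlap of communication with computation that the framework maintains can be pictured, in plain MPI terms, as posting non-blocking transfers, working on data that does not depend on them, and waiting only when the incoming data is needed. The buffer names and sizes in the sketch below are invented for the example; it is not the authors' framework itself.

    // Minimal sketch of overlapping communication with computation with MPI:
    // post the exchange, work on data that does not depend on it, then wait.
    // Buffer names and sizes are invented for the example.
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int N = 1 << 16;
        static double sendbuf[1 << 16], recvbuf[1 << 16], local[1 << 16];
        for (int i = 0; i < N; ++i) { sendbuf[i] = rank; local[i] = i; }

        int right = (rank + 1) % size, left = (rank - 1 + size) % size;
        MPI_Request req[2];

        /* Post the boundary exchange first ... */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

        /* ... then do interior work that does not need the incoming data ... */
        double interior = 0.0;
        for (int i = 0; i < N; ++i) interior += local[i] * local[i];

        /* ... and block only when the boundary data is actually required. */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        printf("rank %d: interior = %g, boundary term from rank %d = %g\n",
               rank, interior, left, recvbuf[0]);
        MPI_Finalize();
        return 0;
    }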
Viktor K. Prasanna
Jin-Woo Suh and Viktor K. Prasanna
University of Southern California
Recently, HPC technology has been employed to realize real-time embedded signal processing applications such as Space-Time Adaptive Processing (STAP), Synthetic Aperture Radar (SAR), and sonar systems. For the evaluation of these systems, many real-time benchmarks have been proposed by the DoD HPC community. These include the Hartstone, Rhealstone, TPC, MITRE+Rome Lab., and PARKBENCH benchmarks. These real-time benchmarks differ from traditional HPC benchmarks in many ways: 1) time is considered the most critical factor, 2) real-time performance is measured rather than off-line performance, 3) benchmarks are usually executed many times to evaluate fluctuations in run time, and 4) throughput, as opposed to latency, is a very important measure of a system's ability to meet time constraints. For scalable implementations of real-time benchmarks, we have developed scalable communication primitives for an N-to-M processor pipeline. In this algorithm, the number of communication steps is reduced to ceiling(lg(M/N+1))+N-1, whereas previous algorithms take MN steps. Using our algorithm, we have implemented the 2D real-time FFT benchmark recently defined by MITRE and Rome Lab. on the SP2 and T3D. The results have been very encouraging: the number of processors needed is reduced by 25% compared with earlier implementations, for the FFT operations needed in SeaSAT SAR processing. Our code, written in C and MPI, is portable to other HPC platforms.
Viktor K. Prasanna
Takashi Amisaki
Shimane University
Shinjiro Toyoda
Fuji Xerox Co., Ltd.
Hiroo Miyagawa
Taisho Pharmaceutical Co., Ltd.
Akihiro Kusumi
The University of Tokyo
Eiri Hashimoto, Hitoshi Ikeda, Nobuaki Miyakawa
Fuji Xerox Co. Ltd.
Kunihiro Kitamura
Taisho Pharmaceutical Co., Ltd.
Molecular dynamics (MD) simulation presents a challenging problem to computer technology: simulating the behavior of large and complex systems such as biomolecules requires a very long time. This is due to the large number of pairwise, long-range, non-bonded interactions between constituent particles, which increases as O(N^2) as the number of particles N in the system increases. To overcome this problem, we developed the MD Engine, a hardware accelerator designed to be plugged into a workstation. The MD Engine is composed of a homogeneous array of custom processor chips, which calculate the pairwise forces exerted on each particle by all other particles in the system. It accommodates periodic boundary conditions and the Ewald method to evaluate Coulombic forces. With an MD Engine consisting of 24 processors plugged into a SPARCstation 10, an MD simulation of a biomembrane system (22,264 atoms) proceeds faster than on an R8000 workstation by a factor of 48.
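The O(N^2) cost comes from the pairwise non-bonded loop that the MD Engine processors evaluate in hardware. A plain-software sketch of that loop for the bare Coulomb term, without the Ewald decomposition or periodic images, and with illustrative units and particle data, is shown below.

    // Sketch of the O(N^2) pairwise non-bonded loop that the MD Engine evaluates
    // in hardware.  Bare Coulomb only (no Ewald sum, no periodic images); the
    // particle data and units are illustrative.
    #include <vector>
    #include <cmath>
    #include <cstdio>

    struct Particle { double x, y, z, q; };

    void pairwise_forces(const std::vector<Particle>& p,
                         std::vector<double>& fx, std::vector<double>& fy,
                         std::vector<double>& fz) {
        std::size_t n = p.size();
        fx.assign(n, 0.0); fy.assign(n, 0.0); fz.assign(n, 0.0);
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = i + 1; j < n; ++j) {
                double dx = p[i].x - p[j].x;
                double dy = p[i].y - p[j].y;
                double dz = p[i].z - p[j].z;
                double r2 = dx * dx + dy * dy + dz * dz;
                double f  = p[i].q * p[j].q / (r2 * std::sqrt(r2));  // q_i q_j / r^3
                fx[i] += f * dx;  fx[j] -= f * dx;   // Newton's third law halves the work
                fy[i] += f * dy;  fy[j] -= f * dy;
                fz[i] += f * dz;  fz[j] -= f * dz;
            }
        }
    }

    int main() {
        std::vector<Particle> p;
        for (int i = 0; i < 100; ++i)
            p.push_back(Particle{0.1 * i, 0.2 * (i % 7), 0.3 * (i % 3), (i % 2) ? 1.0 : -1.0});
        std::vector<double> fx, fy, fz;
        pairwise_forces(p, fx, fy, fz);
        std::printf("force on particle 0: (%g, %g, %g)\n", fx[0], fy[0], fz[0]);
        return 0;
    }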
Takashi Amisaki
DongSheng Cai
Department of Physics, University of California at Los Angeles
Institute of Information Sciences and Electronics, University of Tsukuba
This poster discusses data-parallel algorithms suitable for parallel skeleton Particle-In-Cell (PIC) codes using HPF/MPI. The algorithms are based on a vector model of computation, i.e., the scan model. The purpose of this paper is to show how the model can be applied to a set of vector algorithms in parallel PIC codes using HPF/MPI. A skeleton PIC code is a cycle consisting of four steps: (1) solving the fields on a grid; (2) interpolating the fields to particle positions; (3) advancing particle positions and velocities with the fields; and (4) interpolating particle charge and current densities to the grid. This cycle is the essential part of the PIC code, and the skeleton code was developed in order to analyze the performance of PIC codes on various platforms. The code is written in HPF/MPI for portability. The code was developed on the CRAY T3D and on personal PC clusters running Linux.
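The four-step cycle can be written as a compact driver loop. The sketch below is a serial 1D electrostatic toy, not the HPF/MPI skeleton code itself; it only shows how the deposit, field-solve, interpolate, and push steps fit together, with grid sizes and weighting chosen for illustration.

    // Serial 1D electrostatic toy showing how the four PIC steps fit together
    // (deposit, field solve, interpolate, push).  This is only a structural
    // sketch, not the HPF/MPI skeleton code itself.
    #include <vector>
    #include <cmath>
    #include <cstdio>

    const int    NG = 64;                 // grid points
    const double L  = 1.0;                // periodic domain length
    const double DX = L / NG;
    const double DT = 0.05;
    const double PI = 3.141592653589793;

    // Step (4): deposit particle charge onto the grid (nearest-grid-point weighting).
    void deposit(const std::vector<double>& x, std::vector<double>& rho) {
        rho.assign(NG, -1.0);             // uniform neutralizing background
        double w = (double)NG / x.size();
        for (std::size_t p = 0; p < x.size(); ++p)
            rho[(int)(x[p] / DX + 0.5) % NG] += w;
    }

    // Step (1): solve for the field on the grid, dE/dx = rho, periodic domain.
    void solve_field(const std::vector<double>& rho, std::vector<double>& E) {
        E.assign(NG, 0.0);
        for (int j = 1; j < NG; ++j) E[j] = E[j - 1] + DX * rho[j];
        double mean = 0.0;
        for (int j = 0; j < NG; ++j) mean += E[j] / NG;
        for (int j = 0; j < NG; ++j) E[j] -= mean;   // remove the arbitrary constant
    }

    // Steps (2)+(3): interpolate the field to each particle and push it.
    void push(std::vector<double>& x, std::vector<double>& v, const std::vector<double>& E) {
        for (std::size_t p = 0; p < x.size(); ++p) {
            v[p] += DT * E[(int)(x[p] / DX + 0.5) % NG];
            x[p] = std::fmod(x[p] + DT * v[p] + L, L);
        }
    }

    int main() {
        const int NP = 4096;
        std::vector<double> x(NP), v(NP, 0.0), rho, E;
        for (int p = 0; p < NP; ++p) {    // slightly perturbed uniform particle load
            x[p] = (p + 0.5) * L / NP;
            x[p] = std::fmod(x[p] + 0.01 * std::sin(2 * PI * x[p]) + L, L);
        }
        for (int step = 0; step < 100; ++step) {  // the PIC cycle
            deposit(x, rho);
            solve_field(rho, E);
            push(x, v, E);
        }
        std::printf("after 100 steps: x[0] = %g, v[0] = %g\n", x[0], v[0]);
        return 0;
    }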
DongSheng Cai
Don Morton
Academic year
Department of Mathematical Sciences
Cameron University
Lawton, OK 73505
Summers
Arctic Region Supercomputing Center
University of Alaska
Fairbanks, AK 99775
A heterogeneous, distributed, adaptive finite element code, originally developed by the author for the Cray Y-MP/T3D system, has been modified to take advantage of the Single-Program Multiple-Data (SPMD) paradigm. The original program utilized a Cray Y-MP process for addressing issues of global mesh modification, while the actual finite element computations were distributed to Cray T3D processes. Packaging this work into an SPMD environment provides greater flexibility in choosing an architecture and removes a communications bottleneck that was encountered in the Y-MP/T3D implementation. The SPMD version of the program has been successfully implemented on a Cray T3D in standalone mode and on a cluster of Pentium PCs running the Linux operating system. Timing data for both architectures will be provided.
Dr. Don Morton
Marsha Mooradian, Maui High Performance Computing Center
Vicki Kajioka, Hawaii State Department of Education
The Hawaii State Department of Education (HSDOE), in collaboration with the Maui High Performance Computing Center, will provide a multimedia tour focusing on the innovative N.I.I. Technology Telecommunications for Teacher staff development programs (T3). Utilizing community resources, planning, and collaboration, Hawaii has implemented a variety of significant advancements in the integration of technology across the curriculum throughout the 245 schools statewide.
The exhibit will highlight the relationship between the Maui High Performance Computing Center (MHPCC), the HSDOE and over 150 community businesses through Tech Corps Hawaii, a new non-profit corporation. The poster exhibit will present the technical infrastructure for connecting to the N.I.I. through a project called "Let's Get Wired" and the extensive training programs which have been successfully completed over the past three years.
Three successful computer integration projects will also be featured: the Hawaii Super Computing Challenge Competition, The Electronic School - A Virtual Education Community, and the T3 - Technology Telecommunications Teacher professional development project. These programs have enhanced student learning in Hawaii by utilizing the Internet as a resource for students, teachers, and parents. Web pages created by students and teachers will provide examples of the innovative cross-curricular activities.
Contact Information:
Marsha Mooradian: Maui High Performance Computing Center
Nikos P. Chrisochoides and Florian Sukup
Cornell Theory Center
The unpredictable and irregular nature of parallel algorithms for unstructured computations makes their efficient implementation on top of synchronous communication primitives, such as blocking sends/receives, difficult. As an alternative, one can use asynchronous communication mechanisms, such as Active Messages, that do not require a rendezvous between the sender and receiver. Processors do not have to busy-wait for requested data or remote service requests; they can proceed with their remaining work. Thus asynchronous communication can improve a program's performance by masking communication and synchronization overheads. Unfortunately, programmers have to address a number of difficult problems that are inherent to the asynchronous programming paradigm. In this project we attempt to help the programmer by providing a runtime library that makes asynchronous programming easier and more intuitive for the user. In addition, the runtime library provides sophisticated data transfers that improve the performance of naive implementations. We demonstrate the effectiveness of our runtime library by implementing a kernel, the Bowyer-Watson algorithm, that is very useful in parallel Delaunay triangulation methods for unstructured grid generation. The efficient implementation of the Bowyer-Watson algorithm on multiprocessors is a challenging problem: its computation and communication patterns are variable and unpredictable. The results are quite impressive: we eliminate 66% of the communication for small to medium size messages.
Nikos P. Chrisochoides and Florian Sukup
Jerry Gerner, Steven Hotovy, and David Schneider
Cornell Theory Center
Computational scientists and engineers and other would-be users of high-performance computers usually turn to parallel computing because their problems are "big" in several different dimensions, including the wall-clock time necessary to solve the problem ("mean time to publication"), memory requirements, and data storage and I/O bandwidth. For researchers with applications that involve hundreds of megabytes to tens of gigabytes (or more) of problem data, the availability of parallel I/O facilities on the Cornell Theory Center's 512-node IBM SP2 has proven to be an important factor in their ability to "do 'big' science". We provide some background on and motivation for parallel I/O, a description of the current PIOFS (Parallel I/O File System) configuration for the SP2, and a description of several scientific applications (using the SP2 and PIOFS) and the results they have obtained.
Jerry Gerner
Nigel Goddard and Greg Hood
Pittsburgh Supercomputing Center
Parallel computing platforms are becoming ubiquitous, providing computational power up to three orders of magnitude beyond desktop machines. We have extended the Genesis simulator to run on these platforms, enabling effective investigation of much larger problems. Portability of simulations across serial and parallel platforms, and optimization of data communication, are key goals. Extensions to the Genesis scripting language hide much of the complexity from the user while allowing explicit control over the partitioning of the simulation and inter-processor synchronization. We envision multi-cell network models and parameter searching applications as those most likely to benefit from this work. Cray T3D experiments demonstrate superlinear speedup for low processor counts, due to the n-squared complexity of the serial algorithm. Speedup decreases to linear at 16 processors and becomes sub-linear beyond that, although the exact cutoffs are highly model dependent. The package is now in production at PSC and is being ported to other MPPs.
Nigel Goddard
Alexander J. Ropelewski, Joseph Geigel,
Hugh B. Nicholas Jr., David W. Deerfield II
Pittsburgh Supercomputing Center
Characteristic sequence data sets must be unbiased and representative of the entire range of biological molecules in the data. For instance, a model that accurately describes a family of proteins needs to represent the entire phylogenetic range in which the protein is found and not be biased toward some subset of the known proteins. We describe a scalable parallel-vector technique for selecting a characteristic set of protein sequences. This approach is based on rigorous comparisons of every pair of sequences, a computation that is proportional to the product of the lengths of the sequences compared. This computation can be both vectorized and parallelized, allowing the characteristic data set to be selected using semi-empirical statistical techniques. Characteristic data sets selected in this manner are appropriate for multiple sequence alignments and for the study of common biochemical properties. This research was funded by NIH-NCRR grant 1 P41 RR06009.
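Each rigorous pairwise comparison is a dynamic-programming alignment whose cost is the product of the two sequence lengths. The sketch below shows the inner recurrence for a simple global-alignment score with illustrative match/mismatch/gap values; the production computation uses full substitution matrices and is what gets vectorized and parallelized across sequence pairs.

    // Sketch of one pairwise comparison: a Needleman-Wunsch-style global
    // alignment score whose cost is length(a) * length(b).  The scoring values
    // are illustrative; the real computation uses full substitution matrices.
    #include <string>
    #include <vector>
    #include <algorithm>
    #include <cstdio>

    int align_score(const std::string& a, const std::string& b) {
        const int MATCH = 2, MISMATCH = -1, GAP = -2;
        std::size_t n = a.size(), m = b.size();
        std::vector<int> prev(m + 1), cur(m + 1);
        for (std::size_t j = 0; j <= m; ++j) prev[j] = (int)j * GAP;
        for (std::size_t i = 1; i <= n; ++i) {
            cur[0] = (int)i * GAP;
            for (std::size_t j = 1; j <= m; ++j) {
                int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? MATCH : MISMATCH);
                cur[j] = std::max(diag, std::max(prev[j] + GAP, cur[j - 1] + GAP));
            }
            prev.swap(cur);
        }
        return prev[m];
    }

    int main() {
        // Toy "protein" sequences; the real data sets compare every pair in a family.
        std::string s1 = "MKTAYIAKQR", s2 = "MKTAHIAKQR", s3 = "GGGGGGGGGG";
        std::printf("score(s1,s2) = %d   score(s1,s3) = %d\n",
                    align_score(s1, s2), align_score(s1, s3));
        return 0;
    }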
Alexander J. Ropelewski
Pittsburgh Supercomputing Center
4400 Fifth Avenue
Pittsburgh, PA 15213
University of Pittsburgh Medical Center
Department of Pathology
200 Lothrop Street
Pittsburgh, PA 15213
A joint effort between the Pittsburgh Supercomputing Center (PSC) and the University of Pittsburgh Medical Center (UPMC) is producing a large archive of pathology images with associated search and display software. Image sets contain standard magnifications of microscope slides tagged with pathologist's evaluations for use as known examples for training and comparison with unknown images. Methods for classifying, comparing, and retrieving images by content are the primary focus of the PSC portion of the project.
Initial tests of classification methods have been incorporated into an automated tool for grading severity of prostate cancer images. Results correlate well with grades assigned by UPMC pathologists. We are extending the classification methods to construct image signatures for the entire archive which can be used to identify images matching content query patterns. The poster exhibit outlines the role of PSC supercomputers in constructing effective image signatures and providing high speed image retrieval.
Arthur W. Wetzel
Armen Ezekielian
Ohio Supercomputer Center
A description of strongly-interacting bound states of quarks and antiquarks based on the theory of Quantum Chromodynamics (QCD) is one of the primary goals of theoretical elementary particle physics. Whereas the high-energy regime of QCD is believed to be well understood, theoretical calculations in the low-energy part of the theory, in which bound states of quarks and antiquarks live, have been difficult to obtain from QCD. Such calculations are critical to obtaining a more complete knowledge of elementary particles and their interactions. The field of lattice gauge theory has been responsible for the majority of numerical simulations which probe the bound-state structure of strongly-interacting systems. An alternative method proposed more recently has been the use of light-front quantization to obtain an effective QCD Hamiltonian. This Hamiltonian is then diagonalized numerically to obtain the energy eigenvalues and eigenvectors of the bound state in question. In the current study an effective Hamiltonian is derived from a quantization of QCD on the light front. Energy eigenvalues and wave functions (eigenvectors) for quark-antiquark bound states are obtained numerically. A comparison of the numerical results with experimental data is performed.
Armen Ezekielian
This numerical simulation provides the trajectories and reaction histories of coal particles in passage through a pulverized coal flame. Direct numerical simulation (DNS), instead of heuristic models with empirical parameters, is used to determine the flow field in the combustor geometry. Particles injected at various radial locations in the entering jet are transported by the unsteady fluid flow with the drag force due to the local fluid velocity. Particle reaction histories, modeled with a single-step, first-order rate equation for the volatile release and the Extended Resistance Equation for the char reaction, are calculated by including experimentally based temperature and oxygen concentration fields, which provide the effect of a flame. As the particle reactions (devolatilization and combustion) proceed, there are changes in particle density and size, which also affect the particle motion and energy.
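The single-step, first-order devolatilization model referred to above is conventionally written with an Arrhenius rate constant; the expression below is the standard textbook form, with generic symbols rather than coefficients taken from this work:

    \frac{dm_v}{dt} = -k \, m_v, \qquad k = A \exp\!\left(-\frac{E}{R\,T_p}\right)

where m_v is the remaining volatile mass, A and E are the pre-exponential factor and activation energy, R is the gas constant, and T_p is the particle temperature.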
This study shows the potential for developing computational histories of reacting particles in pulverized coal flames at a level of detail evidently exceeding what has been possible in the past.
Contact: Charlie Bender, director
Tim Rozmajzl
Ohio Supercomputer Center
The mixing of high-speed, imperfectly-expanded, turbulent jets with surrounding air is an important consideration in the design of high-speed aircraft. In particular, the mixing characteristics of these jets play an important role in the production and propagation of jet noise. Numerical simulation of such jets provides an effective means of investigating the unsteady flow mechanisms that contribute to the mixing process. A detailed understanding of the critical flow features involved in the mixing process and their effect on jet noise is necessary for implementing design modifications to enhance mixing and reduce jet noise. In the current study the time-dependent Navier-Stokes equations are solved numerically for an underexpanded rectangular jet and for an overexpanded round jet with a convergent-divergent nozzle. The rectangular jet operates at a fully-expanded Mach number of approximately 1.44, and the round jet has an exit Mach number of 1.4. Numerical results include the time-varying distribution of flow variables such as density, pressure, temperature, Mach number and vorticity. In addition, a Fourier analysis of fluctuations in the flow variables is presented. Where possible, numerical results are compared with experiment.
Tim Rozmajzl