Kivanc Dincer
dincer@npac.syr.edu
http://www.npac.syr.edu/users/dincer/
Geoffrey C. Fox
gcf@npac.syr.edu
http://www.npac.syr.edu/users/gcf/
Recent advances in both network bandwidth and microprocessor performance are radically altering high-performance computing environments. There is a strong trend toward building virtual computers from heterogeneous resources distributed on a network. Such virtual computers, with enormous computing power and storage capacity, can be used to attack many computationally challenging problems whose computing and storage requirements are beyond the capacity of a single dedicated parallel computer or supercomputer. The virtual computer itself may contain several types of architectures, which makes it well suited to metaproblems: problems consisting of many constituent subproblems, each matched to a specific type of computer architecture, while the metaproblem as a whole is outside the scope of any single architecture. To use such an environment as efficiently as possible, each part of the workload should run on the computer best suited to it.
The existing and emerging World-Wide Web (WWW) infrastructure is expected to make great contributions toward building a virtual computing environment that allows large-scale distributed and high-performance computing and communications (HPCC) applications to effectively utilize the computing potential of the existing pool of network-connected resources. The WWW gives every type of machine in the world a standard open interface through which individual machines on a network can be manipulated. Web servers can easily be converted into computation nodes, or into coordinators that master other nodes, by using Common Gateway Interface (CGI) [10] extensions. The total computational power of such an assembly of machines is enormous.
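To make the CGI mechanism concrete, here is a minimal sketch in C of a gateway program that turns an ordinary Web server into a computation node. It is an illustration of the technique only, not the WWVM's actual gateway code; the fixed command it runs is a placeholder, and a real gateway must validate requests instead of executing arbitrary input.

    /* compute.c -- minimal CGI "compute node" sketch (illustrative only). */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* CGI passes request parameters in the QUERY_STRING variable. */
        const char *query = getenv("QUERY_STRING");
        FILE *p;
        int c;

        /* Every CGI response begins with a header block. */
        printf("Content-type: text/plain\r\n\r\n");
        if (query != NULL)
            printf("request: %s\n", query);

        /* Run one fixed, trusted command and stream its output back
           to the client over HTTP. */
        p = popen("uname -a", "r");
        if (p == NULL) {
            printf("error: could not start computation\n");
            return 1;
        }
        while ((c = fgetc(p)) != EOF)
            putchar(c);
        pclose(p);
        return 0;
    }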
In this paper we describe how we built such a virtual machine prototype, called World-wide Virtual Machine (WWVM), by using the Web and HPCC technologies. Modified Parallel Virtual Machine (PVM) [5] daemons, along with CGI-extended Web servers, form the basis of the communication layer, while Perl scripts construct the coordination engine of this virtual computer. We aim to demonstrate the usefulness of network-based heterogeneous parallel and distributed processing and to provide a shared resource for collaborating researchers on the WWW.
The WWVM works in two modes. In the first mode, the WWVM is configured as a message-passing, distributed-memory computer; each Web server in the current configuration is an attached computation node of this virtual machine.
In the second mode, the WWVM works as a large heterogeneous distributed machine, or metacomputer, for solving metaproblems. A node of this metacomputer may be a single supercomputer, a workstation farm, or even another WWVM configured as a message-passing machine. The site that handles the interaction with the user becomes the metaproblem coordinator and divides the problem among other coordinator Web servers, each running on a specific architecture type. The coordinator Web servers are master sites of a specific virtual machine view and may divide the work into smaller pieces according to the availability of the resources under their control. They fetch the inputs for the tasks (subproblems) assigned to them and notify the metaproblem coordinator when they are done. In Section 4.2, we discuss how the problem is divided among servers by using a coordination language.
The prototype consists of various types of workstations distributed across the Syracuse University campus. We demonstrate the viability of this prototype system by integrating a Web-based programming environment and by running Message Passing Interface (MPI) [7] and PVM message-passing programs and High Performance Fortran (HPF) [6] programs on it. We use the NPAC F90D/HPF compiler [2] and its associated F90D/HPF runtime libraries [1] for HPF processing. High Performance Fortran is an emerging data-parallel programming standard that extends Fortran 90's array constructs with parallel constructs, intrinsic functions, and data-distribution directives. PVM and MPI are message-passing libraries that have become de facto standards.
Section 2 summarizes the basic layers of the WWVM's system architecture. Sections 3 and 4 address the Web interface and coordination layer functions, respectively. Section 5 explains how the Web servers are combined to form a virtual computer. We conclude by commenting on the system security issues and briefly describing the important aspects of our work.
Our design has a modular structure with three layers (Figure 1).
The Web interface layer accepts service requests from the user. Some of these are compilation and execution requests for programs supplied through an editor embedded in the Web interface; the others concern the selection and configuration of computational resources for the specific type of problem at hand. When a metaproblem is defined, the data and control flow of the metaproblem are also specified through this interface.
The configuration and coordination layer serves as a processing engine that analyzes the data supplied through the Web interface and takes the actions necessary to set up a virtual machine view, a specific virtual machine configuration set up by the user by connecting selected Web servers on the Internet. This layer saves the application's data and control-flow information and coordinates the operation of the Web compute servers. It is also responsible for starting up tasks and for transferring data and status information between the source and destination sites.
The virtual machine layer combines Web servers on the Internet, and other machines coordinated by those servers, into a parallel/distributed virtual computer built from resources on a geographically distributed network.
The Web interface provides a metacomputing programming environment for the WWVM that allows the user to specify tasks and the dependencies and input/output relationships between them. Individual tasks may be written in HPF, Fortran, or C using MPI or PVM message-passing libraries.
Using Netscape or another Web browser of choice, the WWVM user contacts the WWVM Web server and is presented with a fill-out form, as shown in Figure 2. He/she manipulates push buttons, check boxes, and text areas to designate the type of services (HPF, Fortran, or C compilation, and serial or parallel execution), the platform (or WWVM configuration view) where those services are to be performed, and the task and data-flow information that defines a metaproblem.
The user also has configuration options, such as adding or deleting nodes and forming his/her own virtual machine view. The user browses the Web and chooses the sites volunteering to become part of the virtual machine; in fact, in many cases he/she is restricted to visiting only the Web sites of known collaborators. If a site volunteers to join a configuration, any trusted party can use that host with a Web password.
After the user has specified the resources he/she needs through the supplied Web form, this information is parsed and analyzed by the interface layer modules, and the configuration parameters are passed to the coordination layer along with the user's programs and task description information. The interface layer is also responsible for displaying outputs to the user.
The configuration and coordination layer has three subsystems: configuration, coordination, and compilation, which we explore in detail in the following paragraphs.
The configuration subsystem is responsible for configuring a virtual machine view based on the given configuration parameters passed from the interface layer. It communicates with coordinators or master sites of each virtual machine view and instructs them to form a virtual machine with the other chosen sites or machines. We discuss this in detail in the next section.
The metaproblem coordination subsystem maps the individual tasks of a complex metaproblem onto different virtual machine views of WWVM and manages overall coordination and synchronization of tasks executed on each view. Another important task of this subsystem is to prepare the environment for the compilation subsystem by transferring required files to all the nodes before compilation and execution.
We use a coordination language, called the Data Flow/Task Dependency Description Language, that lets the programmer describe the main tasks and task groups of the problem, and the interactions among these components, to the coordination subsystem.
Each task is an independent computation entity that reads its inputs from a data file (supplied by the user through the Web interface or produced as the output of another task) and writes its outputs to another data file or to the Web-based output console. Closely correlated tasks with similar platform requirements can be brought together to form a task group. These tasks can be launched on the same virtual machine view and can therefore, most of the time, use faster data- and message-passing mechanisms than tasks scheduled on different nodes.
For each task group we assume that the inputs from other nodes are read in all at once before its tasks begin executing, and that all of its outputs are saved back into the output files when it finishes.
The tasks are described in the following way using the Web interface:
< TASK = "T1"
EXECUTABLE = "task1.exe"
SOURCE = "simpleGE.hpf :: HPF"
ARCHITECTURES = "SUN4", "ALPHA"
INPUT = "matrix.dat :: http://npac.syr.edu:8080/task1.in"
OUTPUT = "reverse.dat :: task1.out1", "stdout :: task1.out2"
>
The TASK keyword describes a task named T1 that has an executable called task1.exe for the SUN4 and ALPHA platforms. Heterogeneous systems, or systems that do not share the same file system, interpret the ARCHITECTURES statement and handle the transfer of the source code to the appropriate processing nodes before compilation or execution; executable codes can be transferred directly between compatible machines. The SOURCE entry identifies the source file, simpleGE.hpf, as written in HPF. Before the task can execute, its input file matrix.dat must be made ready by fetching it from the given URL, and its outputs (the file reverse.dat and the standard output stream) are saved as task1.out1 and task1.out2 once it has executed.
Once all tasks are defined, the user may organize them into task groups. For example, the second task group (Figure 3) is defined as:
< TASK_GROUP = "TG_2"
CONTAINS = "T5", "T6", "T7"
>
Once the task and task group descriptions are parsed and analyzed, an acyclic data flow/task dependence graph similar to the one in Figure 3 is constructed. This graph, augmented with platform specifications, is later distributed to the coordinator nodes of the virtual machine views. The coordinator Web servers launch a task or task group once all of its inputs are ready, after being notified by the metaproblem coordinator, and in response they report its normal or abnormal termination to the metaproblem coordinator. Appropriate actions can then be taken depending on this status: for example, a failed task may be retried, or results may be conveyed to the application layer.
Pulling the data from the appropriate tasks is done in a distributed manner. Each task fetches its own data file from the supplying task's server, using URL_GET if it is on a remote server, or local copy or symbolic-link mechanisms if it is on a local file system. Data is passed to nodes on other machines by writing it to directories accessible to the WWW servers and reading it through Web-based access mechanisms on the other sites.
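A sketch of this pull mechanism in C follows. The helper name fetch_input is ours, and the url_get command below merely stands in for whatever Web-fetch utility the URL_GET mechanism invokes; both are illustrative assumptions, not the WWVM's own code.

    /* fetch_input.c -- sketch of the distributed data pull (illustrative). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Make `source` available under `local_name` in the task's working
       directory: a Web fetch for remote servers, a symbolic link locally. */
    int fetch_input(const char *local_name, const char *source)
    {
        char cmd[1024];

        if (strncmp(source, "http://", 7) == 0) {
            /* Remote: pull the file through the supplying site's Web server. */
            snprintf(cmd, sizeof cmd, "url_get %s > %s", source, local_name);
            return system(cmd);
        }
        /* Local: a symbolic link avoids copying large data files. */
        unlink(local_name);
        return symlink(source, local_name);
    }

    int main(void)
    {
        /* e.g. the INPUT line of task T1 above */
        return fetch_input("matrix.dat", "http://npac.syr.edu:8080/task1.in");
    }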
Platform specification for the task groups is kept separate from the task group specifications for flexibility. Tasks are mapped onto different views of the WWVM:
< VIEW = "VM_1"
TASK_GROUP = "TG_1"
>
< VIEW = "VM_2"
TASK_GROUP = "TG_2"
>
< VIEW = "VM_3"
TASK_GROUP = "TG_3"
>
< VIEW = "VM_1"
TASK_GROUP = "TG_4"
>
In the example, after the node executing T1 is done, the metaproblem coordinator is notified of its completion. The metaproblem coordinator activates the next task group by sending a < START > message to that group's master node when all the tasks supplying input to that task group are completed; that is, once task groups TG_2 and TG_3 end, TG_4 is fired.
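The firing rule amounts to reference counting on the dependence graph: each task group keeps a count of the supplier groups that have not yet finished, and it is started when that count drops to zero. The following C sketch illustrates the rule for the example above; the data structures and names are ours, not the WWVM's.

    /* fire.c -- sketch of the metaproblem coordinator's firing rule. */
    #include <stdio.h>

    #define MAX_GROUPS 16

    struct task_group {
        const char *name;
        int pending;            /* supplier groups not yet finished      */
        int nsucc;              /* number of groups consuming our output */
        int succ[MAX_GROUPS];   /* indices of those successor groups     */
    };

    static struct task_group g[MAX_GROUPS];

    /* Called when a coordinator Web server reports that group `done` has
       terminated normally; fires every successor whose inputs are ready. */
    void on_group_done(int done)
    {
        int i;
        for (i = 0; i < g[done].nsucc; i++) {
            int s = g[done].succ[i];
            if (--g[s].pending == 0)
                printf("send < START > to master node of %s\n", g[s].name);
        }
    }

    int main(void)
    {
        /* Figure 3 example: TG_2 and TG_3 both feed TG_4. */
        g[2].name = "TG_2"; g[2].pending = 0; g[2].nsucc = 1; g[2].succ[0] = 4;
        g[3].name = "TG_3"; g[3].pending = 0; g[3].nsucc = 1; g[3].succ[0] = 4;
        g[4].name = "TG_4"; g[4].pending = 2; g[4].nsucc = 0;

        on_group_done(2);   /* TG_4 still waits on TG_3 */
        on_group_done(3);   /* now TG_4 fires           */
        return 0;
    }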
The language used in writing the task's code, the type of the message-passing libraries used (if any), and the virtual machine view that will be used to launch this task are all derived from the task description information supplied through the Web interface.
The compilation subsystem compiles the C, Fortran, and HPF programs that make up the tasks of a metaproblem. It activates a Perl script to carry out the requested compile/link/load/execute cycle and to present the results. The user can then alter the input program (to correct bugs, perhaps), select a new code, or change the service options, and resubmit the form for recompilation.
For compiling the HPF programs we used NPAC's experimental HPF compiler, which translates HPF source code into Fortran 77 with calls to the message-passing and runtime support (RTS) libraries. The RTS libraries, which are written using MPI, provide support for common collective communication operations and intrinsic functions of HPF.
Programs using the PVM libraries can run directly on top of the WWVM platform; those written using the MPI libraries must be processed before being executed on this platform. We wrote an MPI-to-PVM interface layer that emulates MPI calls in terms of PVM calls. This interface is based on Mississippi State University's public-domain software Unify [3], which implements a subset of MPI functions on top of PVM (version 3.3.2); the PVM functions were modified to add the communication "contexts" needed for MPI's protocols. The subset consists of simple point-to-point and collective communication, communicator management, and environment management functions, with both C and Fortran wrappers, but the system is supported only on Sun and SGI workstations. Since our RTS libraries were written using MPI, we implemented the functions necessary to run any HPF or MPI program on top of this interface. One of our important extensions to Unify is the implementation of routines that use a Cartesian-coordinate processor topology. We also ported Unify to the other systems that will be part of the WWVM.
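To illustrate the emulation technique, the sketch below expresses a byte-typed MPI-style send and receive in terms of standard PVM 3 calls, folding the MPI (context, tag) pair into a single PVM message tag. The names my_MPI_Send and my_MPI_Recv, the rank_to_tid lookup, and the tag encoding are our simplifications; the actual Unify-based layer also handles MPI datatypes, wildcards, and communicator management.

    /* mpi2pvm.c -- sketch of MPI point-to-point emulation over PVM 3. */
    #include <pvm3.h>

    extern int rank_to_tid(int rank);  /* assumed lookup: MPI rank -> PVM tid */

    /* Fold an MPI (context, tag) pair into one PVM message tag so messages
       from different communicators cannot be confused with each other. */
    static int encode_tag(int context, int tag)
    {
        return (context << 16) | (tag & 0xffff);
    }

    /* Emulate a byte-typed MPI_Send using PVM pack-and-send primitives. */
    int my_MPI_Send(void *buf, int count, int dest, int tag, int context)
    {
        pvm_initsend(PvmDataDefault);       /* fresh send buffer, XDR coding */
        pvm_pkbyte((char *)buf, count, 1);  /* pack `count` bytes, stride 1  */
        return pvm_send(rank_to_tid(dest), encode_tag(context, tag));
    }

    /* Emulate the matching blocking receive. */
    int my_MPI_Recv(void *buf, int count, int src, int tag, int context)
    {
        pvm_recv(rank_to_tid(src), encode_tag(context, tag));
        return pvm_upkbyte((char *)buf, count, 1);
    }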
The compiled codes are linked with the runtime and message-passing libraries to obtain executable codes on each machine. The executables are then mapped onto the nodes of the virtual machine view, with one or more processes per node according to the number of machines coordinated by each coordinator.
The virtual machine layer combines resources comprising heterogeneous workstations (SUN, DEC, SGI, IBM) and small parallel machines (SP-2, CM-5). We discuss its main building blocks below.
Communication daemons are responsible for low-level message-passing among the Web computation nodes. Once a daemon is started on a host it stays alive until that host is removed from the machine configuration.
PVM is a de facto standard message-passing parallel processing system that has gained widespread acceptance in the user community. We chose to adapt the PVM daemons to our environment, instead of writing new communication daemons, to avoid the long development time and uncertainty of starting from scratch.
In PVM each user has his own daemons on each node of the virtual machine. These daemons coordinate communication, spawn processes on the same machine, and detect and report host failures. A node table containing housekeeping information about each node of the virtual machine is compiled and maintained by the master daemon and is distributed to all other daemons at start-up time.
There are further advantages that come directly from using the PVM daemons.
We modified the PVM daemon structure in various ways to fit into our framework without affecting the interface seen from outside. This gave us the ability to use existing codes written using PVM without change.
In native PVM, when adding a new host, the PVM daemon on the local host activates a PVM daemon on each new host by using the "rsh" or "rexec" utilities. This requires a user account and valid password on each new host. Moreover, when a single host starts remote shells on many other hosts, the number of open files quickly exceeds operating system limits, or the simultaneous execution of a large number of copies of the same executable file may overload the file server, causing lost requests. In our Web-based computing environment the user has no real accounts (and therefore no UNIX passwords) on the machines that make up the virtual machine; he/she accesses the whole system through the Web site responsible for the user interface.
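For contrast, the fragment below shows how a native PVM program grows its virtual machine with the standard pvm_addhosts call (the host names are made-up examples). Every host added this way must be reachable through rsh or rexec with a valid account, which is exactly the requirement our Web-based start-up avoids.

    /* addhosts.c -- native PVM host addition, shown for contrast. */
    #include <stdio.h>
    #include <pvm3.h>

    int main(void)
    {
        char *hosts[] = { "alpha1.npac.syr.edu", "sun4a.npac.syr.edu" };
        int   infos[2];
        int   added;

        /* The local pvmd contacts each host over rsh/rexec, which requires
           an account and password there, and starts a remote pvmd on it;
           infos[i] reports per-host success or failure. */
        added = pvm_addhosts(hosts, 2, infos);
        printf("%d of 2 hosts added\n", added);

        pvm_exit();
        return 0;
    }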
The following are the primary CGI modules that configure the virtual machine layer.
The console is located at the master site, i.e., the server site that is directly interfaced with the metaproblem coordinator. The console is used to add or delete nodes to/from the virtual machine configuration or to monitor the general configuration of the machine (see Figure 4). When a new node is to be added, the communication daemons at that node are started by the console. The console has three main functions while configuring a virtual machine:
When a new node is to be added to the configuration, the console sends an < ADD NODE> command to the local communication daemon at the master server site (Figure 4). This module takes the configuration parameters of the master server daemon (steps 1-3), opens an HTTP connection to the slave site's Web server (step 4), starts its slave manager, and passes the configuration parameters to it. It then waits for the slave node's configuration parameters to arrive and passes them to its own daemon (step 8). This completes the < ADD NODE> operation.
The user can remove a node from the virtual machine configuration. The console instructs its own communication daemon to delete the specified node from the virtual machine by removing its address from the node table. The new table is broadcast to all other nodes.
The console obtains the node table from the local communication daemon and displays the current virtual machine configuration. This configuration contains information such as machine architecture types and relative speed of each machine.
The slave manager accepts incoming requests from the master site (or from other slave sites) and takes the necessary actions. If the incoming request is an < ADD NODE> command, the request-line parameters (the Web equivalent of C command-line arguments) include the configuration of the master site's communication daemon. The slave manager starts a local communication daemon and passes the incoming parameters to it (step 5 of Figure 4). In response, the daemon sends back its own configuration parameters (step 6), which are forwarded to the master site as an HTML document (step 7).
The communication daemon is killed if no acknowledgment is received from the master site daemon within a certain amount of time. This timeout feature is useful in detecting link or node failures.
A number of helper modules are used by the master and slave managers to construct a virtual machine.
As always, it was a challenge to provide access to more powerful, higher-level computing services without compromising security. Although we used Web authentication based on Web collaboration passwords and IP addresses, this mechanism cannot be considered secure, since user IDs and passwords are transmitted over the Internet in "plain-text" form, allowing intruders to eavesdrop on the lines and steal legitimate users' passwords. We precluded some possible mischief in such a case by severely restricting the abilities of whatever user or pseudo-user the HTTP daemon runs as (for example, user "nobody") and by monitoring the access records carefully for signs of abuse [4].
Several companies are working on secure Web protocols and systems that can be used in commercial applications. There are currently two protocols available for conducting secure transactions. The first is Secure Sockets Layer (SSL) [14], from Netscape Communications; the other is Secure-HTTP (S-HTTP) [15], proposed by EIT. We have already begun to see servers and browsers supporting secure transactions, and we believe that once secure means of encryption and authentication become widespread, Web security issues will converge to UNIX security issues.
We demonstrated the feasibility of using Web technologies combined with HPCC tools to construct a heterogeneous parallel/distributed virtual machine. We successfully integrated a Web-based metaproblem programming environment on top of this platform that enables the definition and execution of metaproblems consisting of HPF, Fortran, and C tasks that use the PVM and MPI message-passing libraries.
We would like to thank Wojtek Furmanski, Tom Haupt, Donald Leskiw, and Xiaoming Li for providing their feedback in various stages of this work, and Elaine Weinman for proofreading this manuscript.
Kivanc Dincer is a Ph.D. candidate at Syracuse University and a research assistant at the Northeast Parallel Architectures Center. He has an M.S. degree in Computer Science from Iowa State University. He has worked on the design and development of several HPCC-related research projects, including the NPAC F90D/HPF compiler and the Parallel Compiler Runtime Consortium runtime support systems. His major research interests include parallel processing, distributed computing, and Web-based parallel programming environments.
(dincer@npac.syr.edu)
(http://www.npac.syr.edu/users/dincer/)
Geoffrey C. Fox is an internationally recognized expert in the use of parallel architectures and the development of concurrent algorithms. He leads a major project to develop prototype high performance Fortran (Fortran90D) compilers. He is also a leading proponent for the development of computational science as an academic discipline and a scientific method. His research on parallel computing has focused on development and use of this technology to solve large scale computational problems. Fox directs InfoMall, which is focused on accelerating the introduction of high speed communications and parallel computing into New York State industry and developing the corresponding software and systems industry. Much of this activity is centered on NYNET with ISDN and ATM connectivity throughout the state including schools where Fox is leading developments of new K-12 applications that exploit modern technology. (gcf@npac.syr.edu) (http://www.npac.syr.edu/users/gcf/)