Cheng Che Chen
cchen@leland.stanford.edu
Jaswinder Pal Singh
jps@cs.princeton.edu
http://www.princeton.edu/~jps/
Russ B. Altman
altman@smi.stanford.edu
http://www-smi.stanford.edu/people/altman/
The determination of molecular structure is a key
element of molecular biology. Proteins and nucleic acids are
complex three-dimensional structures with thousands of atoms.
An understanding of their structure can facilitate the study
of how these molecules function and aid in the design of drugs
which augment, interfere with, or otherwise affect their functions.
The problem of determining molecular structure is
usually approached experimentally, using X-ray crystallography,
nuclear magnetic resonance (NMR), or a variety of biochemical
experiments that yield distances, angles, and other structural
information [11]. In addition to these experimental data, there
is a large body of knowledge from general chemistry: bond lengths,
bond angles, and torsion angles between atoms. The quality and
abundance of data from these sources can vary significantly.
Thus, it is important to combine these sources of information
effectively to produce not only a structure consistent with the
data, but also a measure of the variability in the estimated structure.
A probabilistic algorithm for integrating multiple
sources of uncertain data to compute an estimated structure was
reported in [1]. The parallelization of this algorithm was presented
in [6]. An extension to the algorithm to handle more general
probability distributions was published in [2]. In this paper,
we first briefly describe the formulation of the molecular structure
determination problem and its algorithm. Then we show that by
taking advantage of the problem structure of our application domain,
we are able to organize the computation hierarchically. The
hierarchical decomposition speeds up the computation tremendously
by reducing the sequential computational complexity. In addition,
the use of hierarchy exposes a new axis of parallelism that is
both important and challenging to exploit. The results we have
obtained for two problems reveal several issues in the automation
and efficient parallelization of the hierarchical approach. Finally
we outline the directions we are investigating to address these
issues.
In the following discussion, we use measurements,
observations, constraints, and data interchangeably
to denote the available information about the structure of a molecule.
We represent the unknown rectangular coordinates of the atoms of the molecule under examination by an n-dimensional state vector x = (x1, y1, z1, ..., xp, yp, zp)^T, where n = 3p is three times the number of atoms. An idealized measurement of the molecule is modeled by a vector-valued function y(x) of the state vector.
For example, NMR measurements yield short-range distances between neighboring atoms. A distance function between two atoms of the molecule has the form d(x) = sqrt((xi - xj)^2 + (yi - yj)^2 + (zi - zj)^2), where x is the state vector of atomic coordinates, and (xi, yi, zi) and (xj, yj, zj) are the positions of the two atoms i and j, respectively.
Random errors and noise in the measurement process are modeled as additive terms, so that an actual observation assumes the form z = y(x) + v,
where the noise v is modeled as a Gaussian
random vector. If the measurement process is unbiased, then repeated
independent measurements will cluster around the "true"
value, and the random noise will have zero mean. The covariance
matrix R of the Gaussian noise term is related to the precision
of the measurement technology. High-precision measurements have
tight variances. In the notation of probability, we assume v ~ N(0, R).
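As a concrete illustration of this measurement model (our own sketch, not code from the paper; the function name, the four-atom state, and the variance value are arbitrary), the following Python fragment builds a distance measurement function and simulates a noisy observation z = y(x) + v with v ~ N(0, R).

```python
import numpy as np

def distance(x, i, j):
    """Distance between atoms i and j, where x is the flat state vector
    (x1, y1, z1, ..., xp, yp, zp) of atomic coordinates."""
    return np.linalg.norm(x[3 * i:3 * i + 3] - x[3 * j:3 * j + 3])

rng = np.random.default_rng(0)
x_true = rng.normal(size=3 * 4)                 # hypothetical 4-atom molecule (n = 12)
R = np.array([[0.01]])                          # measurement covariance: tight variance = high precision
v = rng.multivariate_normal(np.zeros(1), R)     # additive Gaussian noise, v ~ N(0, R)
z = distance(x_true, 0, 1) + v[0]               # the actual observation, z = y(x) + v
```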
In addition to the state vector x, we also
maintain a covariance matrix C. The diagonal entries of
C are the variances of the coordinates of the state vector
and provide a measure of the uncertainty of the estimated numbers.
The off-diagonal entries are the covariances between the coordinates;
they indicate linear correlations between the components of x.
The information contained in the covariance matrix is useful
in assessing, for example, which parts of the molecule are better
defined by the data. Hence the pair (x, C)
represents our best estimate of the molecular structure along
with an indication of the variability of the estimated numbers,
given the available data.
We may incrementally update our structure estimate as new observations become available. Figure 1 shows the procedure for the application of one constraint [8].
The algorithm is able to process observations in
sequence rather than all at once because the effects of previous
updates are summarized as linear correlations in the covariance
matrix. Given an initial structure estimate, we apply this procedure
over the entire set of constraints to produce an updated structure.
Because of the nonlinearity in the measurement functions, we
typically have to re-initialize the covariance matrix and repeat
the cycle of updates until the estimate converges to an equilibrium
point. This algorithm can be extended to process non-Gaussian
constraints [2].
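Figure 1 is not reproduced here, but the update it describes is the Kalman-style procedure of [8]: linearize the constraint function at the current estimate, compute the filter gain, and correct the state and covariance. The Python sketch below shows one such update for a single constraint vector under our own assumptions: a finite-difference Jacobian in place of analytic derivatives, and a single linearization pass rather than the inner loop of Figure 1.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def jacobian(h, x, eps=1e-6):
    """Finite-difference Jacobian of the measurement function h at x
    (a stand-in for however the constraint functions are differentiated)."""
    m = len(h(x))
    H = np.zeros((m, len(x)))
    for k in range(len(x)):
        dx = np.zeros_like(x)
        dx[k] = eps
        H[:, k] = (h(x + dx) - h(x - dx)) / (2 * eps)
    return H

def apply_constraint(x, C, h, z, R):
    """One Kalman-style update of the estimate (x, C) with the observation
    z = h(x) + v, v ~ N(0, R).  Returns the updated (x, C)."""
    H = jacobian(h, x)                      # linearized measurement model (m x n)
    S = H @ C @ H.T + R                     # covariance of the linearized observation (m x m)
    K = cho_solve(cho_factor(S), H @ C).T   # filter gain (n x m), via Cholesky factorization and system solves
    x_new = x + K @ (z - h(x))              # state update
    C_new = C - K @ H @ C                   # covariance update, (I - K H) C
    return x_new, C_new
```

Repeated application of such an update over all constraints, with re-linearization between cycles, gives the iteration described in the text.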
Let's briefly analyze the computational complexity
of this algorithm. Suppose the dimension of the state vector
is n. Furthermore, assume that we have M scalar
observations organized into vectors of uniform dimension m,
so that the sequential update algorithm is applied M/m
times.
Application of the update procedure using one m-dimensional observation vector entails several kinds of computation: vector operations to evaluate the constraint functions and their derivatives, a Cholesky factorization of an m x m matrix, computation of the n x m filter gain matrix through system solves, and the updates of the state vector and covariance matrix.
The total amount of computation for one pass over the entire
set of constraints is M/m times the above steps. To assess
the effect of the batch factor m on the total computational
time, we consider the average work per scalar constraint by dividing
the total computational time by the total dimension of constraints.
We see from the above categories that the average work per scalar
constraint for vector operations decreases as m increases;
for Cholesky factorization it grows as O(m^2)
with m; and for the filter gain matrix computation it increases
as O(mn) per scalar observation.
Therefore we expect the average work per scalar observation
to increase linearly at the beginning as we process larger observation
vectors at a time. However, when m is increased to a magnitude comparable with n, the O(m^2) growth in
Cholesky factorization will start to dominate.
This simple analysis suggests that we process observations
in small chunks, although it does not indicate exactly what size
of chunks to use since asymptotic analysis does not capture the
behavior of the algorithm for small values of its parameters.
Also, it does not take into account the number of cycles of constraint
application to convergence. In practice, convergence seems to
be slightly faster with larger constraint batches.
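To make the batch-size tradeoff concrete, here is a rough per-scalar-constraint cost model of our own (the coefficients and node size are arbitrary placeholders, not measured values): a per-batch overhead amortized over m scalars, an O(mn) term for the filter gain, and an O(m^2) term for the Cholesky factorization.

```python
def per_scalar_cost(m, n, a=50.0, b=1.0, c=1.0):
    """Illustrative average work per scalar constraint for batch size m and
    state dimension n; a, b, c are arbitrary machine-dependent coefficients."""
    return a * n / m + b * m * n + c * m * m

n = 300                                              # hypothetical node size
best_m = min(range(1, n + 1), key=lambda m: per_scalar_cost(m, n))
print(best_m)                                        # under this model, a small batch size is best
```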
Biological molecules have a natural hierarchy in
their structure. Observations relating to the structure of the
molecules tend to be largely localized following this hierarchy
as well. For example, the amino acid residues which make up a
protein consist of a common backbone and distinguishing sidechains.
Residues which are nearby in the linear sequence of the protein's
primary structure may arrange themselves in space to form secondary
structures such as helices and sheets. These hierarchical subunits
may interact to form further aggregate features [10].
In this section, we present a hierarchical decomposition
of the state estimate update process. It derives from the problem
structure of the application domain, namely locality of most observations.
Mathematically the key observation is that updating the structure
estimate with localized information involves computation with
many zeros and leaves other currently uncorrelated (zero covariance)
parts of the estimate unmodified. This fact can be exploited
to reduce the computational complexity of the update procedure
by recursively dividing the molecule into smaller and smaller
parts and applying every constraint to the smallest unit which
wholly contains it.
Let our current estimate of the molecular structure be (x^-, C^-). Suppose the state vector estimate consists of two uncorrelated parts x(1) and x(2) (whose dimensions are n1 and n2, respectively, with n1 + n2 = n); that is, the covariance estimate C^- is block diagonal, with diagonal blocks C^-(1) and C^-(2) (of dimensions n1 x n1 and n2 x n2) and zero off-diagonal blocks.
Let's examine what happens when we update our estimate with an observation which pertains to only one part (say, x(1)) of the structure. The function (say, of dimension m) which relates this observation to the underlying state of the structure then depends only on that part; it has the form h(x) = h(x(1)).
If we trace the execution of the update process (Figure 1), inside the loop we first differentiate the function h with respect to the n state variables of x to produce a linearized relation H_j (dimension m x n), evaluated at the current estimate x_{+,j-1}. However, since h depends only on the first n1 state variables, the last n2 elements of each row of H_j are zeros; that is, H_j = [ H_j(1)  0 ], where H_j(1) has dimension m x n1.
This in turn leads to a localized non-zero structure for the (n x m) matrix C^- H_j^T: only its first n1 rows can be non-zero. The (m x m) matrix (H_j C^- H_j^T + R) is the covariance of the locally linearized model of the observation and in principle could be fully dense. However, the (n x m) gain matrix K_j = C^- H_j^T (H_j C^- H_j^T + R)^{-1} has the same non-zero structure as C^- H_j^T.
The zeros in the lower half of the filter gain matrix prevent change to the lower n2 coordinates of the state vector estimate. Furthermore, since K_j H_j is non-zero only in its upper left (n1 x n1) block, the update to the state covariance estimate, C_{+,j} = (I - K_j H_j) C^-, is restricted to the upper left block: C^-(1) is replaced by (I - K_j(1) H_j(1)) C^-(1), while C^-(2) and the zero off-diagonal blocks are unchanged.
Thus this observation of a local part of the molecule
produces only a corresponding local update to our structure estimate.
We should note that if we now introduce a constraint
between the two parts of the molecule, then the off-diagonal zero
blocks of the state covariance estimate will start to be filled
in, and both halves of the state estimate will begin to change
in response to new data.
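The localization argument above can be checked numerically. The following fragment (our own, reusing the apply_constraint sketch from earlier) builds a block-diagonal prior covariance, applies a distance constraint that touches only the first part of the state, and verifies that the second part of the estimate, its covariance block, and the zero off-diagonal blocks are untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 6, 9                                         # two initially uncorrelated parts
A = rng.normal(size=(n1, n1))
B = rng.normal(size=(n2, n2))
C_prior = np.block([[A @ A.T, np.zeros((n1, n2))],
                    [np.zeros((n2, n1)), B @ B.T]])   # block-diagonal covariance
x_prior = rng.normal(size=n1 + n2)

# A constraint involving only the first part: the distance between atoms 0 and 1.
h = lambda x: np.array([np.linalg.norm(x[0:3] - x[3:6])])
z, R = np.array([1.5]), np.array([[0.01]])

x_post, C_post = apply_constraint(x_prior, C_prior, h, z, R)
assert np.allclose(x_post[n1:], x_prior[n1:])             # lower part of the state unchanged
assert np.allclose(C_post[n1:, n1:], C_prior[n1:, n1:])   # its covariance block unchanged
assert np.allclose(C_post[:n1, n1:], 0.0)                 # off-diagonal blocks remain zero
```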
The above derivation suggests a decomposition principle to reduce useless computation with zeros: given an initial structure estimate (x^-, C^-), recursively partition the structure into initially uncorrelated sub-structures, apply each observation to the smallest sub-structure that wholly contains the atoms it involves, and update a parent only after all of its children have been updated.
We should emphasize that this hierarchical organization
achieves the same computation as the original flat problem. The
difference is in the elimination of useless operations with zeros
in the application of localized observations.
We can visualize a hierarchical organization of the
molecular structure such as shown in Figure 2. Our original molecule
of interest is positioned at the root. A node may give rise to
some number of children, corresponding to initially uncorrelated
sub-structures of the node. Associated with each node are observations
local to itself; that is, we try to apply constraints at the lowest
level of the tree possible. The update process starts at the
leaf level. We update a leaf node with its associated constraints
by applying the procedure of Figure 1. When all the children
of a node have been updated, we may proceed to update the node
itself.
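The following sketch (our own, with hypothetical data structures; it reuses the apply_constraint routine from the earlier sketch) expresses this bottom-up order as a post-order traversal of the structure hierarchy: a parent assembles a block-diagonal estimate from its children, which are uncorrelated at that point, and then applies its own constraints.

```python
import numpy as np
from scipy.linalg import block_diag
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node of the structure hierarchy: its local constraints and, for a leaf,
    an initial estimate (x, C).  Internal nodes assemble their estimate from
    their children."""
    constraints: list                      # list of (h, z, R) triples local to this node
    x: np.ndarray = None
    C: np.ndarray = None
    children: list = field(default_factory=list)

def update_tree(node):
    """Post-order traversal: update all children first (their subtrees are
    independent and could be processed in parallel), then the node itself."""
    for child in node.children:
        update_tree(child)
    if node.children:
        node.x = np.concatenate([c.x for c in node.children])
        node.C = block_diag(*[c.C for c in node.children])   # children are uncorrelated
    for h, z, R in node.constraints:       # constraints assigned to this level
        node.x, node.C = apply_constraint(node.x, node.C, h, z, R)
    return node.x, node.C
```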
In this section we use a set of simple problems to
compare the computational efficiency of the hierarchical approach
with that of the flat organization.
An RNA [10] molecule consists of a linear sequence
of bases folded into a unique shape in space. There are four
types of bases, A, C, G, and U, each consisting of a common backbone
and a distinguishing sidechain. Besides the covalent bonds which
chain the bases together, complementary bases may pair up out
of the linear order. We focus on the simplest motif of RNA structure,
a double helix. Structurally, an RNA double helix is made up
of a series of base pairs twisted into a spiral pattern.
In the flat organization, we simply treat the helix
as a long vector of atoms. For hierarchical decomposition, we
recursively divide the helical structure into two sub-helices
of equal length, down to individual base pairs. We further decompose
a base pair into two bases, each of which is divided into the
leaf level of a backbone and a sidechain. This decomposition
is illustrated in Figure 2.
The constraints we use to define the shape of the helix fall into five categories: distances within a backbone, distances within a sidechain, distances between the backbone and the sidechain of a base, distances between the two bases of a base pair, and distances between sub-helices that are joined into a parent helix.
For hierarchical computation, the first two types
of distances may be applied to the leaf nodes of backbones and
sidechains. One level up in the hierarchy, the individual bases
may be assigned the third type of distances to orient the backbone
and the sidechain. The fourth type of constraint positions two
bases into a base pair. Above the level of base pairs, we apply
the last category of distances whenever we join two sub-helices
to form a parent helix.
Table 1 and the accompanying Figure 5 show the execution
times, for both flat and hierarchical organization, for helices
of various lengths (in number of base pairs). We see that the
average execution time per scalar constraint with hierarchical
decomposition grows much more slowly than with flat organization
of the same problems.
| Helix Length (base pairs) | Number of Atoms | Number of Constraints | Flat Total Time (sec) | Flat Time per Constraint (sec) | Hierarchical Total Time (sec) | Hierarchical Time per Constraint (sec) | Speedup |
| 1 | 43 | 675 | 1.16 | 0.00172 | 0.65 | 0.00096 | 1.78 |
| 2 | 86 | 1574 | 7.78 | 0.00494 | 2.42 | 0.00154 | 3.21 |
| 4 | 170 | 3294 | 54.09 | 0.01642 | 8.45 | 0.00257 | 6.40 |
| 8 | 340 | 6810 | 427.23 | 0.06274 | 30.98 | 0.00455 | 13.79 |
| 16 | 680 | 13824 | 3436.18 | 0.24857 | 114.20 | 0.00826 | 30.09 |
We can qualitatively understand the phenomenon as
follows. Suppose the size of the molecule (dimension of the state
vector) is n. The computational complexity of the update
procedure of Figure 1 grows as O(n^2) for a single
scalar constraint. Therefore for flat organization the quadratic
growth of per-constraint execution time is expected.
Next consider a hierarchical decomposition of a molecular structure in which a parent node gives rise to two children of the same size, half as large as the parent. If the depth of the binary tree is d, then at level i (0 <= i <= d) there will be 2^i nodes, each of size O(2^(d-i)). The average execution time per constraint depends on how much of the computation we can push down the tree. Suppose a node at level i has m_i scalar constraints; since the cost per scalar constraint at such a node grows as the square of its size, the average execution time of a constraint is sum_{i=0..d} 2^i m_i O(4^(d-i)) divided by sum_{i=0..d} 2^i m_i.
In an optimistic scenario (such as the present examples of helices), most observations are localized, and we are able to assign most of them to nodes down toward the leaf level so that a node gets roughly some constant number of constraints. Then the average run time per constraint grows linearly with the size of the molecule: O(2^d) = O(n).
On the other hand, if a node has as many constraints as those of its children combined, then every level contributes a comparable total number of constraints, and the computational advantage of a hierarchical scheme over a flat organization is more modest: the average cost per constraint grows as O(4^d / d) = O(n^2 / log n), compared with O(n^2) for the flat organization.
We believe that most realistic problems lie between
these two bounds and that the advantages of hierarchical computation
are substantial in practice.
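The two bounds can be illustrated numerically with a small script of our own (the per-scalar cost is taken as proportional to the square of the node size, and the constraint counts per node are arbitrary): the first printed ratio stays roughly constant, indicating linear growth, while the second indicates growth close to n^2 / log n.

```python
def average_cost(d, m_at_level):
    """Average per-scalar cost in a binary hierarchy of depth d: level i has
    2**i nodes of size ~2**(d-i), and the cost per scalar constraint at a node
    is proportional to the square of the node size."""
    work = sum(2**i * m_at_level(i) * (2**(d - i))**2 for i in range(d + 1))
    count = sum(2**i * m_at_level(i) for i in range(d + 1))
    return work / count

for d in (4, 8, 12):
    n = 2**d                                                 # molecule size, up to a constant
    localized = average_cost(d, lambda i: 10)                # constant constraints per node
    aggregated = average_cost(d, lambda i: 10 * 2**(d - i))  # a node has as many as its children combined
    print(n, localized / n, aggregated * d / n**2)
```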
In this section, we discuss the potential parallelism
in our hierarchical algorithm. We first focus on the computation
within one node of the structure hierarchy. The computation required
for each node of the hierarchy is an instance of the flat problem
and involves repeated application of the update procedure of Figure 1
over all the constraints of the node.
A coarse-grained way to parallelize computation for a node is across the application of constraints. We may divide the set of constraints assigned to the node into two disjoint subsets. In parallel, these two groups of constraints can produce, from the same prior, two different updated structure estimates, (x1, C1) and (x2, C2). These two estimates may then be combined to produce a final estimate.
If we partition the collection of constraints into
more than two subsets, then the separately produced results have
to be combined pairwise in a "tournament" fashion.
This coarse-grained approach has the advantage of
low inter-processor communication requirement in the processing
of the constraints: the processing of disjoint subsets is independent.
However, there are two drawbacks. First, the combination process
is an expensive computational overhead which does not exist in
sequential application of constraints. If the node state vector
has dimension n, the combination process requires essentially
the same amount of work as applying a constraint vector of the
same dimension. If the total dimension of the entire set of constraints
is M, then M needs to be much larger than n
for this scheme to be worthwhile. Second, this parallel approach
requires duplication of the state data structures (x^-, C^-),
which take up the lion's share of the space required by a node.
In other words, this approach is more suitable for
processing massive, redundant observations on structures whose
state information is modest in size relative to the available
storage. In contrast, data on biological structures are usually
expensive to obtain and not in great abundance. Thus from the
point of view of both computational overhead and scalability of
memory requirement, parallelizing intra-node computation across
constraint application is not attractive. See [6] for a numerical
example.
We therefore rely on parallelism within the constraint
application procedure (Figure 1) for each set of constraints.
As shown in [6], this involves parallel matrix computation and
yields good performance for nodes of realistic sizes.
Because the nodes in the structure hierarchy become
progressively smaller toward the leaf level, sole reliance on
intra-node parallelism is inadequate. Hierarchical decomposition
exposes a new axis of parallelism beyond parallel matrix computation
within a node. Although the children of a node in the structure
hierarchy must be updated before the node, the computations for
the subtrees rooted at those children are independent (more generally,
nodes that do not have an ancestor-descendant relationship can
be processed in parallel). Therefore given a number of processors
assigned to a node, we may distribute these processors over the
child branches. Moreover, if a node is assigned only a subset
of the available processors, we may migrate the associated data
closer to those processors in a physically distributed memory
environment.
This presents a new challenge for efficient parallelization.
How do we assign processors to nodes to achieve load balance?
We would like to compute statically as much of the assignment
as possible to have better control over data locality.
Our strategy is to estimate the amount of work at
every node and divide processors among nodes accordingly. To
determine how to do this, we constructed a set of nodes of various
sizes. For each node size, we also varied the size of the constraint
vector (number of scalar constraints) input into the update procedure
at a time (Figure 1). We computed the average execution time
of a scalar constraint for each experimental setting. The results
are shown in Table 2 and the accompanying Figure 6.
| Constraint Batch Dimension | Node Size 43 Atoms | Node Size 86 Atoms | Node Size 170 Atoms | Node Size 340 Atoms | Node Size 680 Atoms |
| 1 | 0.00535 | 0.02008 | 0.07784 | 0.34601 | 1.41522 |
| 2 | 0.00324 | 0.01181 | 0.04571 | 0.19945 | 0.80863 |
| 4 | 0.00204 | 0.00712 | 0.02670 | 0.11354 | 0.45738 |
| 8 | 0.00154 | 0.00507 | 0.01868 | 0.07613 | 0.30157 |
| 16 | 0.00141 | 0.00435 | 0.01537 | 0.06001 | 0.23427 |
| 32 | 0.00176 | 0.00514 | 0.01689 | 0.06301 | 0.23850 |
| 64 | 0.00246 | 0.00628 | 0.01916 | 0.06657 | 0.25133 |
| 128 | 0.00387 | 0.00899 | 0.02429 | 0.07583 | 0.27472 |
| 256 | 0.00747 | 0.01533 | 0.03788 | 0.11143 | 0.38431 |
| 512 | 0.01630 | 0.02915 | 0.06277 | 0.15257 | 0.46112 |
Also evident from Figure 6 is the quadratic growth
of per-constraint execution time with the node size, which can
be inferred from the slope of the lines on the log plot.
We applied a constrained least-squares polynomial fit to the data points of Table 2, excluding those for very small constraint batch sizes. We imposed two simple checks on the regression.
Although the constrained regression may not fit the
data points as well as an unconstrained one, it ensures against
catastrophic failure of the fitted model away from the data points.
The result of the regression is a formula for t, the estimated execution time of an equivalent scalar constraint, as a function of the node size n (the dimension of its state vector) and the constraint batch dimension m.
We only included linear dependency on m because its higher-order
effects are negligible over the range of values that we typically
use and because the higher-order fits were not very stable and
failed to satisfy the checks.
Although the above experiments used only distance
constraints (the most prevalent form of data for molecular structures), we
could in principle obtain a work estimation formula for each type
of observation in which we are interested. This experiment may
be done only once for each type of constraint (and not for every
new problem) because the computational steps are the same except
for the evaluation of the constraint functions.
Given a hierarchical decomposition of a structure, we can now distribute the computation to processors using a simple heuristic: estimate the work in every subtree with the work-estimation formula, divide the processors assigned to a node among its child subtrees in proportion to the estimated work of those subtrees, and, once all the children have been updated, let all of the processors assigned to the node cooperate on the node's own constraints using intra-node parallelism.
A potential problem with this static assignment is
that load imbalance may still occur if the mismatch between processor
partition and subtree work ratio is significant.
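A minimal Python sketch of this static assignment (our own; estimate_node_work is a placeholder standing in for the fitted work-estimation formula, node.size and node.processors are hypothetical attributes, and corner cases such as more children than processors are ignored):

```python
import math

def estimate_node_work(node):
    """Placeholder for the fitted formula: here simply proportional to the square
    of the node's state dimension times its number of scalar constraints."""
    return node.size ** 2 * len(node.constraints)

def subtree_work(node):
    """Total estimated work in the subtree rooted at node."""
    return estimate_node_work(node) + sum(subtree_work(c) for c in node.children)

def assign_processors(node, processors):
    """Give `processors` to the node (they all cooperate on its own constraints
    after its subtrees finish), and split them among the child subtrees in
    proportion to the subtrees' estimated work."""
    node.processors = processors
    if not node.children:
        return
    works = [subtree_work(c) for c in node.children]
    exact = [processors * w / sum(works) for w in works]
    shares = [math.floor(e) for e in exact]
    # Largest-remainder rounding so the integer shares add up to `processors`.
    for k in sorted(range(len(shares)), key=lambda k: exact[k] - shares[k], reverse=True):
        if sum(shares) >= processors:
            break
        shares[k] += 1
    for child, share in zip(node.children, shares):
        assign_processors(child, share)
```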
We now present the results of parallel computation for two problems. One is the RNA double helix of length 16 base pairs (Figure 2) described in Section 3.1. The other is the prokaryotic 30S ribosomal subunit. This molecular complex consists of 21 proteins and the 16S rRNA molecule, which contains about 65 double helices plus roughly the same number of interconnecting coils. The positions of the proteins have been determined by neutron diffraction mapping [4][5] and serve as reference points for placing other parts of the macromolecule. The helices and coils consist of RNA bases with geometrical constraints between them. They are positioned using experimental data which provide information about distances between helices and from helices to the proteins. The modeled problem has about 900 pseudo-atoms and 6500 constraints. A low-resolution discrete conformational space search [3] was first performed to generate initial structure estimates. This preprocessing step serves to combat low-quality locally optimal solutions which may be encountered by analytical procedures such as the current algorithm. See [7] for details of the problem and its computation. The decomposition of the 30S ribosome problem is indicated in Figure 4.
The execution platforms on which we performed our
experiments are the Stanford DASH multiprocessor [9] and a Silicon
Graphics Challenge. The DASH is an experimental research multiprocessor
built at Stanford University. It supports an implicit shared-memory
communication abstraction in hardware, with hardware-supported
cache coherence. The machine we used has 32 processors organized
into 8 clusters. A cluster comprises four 33-MHz MIPS R3000 processors
connected by a shared bus, and clusters are connected together
in a mesh network. Every processor has separate 64KB first-level
instruction and data caches and a 256KB second-level cache. Although
the address space is logically shared, the main memory is physically
distributed. Every cluster has an equal fraction of the physical
memory of the machine. All caches in the system are kept coherent
in hardware using a distributed directory-based protocol.
The SGI Challenge is also a shared-memory multiprocessor.
It has sixteen 100-MHz MIPS R4400 processors with R4010 FPUs, connected to a 1.2 GB/sec bus. Each CPU has separate 16KB first-level instruction
and data caches and a 1MB unified second-level cache. Unlike
the DASH, the Challenge has centralized memory organization.
For both problems, we measured the execution time
of the program over one complete cycle of application of all the
constraints. It typically takes between 20 and 200 cycles over
the constraint set for the solution to converge, but the pattern
of computation is the same across cycles. We have excluded time
for input, output, and initialization to discount cold-start effects.
On DASH, to improve locality in main memory, we distributed some of the larger, contiguous pieces of data for a node only among the processor clusters assigned to that node, in a round-robin fashion to avoid hot spots. We also recorded the time spent in
the different types of array operations which make up almost all
the execution time of the application in a typical run.
The results are shown in Table 3 through Table 6
and their accompanying figures. One confirmation of the effectiveness
of the parallel hierarchical approach is that, compared to the
largest problem we computed in [6], the current problems are much
larger and yet executed in a fraction of the time. We see good
speedups for both problems, around 24 for 32 processors on DASH
and 14 for 16 processors on the Challenge. From the tables it
is clear that most of the performance comes from efficient system
solves, dense matrix multiplications, and dense matrix-vector
multiplications. These operations have low communication requirement,
and their computation can be effectively tiled for cache locality.
On the other hand, the dense-sparse matrix multiplication routines
achieve only about 55-75% of the ideal speedup on DASH. This
can again be understood in terms of locality, since the sparse
matrix to some degree randomly accesses its dense counterpart,
and remote cache misses (the proportion of which increases with
more processors) are more expensive than local cache misses on
DASH. The two types of operations which do not parallelize well
on either execution platform are Cholesky factorization, because
of the small size of the matrix to be factored and because of
dependency in the computation of the matrix sub-blocks; and vector
operations, which suffer from being interleaved with other array
operations in the code such that the chances of the needed vectors
being in the cache the next time around are small.
| NP | time | spdup | d-s | chol | sys | m-m | m-v | vec |
| 1 | 483.22 | 1.00 | 22.33 | 1.95 | 55.07 | 384.97 | 3.14 | 0.99 |
| 2 | 246.56 | 1.96 | 11.48 | 1.07 | 27.53 | 193.48 | 1.37 | 0.69 |
| 4 | 122.09 | 3.96 | 5.34 | 0.58 | 13.38 | 95.13 | 0.54 | 0.34 |
| 6 | 93.00 | 5.20 | 3.59 | 0.53 | 9.28 | 59.87 | 0.47 | 0.27 |
| 8 | 57.54 | 8.40 | 2.49 | 0.38 | 6.32 | 43.81 | 0.20 | 0.19 |
| 10 | 52.93 | 9.13 | 2.28 | 0.36 | 5.39 | 36.81 | 0.17 | 0.18 |
| 12 | 44.37 | 10.80 | 2.00 | 0.33 | 4.54 | 30.46 | 0.13 | 0.16 |
| 14 | 42.01 | 11.50 | 1.83 | 0.30 | 3.89 | 27.08 | 0.11 | 0.17 |
| 16 | 33.20 | 14.55 | 1.91 | 0.28 | 3.70 | 24.11 | 0.11 | 0.17 |
| 20 | 31.14 | 15.52 | 1.57 | 0.31 | 3.41 | 20.12 | 0.10 | 0.15 |
| 24 | 25.07 | 19.27 | 1.40 | 0.27 | 2.56 | 17.25 | 0.09 | 0.15 |
| 28 | 24.58 | 19.66 | 1.28 | 0.30 | 2.38 | 15.52 | 0.08 | 0.14 |
| 32 | 20.00 | 24.16 | 1.35 | 0.28 | 2.12 | 13.31 | 0.07 | 0.15 |
| NP: | number of processors | d-s: | dense-sparse matrix multiplications |
| time: | work time (total execution time minus initialization, input, and output time) | chol: | Cholesky factorization |
| spdup: | speedup factor | sys: | system solves |
| | | m-m: | dense matrix multiplications |
| | | m-v: | dense matrix-vector multiplications |
| | | vec: | vector operations |
| NP | time | spdup | d-s | chol | sys | m-m | m-v | vec |
| 1 | 924.57 | 1.00 | 17.33 | 0.83 | 33.18 | 861.05 | 3.01 | 0.61 |
| 2 | 446.42 | 2.07 | 9.09 | 0.50 | 16.90 | 411.72 | 1.26 | 0.33 |
| 4 | 215.95 | 4.28 | 4.67 | 0.29 | 8.35 | 197.34 | 0.29 | 0.17 |
| 6 | 137.95 | 6.70 | 2.58 | 0.22 | 5.09 | 120.30 | 0.21 | 0.12 |
| 8 | 110.48 | 8.37 | 2.29 | 0.34 | 4.73 | 92.14 | 0.16 | 0.10 |
| 10 | 87.98 | 10.51 | 1.90 | 0.17 | 3.13 | 75.98 | 0.09 | 0.10 |
| 12 | 72.60 | 12.74 | 1.71 | 0.17 | 3.01 | 62.32 | 0.12 | 0.08 |
| 14 | 67.83 | 13.63 | 1.70 | 0.16 | 2.62 | 56.28 | 0.07 | 0.08 |
| 16 | 60.02 | 15.40 | 1.53 | 0.18 | 2.31 | 51.07 | 0.07 | 0.08 |
| 20 | 49.09 | 18.83 | 1.42 | 0.16 | 1.93 | 41.57 | 0.06 | 0.08 |
| 24 | 43.93 | 21.05 | 1.43 | 0.33 | 1.62 | 37.10 | 0.05 | 0.08 |
| 32 | 38.14 | 24.24 | 1.17 | 0.16 | 1.37 | 32.22 | 0.04 | 0.08 |
| NP | time | spdup | d-s | chol | sys | m-m | m-v | vec |
| 1 | 159.99 | 1.00 | 6.96 | 0.69 | 19.48 | 128.86 | 0.49 | 0.33 |
| 2 | 82.65 | 1.94 | 3.42 | 0.35 | 9.76 | 66.38 | 0.25 | 0.16 |
| 4 | 42.20 | 3.79 | 1.65 | 0.19 | 4.93 | 33.77 | 0.13 | 0.09 |
| 6 | 32.30 | 4.95 | 1.13 | 0.15 | 3.28 | 22.53 | 0.09 | 0.06 |
| 8 | 21.79 | 7.34 | 0.84 | 0.12 | 2.46 | 17.21 | 0.06 | 0.05 |
| 10 | 18.83 | 8.50 | 0.69 | 0.11 | 1.97 | 13.98 | 0.05 | 0.04 |
| 12 | 15.98 | 10.01 | 0.59 | 0.10 | 1.67 | 11.55 | 0.04 | 0.05 |
| 14 | 14.49 | 11.04 | 0.50 | 0.10 | 1.43 | 10.05 | 0.04 | 0.04 |
| 16 | 11.59 | 13.80 | 0.47 | 0.10 | 1.26 | 8.87 | 0.03 | 0.04 |
| NP | time | spdup | d-s | chol | sys | m-m | m-v | vec |
| 1 | 272.53 | 1.00 | 5.37 | 0.32 | 11.55 | 253.52 | 0.29 | 0.15 |
| 2 | 145.41 | 1.87 | 2.68 | 0.17 | 5.73 | 134.46 | 0.15 | 0.08 |
| 4 | 72.56 | 3.76 | 1.33 | 0.10 | 2.88 | 66.68 | 0.08 | 0.05 |
| 6 | 50.35 | 5.41 | 0.91 | 0.08 | 2.06 | 45.19 | 0.05 | 0.03 |
| 8 | 37.26 | 7.31 | 0.69 | 0.06 | 1.45 | 33.98 | 0.04 | 0.03 |
| 10 | 29.44 | 9.26 | 0.56 | 0.06 | 1.17 | 26.77 | 0.03 | 0.03 |
| 12 | 24.96 | 10.92 | 0.48 | 0.05 | 0.96 | 22.44 | 0.03 | 0.03 |
| 14 | 21.91 | 12.44 | 0.43 | 0.05 | 0.84 | 19.69 | 0.03 | 0.03 |
| 16 | 18.86 | 14.45 | 0.40 | 0.06 | 0.74 | 16.85 | 0.02 | 0.03 |
For the Helix problem, we see a noticeable dip in
the speedup whenever the number of processors is not a power of
2. This is a shortcoming of the static scheduling algorithm.
Whenever the processors at a node cannot be evenly divided into
two groups for the two subtrees, the computation effectively proceeds
at the speed of the smaller group because the two groups must
synchronize at the node after update of the subtrees. The ribo30S
structure does not suffer this problem because its branching factor
is larger, allowing for more even division of work.
In comparison with our previous, non-hierarchical
version [6], the current application incurs noticeably higher
memory overhead. The hierarchical code dynamically allocates
nodes since their number and sizes can vary widely. It also handles
more general types of constraints which use pointers to link together
variable portions of the data structure. Hence data are much
more scattered about in the address space. Unrelated data for
different nodes may interleave depending on the placement decision
of malloc(), and many small data structures, especially the constraints,
cannot be properly placed and migrated at page granularity without
massive data reorganization. This problem can be ameliorated
by better data structure design to reduce fragmentation of logically
contiguous data.
We have demonstrated the efficiency of a hierarchical
approach in molecular structure estimation and successfully applied
parallel computation to its solution. The results of our experiments
suggest several areas of further study to extend the effectiveness
and functionality of our parallel algorithm.
Currently the application requires the user to specify
the entire structure hierarchy (although a simple and non-optimal
recursive bisection method is in place to decompose a flat problem
specification). A computationally efficient hierarchical decomposition
allows most of the constraints to be ascribed to nodes toward
the leaf level of the tree. Effective automatic decomposition
of a flat structure specification seems to us to require a good
solution to the graph partitioning problem. We can think of the
atoms of the molecule as nodes in a graph and constraints between
atoms as edges between the nodes. A heuristic to partition the
graph into a small number of loosely coupled subgraphs will lead
to an efficient decomposition of the molecular structure because
most of the observations will be localized in the sub-structures,
which can then be recursively subdivided.
An alternative, bottom-up approach may be to have
the user specify the leaf nodes of the structure hierarchy. The
leaves usually correspond to the building blocks of the molecular
structure such as nucleotides or amino acid residues and already
encapsulate a lot of information about locality of interaction.
The program can then look for a good hierarchical grouping of
the nodes to minimize the number of constraints that need to be
applied at the higher levels of the hierarchy.
The results of the Helix problem show that static
scheduling of processors leads to some loss of efficiency in a
hierarchy with low branching factors when processors cannot be
evenly divided among the subtrees of a node. This is a more difficult
problem to solve. Task (constraint) stealing between sibling
tree nodes is made difficult by the fact that the computation
within a node is an instance of the flat problem, which consists
of many phases separated by synchronization among the processors
assigned to that node. The stolen task results must be combined
with those of the originating node, further increasing computational
cost. However, dynamic reassignment of processors to nodes by
periodic global synchronization does merit further investigation.
This approach would make use of work estimation so that every processor (or processor group) takes roughly the same amount of time to process its share of constraints at the corresponding nodes between successive global synchronizations and processor re-groupings.
From the results on DASH it is evident that the bend
in the speedup curve correlates strongly with the increase in
the overhead of memory operations. This increase comes from using
pointers to link variable portions of data structures and the
consequent fragmentation of logically contiguous data. For our
application, this situation is not difficult to improve because,
although many data structures have variable parts, they have the
property that, once instantiated, they remain fixed throughout
the execution of the program. More careful memory management
to group related data together will improve data locality and
should not be difficult to implement.
Finally, we have not examined the impact of hierarchy
on convergence. All the computations and comparisons have been
made for one complete cycle of application of all the constraints.
Multiple iterations are usually required for the solution to
settle to a stable value. The difference between the hierarchical
organization and the flat computation is in the order of constraint
application. Hierarchical computation processes constraints in
order of locality of interaction, whereas flat computation is
not guided by this property of the application domain. A comparison
of three different ways of ordering constraints for flat computation
can be found in [1], which as expected shows that incorporating
domain knowledge into the mathematical procedure helps convergence.
We believe hierarchical organization of constraints should further
speed convergence in addition to reducing the computational complexity
within an iteration.
In the area of molecular structures there are two
lines of research related to the work described here. One is
distance geometry [12][13]. It takes as input a set of bounds
on the interatomic distances of the molecule. Within these bounds,
trial distances are generated and refined until a valid distance
matrix results. The first three eigenvectors of a closely related
matrix then give the coordinates of a candidate structure which satisfies the distance bounds. The other line of research comprises energy minimization and molecular dynamics [14][16], which
represent interactions between atoms by energy terms. Optimization
techniques are applied to find a conformation with the lowest
energy. A systematic comparison with respect to speed, accuracy,
and sensitivity of these methods can be found in [15].
Biocomputing applications are an emerging arena in
which high-performance computing has increasing relevance. One
important component of computational biology is molecular structure
determination. We have presented a parallel hierarchical method
for estimating molecular structure from multiple sources of uncertain
data and demonstrated the application on two shared-address-space
multiprocessors using data sets from real biological problems.
While we have obtained good speedups on the problems we have
examined, the results of our experiments point to three areas
of further improvement: automatic structure decomposition, processor
load balancing, and data locality. These issues have relevance
not only to our own application, but also to parallel hierarchical
computation in general. We are currently investigating ways to
address them.
We would like to thank John Hennessy for his support.
We also thank Radhika Thekkath for helpful comments and Richard
Chen for providing us with the 30S ribosome problem. This work
is supported in part by ARPA contract DABT63-94-C-0054.
RBA is a Culpeper Medical Scholar and is supported by NIH LM-05652. Computing support has been provided in part by CAMIS resource NIH LM-05305.
[2] Altman, R. B., C. C. Chen, W. B. Poland, J. P. Singh. Probabilistic Constraint Satisfaction with Non-Gaussian Noise. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pp. 15-22, San Francisco, CA, 1994.
[3] Altman, R. B., B. Weiser, H. F. Noller. Constraint Satisfaction Techniques for Modeling Large Complexes: Application to Central Domain of the 16S Ribosomal Subunit. Proceedings of Second International Conference on Intelligent Systems for Molecular Biology, pp.10-18, Stanford, August 1994, AAAI Press.
[4] Capel, M. S., D. M. Engelman, B. R. Freeborn, M. Kjeldgaard, J. A. Langer, V. Ramakrishnan, D. G. Schindler, D. K. Schneider, B. P. Schoenborn, I. Y. Sillers, et al. A Complete Mapping of the Proteins in the Small Ribosomal Subunit of Escherichia coli. Science, 238(4832): 1403-6, 1987.
[5] Capel, M. S., M. Kjeldgaard, D. M. Engelman, P. B. Moore. Positions of S2, S13, S16, S17, S19 and S21 in the 30S Ribosomal Subunit of Escherichia coli. J. Mol. Bio., 200(1): 65-87, 1988.
[6] Chen, C. C., J. P. Singh, W. B. Poland, R. B. Altman. Parallel Protein Structure Prediction from Uncertain Data. Proceedings of the Supercomputing '94 Conference, pp.570-579, Washington, D. C., 1994.
[7] Chen, R., D. Fink, R. B. Altman. Computing the Structure of Large Complexes: Applying Constraint Satisfaction Techniques to Modeling the 16S Ribosomal RNA. Biomolecular NMR Spectroscopy (Markley, J. L. and S. J. Opella, eds.), pp.279-299, Oxford University Press, 1995.
[8] Gelb, A., ed. Applied Optimal Estimation. The MIT Press, Cambridge, MA, 1984.
[9] Lenoski, D., J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. Proceedings of the 17th Annual International Symposium on Computer Architecture, pp.148-159, May 1990.
[10] Stryer, L. Biochemistry. W. H. Freeman, New York, 1991.
[11] Wuthrich, K. NMR of Proteins and Nucleic Acids. John Wiley and Sons, 1986.
[12] Crippen, G. M. Distance Geometry and Conformational Calculations. John Wiley and Sons, 1981.
[13] Havel, T. F., I. D. Kuntz, G. M. Crippen. The Theory and Practice of Distance Geometry. Bulletin of Mathematical Biology, 45(5), pp.665-720, 1983.
[14] Levitt, M., R. Sharon. Accurate Simulation of Protein Dynamics in Solution. Proceedings of the National Academy of Science, 85, pp.7557-7561, 1988.
[15] Liu, Y., D. Zhao, R. Altman, O. Jardetzky. A Systematic Comparison of Three Structure Determination Methods from NMR Data: Dependence upon Quality and Quantity of Data. Journal of Biomolecular NMR, 2, pp.373-388, 1992.
[16] Nemethy, G., H. A. Scheraga. Theoretical Studies of Protein Conformation by Means of Energy Computations. FASEB Journal, 4, pp.3189-3197, November 1990.