Cheng Che Chen
cchen@leland.stanford.edu
Jaswinder Pal Singh
jps@cs.princeton.edu
http://www.princeton.edu/~jps/
Russ B. Altman
altman@smi.stanford.edu
http://www-smi.stanford.edu/people/altman/
The determination of molecular structure is a key
element of molecular biology. Proteins and nucleic acids are
complex three-dimensional structures with thousands of atoms.
An understanding of their structure can facilitate the study
of how these molecules function and aid in the design of drugs
which augment, interfere with, or otherwise affect their functions.
The problem of determining molecular structure is
usually approached experimentally, using X-ray crystallography,
nuclear magnetic resonance (NMR), or a variety of biochemical
experiments that yield distances, angles, and other structural
information [11]. In addition to these experimental data, there
is a large body of knowledge from general chemistry: bond lengths,
bond angles, and torsion angles between atoms. The quality and
abundance of data from these sources can vary significantly.
Thus, it is important to combine these sources of information
effectively to produce not only a structure consistent with the
data, but also a measure of the variability in the estimated structure.
A probabilistic algorithm for integrating multiple
sources of uncertain data to compute an estimated structure was
reported in [1]. The parallelization of this algorithm was presented
in [6]. An extension to the algorithm to handle more general
probability distributions was published in [2]. In this paper,
we first briefly describe the formulation of the molecular structure
determination problem and its algorithm. Then we show that by
taking advantage of the problem structure of our application domain,
we are able to organize the computation hierarchically. The
hierarchical decomposition speeds up the computation tremendously
by reducing the sequential computational complexity. In addition,
the use of hierarchy exposes a new axis of parallelism that is
both important and challenging to exploit. The results we have
obtained for two problems reveal several issues in the automation
and efficient parallelization of the hierarchical approach. Finally
we outline the directions we are investigating to address these
issues.
In the following discussion, we use measurements,
observations, constraints, and data interchangeably
to denote the available information about the structure of a molecule.
We represent the unknown rectangular coordinates of the atoms of the molecule under examination by an n-dimensional state vector x = (x1, y1, z1, ..., xp, yp, zp)^T, where n = 3p is three times the number of atoms. An idealized measurement of the molecule is modeled by a vector-valued function y(x) of the state vector.
For example, NMR measurements yield short-range distances between neighboring atoms. A distance function between two atoms of the molecule has the form d(x) = sqrt((xi - xj)^2 + (yi - yj)^2 + (zi - zj)^2), where x is the state vector of atomic coordinates, and (xi, yi, zi) and (xj, yj, zj) are the positions of the two atoms i and j, respectively.
Random errors and noise in the measurement process are modeled as additive terms, so that an actual observation assumes the form z = y(x) + v,
where the noise v is modeled as a Gaussian
random vector. If the measurement process is unbiased, then repeated
independent measurements will cluster around the "true"
value, and the random noise will have zero mean. The covariance
matrix R of the Gaussian noise term is related to the precision
of the measurement technology. High-precision measurements have
tight variances. In the notation of probability, we assume v ~ N(0, R).
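As a concrete illustration of this measurement model (our own sketch, not code from the paper; the function name, the four-atom state, and the variance value are arbitrary), the following Python fragment builds a distance measurement function and simulates a noisy observation z = y(x) + v with v ~ N(0, R).

```python
import numpy as np

def distance(x, i, j):
    """Distance between atoms i and j, where x is the flat state vector
    (x1, y1, z1, ..., xp, yp, zp) of atomic coordinates."""
    return np.linalg.norm(x[3 * i:3 * i + 3] - x[3 * j:3 * j + 3])

rng = np.random.default_rng(0)
x_true = rng.normal(size=3 * 4)                 # hypothetical 4-atom molecule (n = 12)
R = np.array([[0.01]])                          # measurement covariance: tight variance = high precision
v = rng.multivariate_normal(np.zeros(1), R)     # additive Gaussian noise, v ~ N(0, R)
z = distance(x_true, 0, 1) + v[0]               # the actual observation, z = y(x) + v
```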
In addition to the state vector x, we also
maintain a covariance matrix C. The diagonal entries of
C are the variances of the coordinates of the state vector
and provide a measure of the uncertainty of the estimated numbers.
The off-diagonal entries are the covariances between the coordinates;
they indicate linear correlations between the components of x.
The information contained in the covariance matrix is useful
in assessing, for example, which parts of the molecule are better
defined by the data. Hence the pair (x, C)
represents our best estimate of the molecular structure along
with an indication of the variability of the estimated numbers,
given the available data.
We may incrementally update our structure estimate as new observations become available. Figure 1 shows the procedure for the application of one constraint [8].
The algorithm is able to process observations in
sequence rather than all at once because the effects of previous
updates are summarized as linear correlations in the covariance
matrix. Given an initial structure estimate, we apply this procedure
over the entire set of constraints to produce an updated structure.
Because of the nonlinearity in the measurement functions, we
typically have to re-initialize the covariance matrix and repeat
the cycle of updates until the estimate converges to an equilibrium
point. This algorithm can be extended to process non-Gaussian
constraints [2].
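Figure 1 is not reproduced here, but the update it describes is the Kalman-style procedure of [8]: linearize the constraint function at the current estimate, compute the filter gain, and correct the state and covariance. The Python sketch below shows one such update for a single constraint vector under our own assumptions: a finite-difference Jacobian in place of analytic derivatives, and a single linearization pass rather than the inner loop of Figure 1.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def jacobian(h, x, eps=1e-6):
    """Finite-difference Jacobian of the measurement function h at x
    (a stand-in for however the constraint functions are differentiated)."""
    m = len(h(x))
    H = np.zeros((m, len(x)))
    for k in range(len(x)):
        dx = np.zeros_like(x)
        dx[k] = eps
        H[:, k] = (h(x + dx) - h(x - dx)) / (2 * eps)
    return H

def apply_constraint(x, C, h, z, R):
    """One Kalman-style update of the estimate (x, C) with the observation
    z = h(x) + v, v ~ N(0, R).  Returns the updated (x, C)."""
    H = jacobian(h, x)                      # linearized measurement model (m x n)
    S = H @ C @ H.T + R                     # covariance of the linearized observation (m x m)
    K = cho_solve(cho_factor(S), H @ C).T   # filter gain (n x m), via Cholesky factorization and system solves
    x_new = x + K @ (z - h(x))              # state update
    C_new = C - K @ H @ C                   # covariance update, (I - K H) C
    return x_new, C_new
```

Repeated application of such an update over all constraints, with re-linearization between cycles, gives the iteration described in the text.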
Let's briefly analyze the computational complexity
of this algorithm. Suppose the dimension of the state vector
is n. Furthermore, assume that we have M scalar
observations organized into vectors of uniform dimension m,
so that the sequential update algorithm is applied M/m
times.
Application of the update procedure using one m-dimensional observation vector entails several kinds of computation: vector operations to evaluate the constraint functions and their derivatives, a Cholesky factorization of an m x m matrix, computation of the n x m filter gain matrix through system solves, and the updates of the state vector and covariance matrix.
The total amount of computation for one pass over the entire
set of constraints is M/m times the above steps. To assess
the effect of the batch factor m on the total computational
time, we consider the average work per scalar constraint by dividing
the total computational time by the total dimension of constraints.
We see from the above categories that the average work per scalar
constraint for vector operations decreases as m increases;
for Cholesky factorization it grows as O(m^2)
with m; and for the filter gain matrix computation it increases
as O(mn) per scalar observation.
Therefore we expect the average work per scalar observation
to increase linearly at the beginning as we process larger observation
vectors at a time. However, when m is increased to a magnitude comparable with n, the O(m^2) growth in
Cholesky factorization will start to dominate.
This simple analysis suggests that we process observations
in small chunks, although it does not indicate exactly what size
of chunks to use since asymptotic analysis does not capture the
behavior of the algorithm for small values of its parameters.
Also, it does not take into account the number of cycles of constraint
application to convergence. In practice, convergence seems to
be slightly faster with larger constraint batches.
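To make the batch-size tradeoff concrete, here is a rough per-scalar-constraint cost model of our own (the coefficients and node size are arbitrary placeholders, not measured values): a per-batch overhead amortized over m scalars, an O(mn) term for the filter gain, and an O(m^2) term for the Cholesky factorization.

```python
def per_scalar_cost(m, n, a=50.0, b=1.0, c=1.0):
    """Illustrative average work per scalar constraint for batch size m and
    state dimension n; a, b, c are arbitrary machine-dependent coefficients."""
    return a * n / m + b * m * n + c * m * m

n = 300                                              # hypothetical node size
best_m = min(range(1, n + 1), key=lambda m: per_scalar_cost(m, n))
print(best_m)                                        # under this model, a small batch size is best
```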
Biological molecules have a natural hierarchy in
their structure. Observations relating to the structure of the
molecules tend to be largely localized following this hierarchy
as well. For example, the amino acid residues which make up a
protein consist of a common backbone and distinguishing sidechains.
Residues which are nearby in the linear sequence of the protein's
primary structure may arrange themselves in space to form secondary
structures such as helices and sheets. These hierarchical subunits
may interact to form further aggregate features [10].
In this section, we present a hierarchical decomposition
of the state estimate update process. It derives from the problem
structure of the application domain, namely locality of most observations.
Mathematically the key observation is that updating the structure
estimate with localized information involves computation with
many zeros and leaves other currently uncorrelated (zero covariance)
parts of the estimate unmodified. This fact can be exploited
to reduce the computational complexity of the update procedure
by recursively dividing the molecule into smaller and smaller
parts and applying every constraint to the smallest unit which
wholly contains it.
Let our current estimate of the molecular structure be (x^-, C^-). Suppose the state vector estimate consists of two uncorrelated parts x(1) and x(2) (whose dimensions are n1 and n2, respectively, with n1 + n2 = n); that is, the covariance estimate C^- is block diagonal, with diagonal blocks C^-(1) and C^-(2) (of dimensions n1 x n1 and n2 x n2) and zero off-diagonal blocks.
Let's examine what happens when we update our estimate with an observation which pertains to only one part (say, x(1)) of the structure. The function (say, of dimension m) which relates this observation to the underlying state of the structure then depends only on that part; it has the form h(x) = h(x(1)).
If we trace the execution of the update process (Figure 1), inside the loop we first differentiate the function h with respect to the n state variables of x to produce a linearized relation H_j (dimension m x n), evaluated at the current estimate x_{+,j-1}. However, since h depends only on the first n1 state variables, the last n2 elements of each row of H_j are zeros; that is, H_j = [ H_j(1)  0 ], where H_j(1) has dimension m x n1.
This in turn leads to a localized non-zero structure for the (n x m) matrix C^- H_j^T: only its first n1 rows can be non-zero. The (m x m) matrix (H_j C^- H_j^T + R) is the covariance of the locally linearized model of the observation and in principle could be fully dense. However, the (n x m) gain matrix K_j = C^- H_j^T (H_j C^- H_j^T + R)^{-1} has the same non-zero structure as C^- H_j^T.
The zeros in the lower half of the filter gain matrix prevent change to the lower n2 coordinates of the state vector estimate. Furthermore, since K_j H_j is non-zero only in its upper left (n1 x n1) block, the update to the state covariance estimate, C_{+,j} = (I - K_j H_j) C^-, is restricted to the upper left block: C^-(1) is replaced by (I - K_j(1) H_j(1)) C^-(1), while C^-(2) and the zero off-diagonal blocks are unchanged.
Thus this observation of a local part of the molecule
produces only a corresponding local update to our structure estimate.
We should note that if we now introduce a constraint
between the two parts of the molecule, then the off-diagonal zero
blocks of the state covariance estimate will start to be filled
in, and both halves of the state estimate will begin to change
in response to new data.
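The localization argument above can be checked numerically. The following fragment (our own, reusing the apply_constraint sketch from earlier) builds a block-diagonal prior covariance, applies a distance constraint that touches only the first part of the state, and verifies that the second part of the estimate, its covariance block, and the zero off-diagonal blocks are untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 6, 9                                         # two initially uncorrelated parts
A = rng.normal(size=(n1, n1))
B = rng.normal(size=(n2, n2))
C_prior = np.block([[A @ A.T, np.zeros((n1, n2))],
                    [np.zeros((n2, n1)), B @ B.T]])   # block-diagonal covariance
x_prior = rng.normal(size=n1 + n2)

# A constraint involving only the first part: the distance between atoms 0 and 1.
h = lambda x: np.array([np.linalg.norm(x[0:3] - x[3:6])])
z, R = np.array([1.5]), np.array([[0.01]])

x_post, C_post = apply_constraint(x_prior, C_prior, h, z, R)
assert np.allclose(x_post[n1:], x_prior[n1:])             # lower part of the state unchanged
assert np.allclose(C_post[n1:, n1:], C_prior[n1:, n1:])   # its covariance block unchanged
assert np.allclose(C_post[:n1, n1:], 0.0)                 # off-diagonal blocks remain zero
```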
The above derivation suggests a decomposition principle to reduce useless computation with zeros: given an initial structure estimate (x^-, C^-), recursively partition the structure into initially uncorrelated sub-structures, apply each observation to the smallest sub-structure that wholly contains the atoms it involves, and update a parent only after all of its children have been updated.
We should emphasize that this hierarchical organization
achieves the same computation as the original flat problem. The
difference is in the elimination of useless operations with zeros
in the application of localized observations.
We can visualize a hierarchical organization of the
molecular structure such as shown in Figure 2. Our original molecule
of interest is positioned at the root. A node may give rise to
some number of children, corresponding to initially uncorrelated
sub-structures of the node. Associated with each node are observations
local to itself; that is, we try to apply constraints at the lowest
level of the tree possible. The update process starts at the
leaf level. We update a leaf node with its associated constraints
by applying the procedure of Figure 1. When all the children
of a node have been updated, we may proceed to update the node
itself.
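The following sketch (our own, with hypothetical data structures; it reuses the apply_constraint routine from the earlier sketch) expresses this bottom-up order as a post-order traversal of the structure hierarchy: a parent assembles a block-diagonal estimate from its children, which are uncorrelated at that point, and then applies its own constraints.

```python
import numpy as np
from scipy.linalg import block_diag
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node of the structure hierarchy: its local constraints and, for a leaf,
    an initial estimate (x, C).  Internal nodes assemble their estimate from
    their children."""
    constraints: list                      # list of (h, z, R) triples local to this node
    x: np.ndarray = None
    C: np.ndarray = None
    children: list = field(default_factory=list)

def update_tree(node):
    """Post-order traversal: update all children first (their subtrees are
    independent and could be processed in parallel), then the node itself."""
    for child in node.children:
        update_tree(child)
    if node.children:
        node.x = np.concatenate([c.x for c in node.children])
        node.C = block_diag(*[c.C for c in node.children])   # children are uncorrelated
    for h, z, R in node.constraints:       # constraints assigned to this level
        node.x, node.C = apply_constraint(node.x, node.C, h, z, R)
    return node.x, node.C
```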
In this section we use a set of simple problems to
compare the computational efficiency of the hierarchical approach
with that of the flat organization.
An RNA [10] molecule consists of a linear sequence
of bases folded into a unique shape in space. There are four
types of bases, A, C, G, and U, each consisting of a common backbone
and a distinguishing sidechain. Besides the covalent bonds which
chain the bases together, complementary bases may pair up out
of the linear order. We focus on the simplest motif of RNA structure,
a double helix. Structurally, an RNA double helix is made up
of a series of base pairs twisted into a spiral pattern.
In the flat organization, we simply treat the helix
as a long vector of atoms. For hierarchical decomposition, we
recursively divide the helical structure into two sub-helices
of equal length, down to individual base pairs. We further decompose
a base pair into two bases, each of which is divided into the
leaf level of a backbone and a sidechain. This decomposition
is illustrated in Figure 2.
The constraints we use to define the shape of the helix fall into five categories: distances within a backbone, distances within a sidechain, distances between the backbone and the sidechain of a base, distances between the two bases of a base pair, and distances between sub-helices that are joined into a parent helix.
For hierarchical computation, the first two types
of distances may be applied to the leaf nodes of backbones and
sidechains. One level up in the hierarchy, the individual bases
may be assigned the third type of distances to orient the backbone
and the sidechain. The fourth type of constraint positions two
bases into a base pair. Above the level of base pairs, we apply
the last category of distances whenever we join two sub-helices
to form a parent helix.
Table 1 and the accompanying Figure 5 show the execution
times, for both flat and hierarchical organization, for helices
of various lengths (in number of base pairs). We see that the
average execution time per scalar constraint with hierarchical
decomposition grows much more slowly than with flat organization
of the same problems.
| Helix Length (base pairs) | Number of Atoms | Number of Constraints | Flat Total Time (sec) | Flat Time per Constraint (sec) | Hierarchical Total Time (sec) | Hierarchical Time per Constraint (sec) | Speedup |
| 1 | 43 | 675 | 1.16 | 0.00172 | 0.65 | 0.00096 | 1.78 |
| 2 | 86 | 1574 | 7.78 | 0.00494 | 2.42 | 0.00154 | 3.21 |
| 4 | 170 | 3294 | 54.09 | 0.01642 | 8.45 | 0.00257 | 6.40 |
| 8 | 340 | 6810 | 427.23 | 0.06274 | 30.98 | 0.00455 | 13.79 |
| 16 | 680 | 13824 | 3436.18 | 0.24857 | 114.20 | 0.00826 | 30.09 |
We can qualitatively understand the phenomenon as
follows. Suppose the size of the molecule (dimension of the state
vector) is n. The computational complexity of the update
procedure of Figure 1 grows as O(n^2) for a single
scalar constraint. Therefore for flat organization the quadratic
growth of per-constraint execution time is expected.
Next consider a hierarchical decomposition of a molecular structure in which a parent node gives rise to two children of the same size, half as large as the parent. If the depth of the binary tree is d, then at level i (0 <= i <= d) there will be 2^i nodes, each of size O(2^(d-i)). The average execution time per constraint depends on how much of the computation we can push down the tree. Suppose a node at level i has m_i scalar constraints; since the cost per scalar constraint at such a node grows as the square of its size, the average execution time of a constraint is sum_{i=0..d} 2^i m_i O(4^(d-i)) divided by sum_{i=0..d} 2^i m_i.
In an optimistic scenario (such as the present examples of helices), most observations are localized, and we are able to assign most of them to nodes down toward the leaf level so that a node gets roughly some constant number of constraints. Then the average run time per constraint grows linearly with the size of the molecule: O(2^d) = O(n).
On the other hand, if a node has as many constraints as those of its children combined, then every level contributes a comparable total number of constraints, and the computational advantage of a hierarchical scheme over a flat organization is more modest: the average cost per constraint grows as O(4^d / d) = O(n^2 / log n), compared with O(n^2) for the flat organization.
We believe that most realistic problems lie between
these two bounds and that the advantages of hierarchical computation
are substantial in practice.
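The two bounds can be illustrated numerically with a small script of our own (the per-scalar cost is taken as proportional to the square of the node size, and the constraint counts per node are arbitrary): the first printed ratio stays roughly constant, indicating linear growth, while the second indicates growth close to n^2 / log n.

```python
def average_cost(d, m_at_level):
    """Average per-scalar cost in a binary hierarchy of depth d: level i has
    2**i nodes of size ~2**(d-i), and the cost per scalar constraint at a node
    is proportional to the square of the node size."""
    work = sum(2**i * m_at_level(i) * (2**(d - i))**2 for i in range(d + 1))
    count = sum(2**i * m_at_level(i) for i in range(d + 1))
    return work / count

for d in (4, 8, 12):
    n = 2**d                                                 # molecule size, up to a constant
    localized = average_cost(d, lambda i: 10)                # constant constraints per node
    aggregated = average_cost(d, lambda i: 10 * 2**(d - i))  # a node has as many as its children combined
    print(n, localized / n, aggregated * d / n**2)
```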
In this section, we discuss the potential parallelism
in our hierarchical algorithm. We first focus on the computation
within one node of the structure hierarchy. The computation required
for each node of the hierarchy is an instance of the flat problem
and involves repeated application of the update procedure of Figure 1
over all the constraints of the node.
A coarse-grained way to parallelize computation for a node is across the application of constraints. We may divide the set of constraints assigned to the node into two disjoint subsets. In parallel, these two groups of constraints can produce, from the same prior, two different updated structure estimates, (x1, C1) and (x2, C2). These two estimates may then be combined to produce a final estimate.
If we partition the collection of constraints into
more than two subsets, then the separately produced results have
to be combined pairwise in a "tournament" fashion.
This coarse-grained approach has the advantage of
low inter-processor communication requirement in the processing
of the constraints: the processing of disjoint subsets is independent.
However, there are two drawbacks. First, the combination process
is an expensive computational overhead which does not exist in
sequential application of constraints. If the node state vector
has dimension n, the combination process requires essentially
the same amount of work as applying a constraint vector of the
same dimension. If the total dimension of the entire set of constraints
is M, then M needs to be much larger than n
for this scheme to be worthwhile. Second, this parallel approach
requires duplication of the state data structures (x^-, C^-),
which take up the lion's share of the space required by a node.
In other words, this approach is more suitable for
processing massive, redundant observations on structures whose
state information is modest in size relative to the available
storage. In contrast, data on biological structures are usually
expensive to obtain and not in great abundance. Thus from the
point of view of both computational overhead and scalability of
memory requirement, parallelizing intra-node computation across
constraint application is not attractive. See [6] for a numerical
example.
We therefore rely on parallelism within the constraint
application procedure (Figure 1) for each set of constraints.
As shown in [6], this involves parallel matrix computation and
yields good performance for nodes of realistic sizes.
Because the nodes in the structure hierarchy become
progressively smaller toward the leaf level, sole reliance on
intra-node parallelism is inadequate. Hierarchical decomposition
exposes a new axis of parallelism beyond parallel matrix computation
within a node. Although the children of a node in the structure
hierarchy must be updated before the node, the computations for
the subtrees rooted at those children are independent (more generally,
nodes that do not have an ancestor-descendant relationship can
be processed in parallel). Therefore given a number of processors
assigned to a node, we may distribute these processors over the
child branches. Moreover, if a node is assigned only a subset
of the available processors, we may migrate the associated data
closer to those processors in a physically distributed memory
environment.
This presents a new challenge for efficient parallelization.
How do we assign processors to nodes to achieve load balance?
We would like to compute statically as much of the assignment
as possible to have better control over data locality.
Our strategy is to estimate the amount of work at
every node and divide processors among nodes accordingly. To
determine how to do this, we constructed a set of nodes of various
sizes. For each node size, we also varied the size of the constraint
vector (number of scalar constraints) input into the update procedure
at a time (Figure 1). We computed the average execution time
of a scalar constraint for each experimental setting. The results
are shown in Table 2 and the accompanying Figure 6.
| Constraint Batch Dimension | Node Size 43 Atoms | Node Size 86 Atoms | Node Size 170 Atoms | Node Size 340 Atoms | Node Size 680 Atoms |
| 1 | 0.00535 | 0.02008 | 0.07784 | 0.34601 | 1.41522 |
| 2 | 0.00324 | 0.01181 | 0.04571 | 0.19945 | 0.80863 |
| 4 | 0.00204 | 0.00712 | 0.02670 | 0.11354 | 0.45738 |
| 8 | 0.00154 | 0.00507 | 0.01868 | 0.07613 | 0.30157 |
| 16 | 0.00141 | 0.00435 | 0.01537 | 0.06001 | 0.23427 |
| 32 | 0.00176 | 0.00514 | 0.01689 | 0.06301 | 0.23850 |
| 64 | 0.00246 | 0.00628 | 0.01916 | 0.06657 | 0.25133 |
| 128 | 0.00387 | 0.00899 | 0.02429 | 0.07583 | 0.27472 |
| 256 | 0.00747 | 0.01533 | 0.03788 | 0.11143 | 0.38431 |
| 512 | 0.01630 | 0.02915 | 0.06277 | 0.15257 | 0.46112 |
Also evident from Figure 6 is the quadratic growth
of per-constraint execution time with the node size, which can
be inferred from the slope of the lines on the log plot.
We applied a constrained least-squares polynomial fit to the data points of Table 2, excluding those for very small constraint batch sizes. We imposed two simple checks on the regression.
Although the constrained regression may not fit the
data points as well as an unconstrained one, it ensures against
catastrophic failure of the fitted model away from the data points.
The result of the regression is a formula for t, the estimated execution time of an equivalent scalar constraint, as a function of the node size n (the dimension of its state vector) and the constraint batch dimension m.
We only included linear dependency on m because its higher-order
effects are negligible over the range of values that we typically
use and because the higher-order fits were not very stable and
failed to satisfy the checks.
Although the above experiments used only distance
constraints (the most prevalent form of data for molecular structures), we
could in principle obtain a work estimation formula for each type
of observation in which we are interested. This experiment may
be done only once for each type of constraint (and not for every
new problem) because the computational steps are the same except
for the evaluation of the constraint functions.
Given a hierarchical decomposition of a structure, we can now distribute the computation to processors using a simple heuristic: estimate the work in every subtree with the work-estimation formula, divide the processors assigned to a node among its child subtrees in proportion to the estimated work of those subtrees, and, once all the children have been updated, let all of the processors assigned to the node cooperate on the node's own constraints using intra-node parallelism.
A potential problem with this static assignment is
that load imbalance may still occur if the mismatch between processor
partition and subtree work ratio is significant.
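A minimal Python sketch of this static assignment (our own; estimate_node_work is a placeholder standing in for the fitted work-estimation formula, node.size and node.processors are hypothetical attributes, and corner cases such as more children than processors are ignored):

```python
import math

def estimate_node_work(node):
    """Placeholder for the fitted formula: here simply proportional to the square
    of the node's state dimension times its number of scalar constraints."""
    return node.size ** 2 * len(node.constraints)

def subtree_work(node):
    """Total estimated work in the subtree rooted at node."""
    return estimate_node_work(node) + sum(subtree_work(c) for c in node.children)

def assign_processors(node, processors):
    """Give `processors` to the node (they all cooperate on its own constraints
    after its subtrees finish), and split them among the child subtrees in
    proportion to the subtrees' estimated work."""
    node.processors = processors
    if not node.children:
        return
    works = [subtree_work(c) for c in node.children]
    exact = [processors * w / sum(works) for w in works]
    shares = [math.floor(e) for e in exact]
    # Largest-remainder rounding so the integer shares add up to `processors`.
    for k in sorted(range(len(shares)), key=lambda k: exact[k] - shares[k], reverse=True):
        if sum(shares) >= processors:
            break
        shares[k] += 1
    for child, share in zip(node.children, shares):
        assign_processors(child, share)
```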
We now present the results of parallel computation for two problems. One is the RNA double helix of length 16 base pairs (Figure 2) described in Section 3.1. The other is the prokaryotic 30S ribosomal subunit. This molecular complex consists of 21 proteins and the 16S rRNA molecule, which contains about 65 double helices plus roughly the same number of interconnecting coils. The positions of the proteins have been determined by neutron diffraction mapping [4][5] and serve as reference points for placing other parts of the macromolecule. The helices and coils consist of RNA bases with geometrical constraints between them. They are positioned using experimental data which provide information about distances between helices and from helices to the proteins. The modeled problem has about 900 pseudo-atoms and 6500 constraints. A low-resolution discrete conformational space search [3] was first performed to generate initial structure estimates. This preprocessing step serves to combat low-quality locally optimal solutions which may be encountered by analytical procedures such as the current algorithm. See [7] for details of the problem and its computation. The decomposition of the 30S ribosome problem is indicated in Figure 4.
The execution platforms on which we performed our
experiments are the Stanford DASH multiprocessor [9] and a Silicon
Graphics Challenge. The DASH is an experimental research multiprocessor
built at Stanford University. It supports an implicit shared-memory
communication abstraction in hardware, with hardware-supported
cache coherence. The machine we used has 32 processors organized
into 8 clusters. A cluster comprises four 33-MHz MIPS R3000 processors
connected by a shared bus, and clusters are connected together
in a mesh network. Every processor has separate 64KB first-level
instruction and data caches and a 256KB second-level cache. Although
the address space is logically shared, the main memory is physically
distributed. Every cluster has an equal fraction of the physical
memory of the machine. All caches in the system are kept coherent
in hardware using a distributed directory-based protocol.
The SGI Challenge is also a shared-memory multiprocessor.
It has sixteen 100-MHz MIPS R4400 processors with R4010 FPUs, connected to a 1.2 GB/sec bus. Each CPU has separate 16KB first-level instruction
and data caches and a 1MB unified second-level cache. Unlike
the DASH, the Challenge has centralized memory organization.
For both problems, we measured the execution time
of the program over one complete cycle of application of all the
constraints. It typically takes between 20 and 200 cycles over
the constraint set for the solution to converge, but the pattern
of computation is the same across cycles. We have excluded time
for input, output, and initialization to discount cold-start effects.
On DASH, to improve locality in main memory, we distributed some of the larger, contiguous pieces of data for a node only among the processor clusters assigned to that node, in a round-robin fashion to avoid hot spots. We also recorded the time spent in
the different types of array operations which make up almost all
the execution time of the application in a typical run.
The results are shown in Table 3 through Table 6
and their accompanying figures. One confirmation of the effectiveness
of the parallel hierarchical approach is that, compared to the
largest problem we computed in [6], the current problems are much
larger and yet executed in a fraction of the time. We see good
speedups for both problems, around 24 for 32 processors on DASH
and 14 for 16 processors on the Challenge. From the tables it
is clear that most of the performance comes from efficient system
solves, dense matrix multiplications, and dense matrix-vector
multiplications. These operations have low communication requirement,
and their computation can be effectively tiled for cache locality.
On the other hand, the dense-sparse matrix multiplication routines
achieve only about 55-75% of the ideal speedup on DASH. This
can again be understood in terms of locality, since the sparse
matrix to some degree randomly accesses its dense counterpart,
and remote cache misses (the proportion of which increases with
more processors) are more expensive than local cache misses on
DASH. The two types of operations which do not parallelize well
on either execution platform are Cholesky factorization, because
of the small size of the matrix to be factored and because of
dependency in the computation of the matrix sub-blocks; and vector
operations, which suffer from being interleaved with other array
operations in the code such that the chances of the needed vectors
being in the cache the next time around are small.
| NP | time | spdup | d-s | chol | sys | m-m | m-v | vec |
| 1 | 483.22 | 1.00 | 22.33 | 1.95 | 55.07 | 384.97 | 3.14 | 0.99 |
| 2 | 246.56 | 1.96 | 11.48 | 1.07 | 27.53 | 193.48 | 1.37 | 0.69 |
| 4 | 122.09 | 3.96 | 5.34 | 0.58 | 13.38 | 95.13 | 0.54 | 0.34 |
| 6 | 93.00 | 5.20 | 3.59 | 0.53 | 9.28 | 59.87 | 0.47 | 0.27 |
| 8 | 57.54 | 8.40 | 2.49 | 0.38 | 6.32 | 43.81 | 0.20 | 0.19 |
| 10 | 52.93 | 9.13 | 2.28 | 0.36 | 5.39 | 36.81 | 0.17 | 0.18 |
| 12 | 44.37 | 10.80 | 2.00 | 0.33 | 4.54 | 30.46 | 0.13 | 0.16 |
| 14 | 42.01 | 11.50 | 1.83 | 0.30 | 3.89 | 27.08 | 0.11 | 0.17 |
| 16 | 33.20 | 14.55 | 1.91 | 0.28 | 3.70 | 24.11 | 0.11 | 0.17 |
| 20 | 31.14 | 15.52 | 1.57 | 0.31 | 3.41 | 20.12 | 0.10 | 0.15 |
| 24 | 25.07 | 19.27 | 1.40 | 0.27 | 2.56 | 17.25 | 0.09 | 0.15 |
| 28 | 24.58 | 19.66 | 1.28 | 0.30 | 2.38 | 15.52 | 0.08 | 0.14 |
| 32 | 20.00 | 24.16 | 1.35 | 0.28 | 2.12 | 13.31 | 0.07 | 0.15 |
| NP: | number of processors | d-s: | dense-sparse matrix multiplications |
| time: | work time (total execution time minus initialization, input, and output time) | chol: | Cholesky factorization |
| spdup: | speedup factor | sys: | system solves |
| | | m-m: | dense matrix multiplications |
| | | m-v: | dense matrix-vector multiplications |
| | | vec: | vector operations |
| NP | time | spdup | d-s | chol | sys | m-m | m-v | vec |
| 1 | 924.57 | 1.00 | 17.33 | 0.83 | 33.18 | 861.05 | 3.01 | 0.61 |
| 2 | 446.42 | 2.07 | 9.09 | 0.50 | 16.90 | 411.72 | 1.26 | 0.33 |
| 4 | 215.95 | 4.28 | 4.67 | 0.29 | 8.35 | 197.34 | 0.29 | 0.17 |
| 6 | 137.95 | 6.70 | 2.58 | 0.22 | 5.09 | 120.30 | 0.21 | 0.12 |
| 8 | 110.48 | 8.37 | 2.29 | 0.34 | 4.73 | 92.14 | 0.16 | 0.10 |
| 10 | 87.98 | 10.51 | 1.90 | 0.17 | 3.13 | 75.98 | 0.09 | 0.10 |
| 12 | 72.60 | 12.74 | 1.71 | 0.17 | 3.01 | 62.32 | 0.12 | 0.08 |
| 14 | 67.83 | 13.63 | 1.70 | 0.16 | 2.62 | 56.28 | 0.07 | 0.08 |
| 16 | 60.02 | 15.40 | 1.53 | 0.18 | 2.31 | 51.07 | 0.07 | 0.08 |
| 20 | 49.09 | 18.83 | 1.42 | 0.16 | 1.93 | 41.57 | 0.06 | 0.08 |
| 24 | 43.93 | 21.05 | 1.43 | 0.33 | 1.62 | 37.10 | 0.05 | 0.08 |
| 32 | 38.14 | 24.24 | 1.17 | 0.16 | 1.37 | 32.22 | 0.04 | 0.08 |
| NP | time | spdup | d-s | chol | sys | m-m | m-v | vec |
| 1 | 159.99 | 1.00 | 6.96 | 0.69 | 19.48 | 128.86 | 0.49 | 0.33 |
| 2 | 82.65 | 1.94 | 3.42 | 0.35 | 9.76 | 66.38 | 0.25 | 0.16 |
| 4 | 42.20 | 3.79 | 1.65 | 0.19 | 4.93 | 33.77 | 0.13 | 0.09 |
| 6 | 32.30 | 4.95 | 1.13 | 0.15 | 3.28 | 22.53 | 0.09 | 0.06 |
| 8 | 21.79 | 7.34 | 0.84 | 0.12 | 2.46 | 17.21 | 0.06 | 0.05 |
| 10 | 18.83 | 8.50 | 0.69 | 0.11 | 1.97 | 13.98 | 0.05 | 0.04 |
| 12 | 15.98 | 10.01 | 0.59 | 0.10 | 1.67 | 11.55 | 0.04 | 0.05 |
| 14 | 14.49 | 11.04 | 0.50 | 0.10 | 1.43 | 10.05 | 0.04 | 0.04 |
| 16 | 11.59 | 13.80 | 0.47 | 0.10 | 1.26 | 8.87 | 0.03 | 0.04 |
| NP | time | spdup | d-s | chol | sys | m-m | m-v | vec |
| 1 | 272.53 | 1.00 | 5.37 | 0.32 | 11.55 | 253.52 | 0.29 | 0.15 |
| 2 | 145.41 | 1.87 | 2.68 | 0.17 | 5.73 | 134.46 | 0.15 | 0.08 |
| 4 | 72.56 | 3.76 | 1.33 | 0.10 | 2.88 | 66.68 | 0.08 | 0.05 |
| 6 | 50.35 | 5.41 | 0.91 | 0.08 | 2.06 | 45.19 | 0.05 | 0.03 |
| 8 | 37.26 | 7.31 | 0.69 | 0.06 | 1.45 | 33.98 | 0.04 | 0.03 |
| 10 | 29.44 | 9.26 | 0.56 | 0.06 | 1.17 | 26.77 | 0.03 | 0.03 |
| 12 | 24.96 | 10.92 | 0.48 | 0.05 | 0.96 | 22.44 | 0.03 | 0.03 |
| 14 | 21.91 | 12.44 | 0.43 | 0.05 | 0.84 | 19.69 | 0.03 | 0.03 |
| 16 | 18.86 | 14.45 | 0.40 | 0.06 | 0.74 | 16.85 | 0.02 | 0.03 |
For the Helix problem, we see a noticeable dip in
the speedup whenever the number of processors is not a power of
2. This is a shortcoming of the static scheduling algorithm.
Whenever the processors at a node cannot be evenly divided into
two groups for the two subtrees, the computation effectively proceeds
at the speed of the smaller group because the two groups must
synchronize at the node after update of the subtrees. The ribo30S
structure does not suffer this problem because its branching factor
is larger, allowing for more even division of work.
In comparison with our previous, non-hierarchical
version [6], the current application incurs noticeably higher
memory overhead. The hierarchical code dynamically allocates
nodes since their number and sizes can vary widely. It also handles
more general types of constraints which use pointers to link together
variable portions of the data structure. Hence data are much
more scattered about in the address space. Unrelated data for
different nodes may interleave depending on the placement decision
of malloc(), and many small data structures, especially the constraints,
cannot be properly placed and migrated at page granularity without
massive data reorganization. This problem can be ameliorated
by better data structure design to reduce fragmentation of logically
contiguous data.
We have demonstrated the efficiency of a hierarchical
approach in molecular structure estimation and successfully applied
parallel computation to its solution. The results of our experiments
suggest several areas of further study to extend the effectiveness
and functionality of our parallel algorithm.
Currently the application requires the user to specify
the entire structure hierarchy (although a simple and non-optimal
recursive bisection method is in place to decompose a flat problem
specification). A computationally efficient hierarchical decomposition
allows most of the constraints to be ascribed to nodes toward
the leaf level of the tree. Effective automatic decomposition
of a flat structure specification seems to us to require a good
solution to the graph partitioning problem. We can think of the
atoms of the molecule as nodes in a graph and constraints between
atoms as edges between the nodes. A heuristic to partition the
graph into a small number of loosely coupled subgraphs will lead
to an efficient decomposition of the molecular structure because
most of the observations will be localized in the sub-structures,
which can then be recursively subdivided.
An alternative, bottom-up approach may be to have
the user specify the leaf nodes of the structure hierarchy. The
leaves usually correspond to the building blocks of the molecular
structure such as nucleotides or amino acid residues and already
encapsulate a lot of information about locality of interaction.
The program can then look for a good hierarchical grouping of
the nodes to minimize the number of constraints that need to be
applied at the higher levels of the hierarchy.
The results of the Helix problem show that static
scheduling of processors leads to some loss of efficiency in a
hierarchy with low branching factors when processors cannot be
evenly divided among the subtrees of a node. This is a more difficult
problem to solve. Task (constraint) stealing between sibling
tree nodes is made difficult by the fact that the computation
within a node is an instance of the flat problem, which consists
of many phases separated by synchronization among the processors
assigned to that node. The stolen task results must be combined
with those of the originating node, further increasing computational
cost. However, dynamic reassignment of processors to nodes by
periodic global synchronization does merit further investigation.
This approach would make use of work estimation so that every processor (or processor group) takes roughly the same amount of time to process its share of constraints at the corresponding nodes between successive global synchronizations and processor re-groupings.
From the results on DASH it is evident that the bend
in the speedup curve correlates strongly with the increase in
the overhead of memory operations. This increase comes from using
pointers to link variable portions of data structures and the
consequent fragmentation of logically contiguous data. For our
application, this situation is not difficult to improve because,
although many data structures have variable parts, they have the
property that, once instantiated, they remain fixed throughout
the execution of the program. More careful memory management
to group related data together will improve data locality and
should not be difficult to implement.
Finally, we have not examined the impact of hierarchy
on convergence. All the computations and comparisons have been
made for one complete cycle of application of all the constraints.
Multiple iterations are usually required for the solution to
settle to a stable value. The difference between the hierarchical
organization and the flat computation is in the order of constraint
application. Hierarchical computation processes constraints in
order of locality of interaction, whereas flat computation is
not guided by this property of the application domain. A comparison
of three different ways of ordering constraints for flat computation
can be found in [1], which as expected shows that incorporating
domain knowledge into the mathematical procedure helps convergence.
We believe hierarchical organization of constraints should further
speed convergence in addition to reducing the computational complexity
within an iteration.
In the area of molecular structures there are two
lines of research related to the work described here. One is
distance geometry [12][13]. It takes as input a set of bounds
on the interatomic distances of the molecule. Within these bounds,
trial distances are generated and refined until a valid distance
matrix results. The first three eigenvectors of a closely related
matrix then give the coordinates of a candidate structure which satisfies the distance bounds. The other line of research comprises energy minimization and molecular dynamics [14][16], which
represent interactions between atoms by energy terms. Optimization
techniques are applied to find a conformation with the lowest
energy. A systematic comparison with respect to speed, accuracy,
and sensitivity of these methods can be found in [15].
Biocomputing applications are an emerging arena in
which high-performance computing has increasing relevance. One
important component of computational biology is molecular structure
determination. We have presented a parallel hierarchical method
for estimating molecular structure from multiple sources of uncertain
data and demonstrated the application on two shared-address-space
multiprocessors using data sets from real biological problems.
While we have obtained good speedups on the problems we have
examined, the results of our experiments point to three areas
of further improvement: automatic structure decomposition, processor
load balancing, and data locality. These issues have relevance
not only to our own application, but also to parallel hierarchical
computation in general. We are currently investigating ways to
address them.
We would like to thank John Hennessy for his support.
We also thank Radhika Thekkath for helpful comments and Richard
Chen for providing us with the 30S ribosome problem. This work
is supported in part by ARPA contract DABT63-94-C-0054.
RBA is a Culpeper Medical Scholar and is supported by NIH LM-05652. Computing support has been provided in part by CAMIS resource NIH LM-05305.
[2] Altman, R. B., C. C. Chen, W. B. Poland, J. P. Singh. Probabilistic Constraint Satisfaction with Non-Gaussian Noise. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pp. 15-22, San Francisco, CA, 1994.
[3] Altman, R. B., B. Weiser, H. F. Noller. Constraint Satisfaction Techniques for Modeling Large Complexes: Application to Central Domain of the 16S Ribosomal Subunit. Proceedings of Second International Conference on Intelligent Systems for Molecular Biology, pp.10-18, Stanford, August 1994, AAAI Press.
[4] Capel, M. S., D. M. Engelman, B. R. Freeborn, M. Kjeldgaard, J. A. Langer, V. Ramakrishnan, D. G. Schindler, D. K. Schneider, B. P. Schoenborn, I. Y. Sillers, et al. A Complete Mapping of the Proteins in the Small Ribosomal Subunit of Escherichia coli. Science, 238(4832): 1403-6, 1987.
[5] Capel, M. S., M. Kjeldgaard, D. M. Engelman, P. B. Moore. Positions of S2, S13, S16, S17, S19 and S21 in the 30S Ribosomal Subunit of Escherichia coli. J. Mol. Bio., 200(1): 65-87, 1988.
[6] Chen, C. C., J. P. Singh, W. B. Poland, R. B. Altman. Parallel Protein Structure Prediction from Uncertain Data. Proceedings of the Supercomputing '94 Conference, pp.570-579, Washington, D. C., 1994.
[7] Chen, R., D. Fink, R. B. Altman. Computing the Structure of Large Complexes: Applying Constraint Satisfaction Techniques to Modeling the 16S Ribosomal RNA. Biomolecular NMR Spectroscopy (Markley, J. L. and S. J. Opella, eds.), pp.279-299, Oxford University Press, 1995.
[8] Gelb, A., ed. Applied Optimal Estimation. The MIT Press, Cambridge, MA, 1984.
[9] Lenoski, D., J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. Proceedings of the 17th Annual International Symposium on Computer Architecture, pp.148-159, May 1990.
[10] Stryer, L. Biochemistry. W. H. Freeman, New York, 1991.
[11] Wuthrich, K. NMR of Proteins and Nucleic Acids. John Wiley and Sons, 1986.
[12] Crippen, G. M. Distance Geometry and Conformational Calculations. John Wiley and Sons, 1981.
[13] Havel, T. F., I. D. Kuntz, G. M. Crippen. The Theory and Practice of Distance Geometry. Bulletin of Mathematical Biology, 45(5), pp.665-720, 1983.
[14] Levitt, M., R. Sharon. Accurate Simulation of Protein Dynamics in Solution. Proceedings of the National Academy of Science, 85, pp.7557-7561, 1988.
[15] Liu, Y., D. Zhao, R. Altman, O. Jardetzky. A Systematic Comparison of Three Structure Determination Methods from NMR Data: Dependence upon Quality and Quantity of Data. Journal of Biomolecular NMR, 2, pp.373-388, 1992.
[16] Nemethy, G., H. A. Scheraga. Theoretical Studies of Protein Conformation by Means of Energy Computations. FASEB Journal, 4, pp.3189-3197, November 1990.