Adolfy Hoisie
Stefan Goedecker
Jürg Hutter
Density Functional Theory (DFT) is one of the most popular methods for electronic structure calculations. It is at the same time highly accurate and faster than traditional quantum chemistry methods such as Hartree-Fock (HF) theory. Recently, much interest has focused on improving the standard Local Density Approximation (LDA) of density functional theory. Gradient-corrected DFT is one such scheme. Very encouraging results have been reported that demonstrate accuracy comparable with the best quantum chemistry methods, such as Coupled Cluster methods. Gradient-corrected schemes cancel most of the non-physical self-interactions of the LDA approximation, which are responsible for most of its deficiencies with respect to accuracy.
Another approach to improving upon LDA is to introduce the self-interaction corrections [1] (SIC) directly. This leads to a numerically more complicated scheme, since each orbital now feels a different potential. It is therefore no longer possible to view the electronic structure problem as an eigenvalue problem in an external potential that has to satisfy a self-consistency condition. Using a preconditioned gradient minimization approach [2] with a DIIS convergence accelerator [3] (sketched below), we are able to find the electronic ground state in a number of iterations not much larger than the number needed in ordinary LDA theory. Plane waves are used as the basis set in this calculation. This necessitates pseudopotentials to eliminate the highly localized electronic core states, which are impossible to describe with a reasonable number of plane waves.
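For orientation, a sketch of the standard Pulay-type DIIS extrapolation (the accelerator of reference [3] may differ in details): the new trial orbital is written as a linear combination of the $m$ previous trial orbitals,

    \tilde{\phi} = \sum_{i=1}^{m} c_i \phi_i , \qquad \sum_{i=1}^{m} c_i = 1 ,

with the coefficients chosen to minimize the norm of the extrapolated preconditioned residual $\sum_i c_i r_i$. This amounts to the small linear system

    \sum_{j=1}^{m} \langle r_i , r_j \rangle \, c_j = \lambda \qquad (i = 1,\dots,m),

together with the constraint, where $\lambda$ is a Lagrange multiplier. The dimension of this system equals the number of stored iterates, so solving it adds a negligible cost per iteration.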
An electronic structure program of this type consists of a fairly large number of subtasks performing different calculations that vary widely with respect to their computational needs (e.g., memory bandwidth or floating point performance). Several new algorithmic techniques were developed that reformulate these computational subtasks in such a way as to obtain very high performance on modern RISC architectures. Details will be given in the next section. In this section we discuss the general outline of the main computational tasks.
The program is based on the orbital parallel paradigm [4], i.e., each processor calculates one or several localized orbitals. Physical interactions between the orbitals lead to inter-processor communication. The two basic interactions involved are the electrostatic repulsion between the orbitals and the requirement that they be mutually orthogonal to satisfy the Pauli exclusion principle. Since the calculation of one localized orbital and its potential takes most of the CPU time, this parallelization scheme leads to rather modest communication requirements and high parallel performance. Its good scalability on an IBM SP2 is shown in Figure 1 for the case of a 64 atom Silicon system.
Figure 1: Speedup of this electronic structure code for a 64 atom Silicon system on an IBM SP2. The solid line is the measured speedup, the dashed line the ideal speedup.
The electrostatic potential acting on a given orbital is obtained by Fast Fourier
Transform (FFT) techniques.
The charge density giving rise to this potential is obtained by summing
over the charge densities of all the other orbitals in the system. In practice,
this is accomplished by first performing a global inter-processor reduction sum over all the orbitals and then subtracting the charge density of the orbital on which the potential is acting. If one uses FFT techniques to solve Poisson's equation, the total charge (i.e. electronic plus ionic charge) has to be zero. Due to the electrostatic self-interaction
correction, this is not the case. Therefore, one has to add a
localized neutralizing charge distribution which is analytically representable and whose electrostatic potential is known analytically as well.
The simplest charge-potential pair satisfying these requirements is a Gaussian charge distribution, which gives rise to a potential of the form $\operatorname{erf}(r/r_c)/r$, where $\operatorname{erf}$ is the error function and $r_c$ is the width of the Gaussian. The centre of the Gaussian charge distribution is chosen to coincide with the centre of the localized orbital in order to preserve crystal momentum in periodic structures.
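A minimal sketch of the density-accumulation step just described, assuming one orbital per process and a flattened real-space grid (function and variable names are illustrative, not taken from the actual code):

    #include <mpi.h>
    #include <stdio.h>

    /* Each process owns the real-space charge density of one localized orbital.
       The density acting on that orbital is the sum over all *other* orbitals,
       so a global reduction is performed first and the orbital's own density is
       subtracted afterwards (self-interaction correction).                      */
    static void density_seen_by_orbital(const double *rho_own, double *rho_seen,
                                        int n, MPI_Comm comm)
    {
        MPI_Allreduce(rho_own, rho_seen, n, MPI_DOUBLE, MPI_SUM, comm);
        for (int i = 0; i < n; i++)
            rho_seen[i] -= rho_own[i];
        /* the neutralizing Gaussian charge centred on this orbital would be
           added here before the FFT-based solution of Poisson's equation       */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        enum { N = 8 };                       /* tiny grid, for illustration only */
        double rho_own[N], rho_seen[N];
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < N; i++)
            rho_own[i] = rank + 1.0;          /* dummy orbital density            */
        density_seen_by_orbital(rho_own, rho_seen, N, MPI_COMM_WORLD);
        printf("rank %d: rho_seen[0] = %g\n", rank, rho_seen[0]);
        MPI_Finalize();
        return 0;
    }

A single reduction per iteration is the only collective operation needed for the electrostatic part, which is why the communication requirements of the orbital parallel scheme remain modest.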
The calculation of the exchange correlation potential in the SIC-LDA scheme again requires the total electronic density, as well as the individual orbital densities for evaluating the exchange correlation self-interaction correction. Most of the published exchange correlation potential forms contain many special functions, such as logarithms and square roots, that are very slow to calculate. We used a new rational functional form that is an order of magnitude faster to evaluate than these other forms.
Traditionally, pseudopotentials are applied in Fourier space rather than in real space. This leads, first, to large cubically scaling terms (which prevent one from calculating very large atomic systems) and, second, to large memory requirements. In our approach, we have implemented a new pseudopotential with optimal properties for a real space implementation [5]. The computational load due to this form of pseudopotential scales only quadratically with the size of the molecular system. Also, since it has a very simple analytical form, it can be evaluated on the fly (i.e., it is recalculated every time it is needed) and therefore has very small memory requirements. This is essential in bringing the memory requirements down to a very modest 80 Mbytes for our test run. Since the recalculation of this pseudopotential leads to short loops, involving predominantly multiplications, this kernel runs at roughly 60 Mflops. It is the slowest part of our program. However, taking into account the large memory saving, it proved to be a useful tradeoff.
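To illustrate what evaluation on the fly means here: the local part of a pseudopotential of the general shape used in reference [5] (an erf-screened Coulomb tail plus a short-ranged Gaussian-polynomial term) is simply recomputed inside the grid loop instead of being stored on the grid. The parameter values below are placeholders for illustration, not published pseudopotential coefficients.

    #include <math.h>
    #include <stdio.h>

    /* Local pseudopotential of the general analytic form used in [5]: an
       erf-screened Coulomb tail plus a localized Gaussian-polynomial part.
       zion, rloc and c1..c4 are placeholder parameters, for illustration only. */
    static double vloc(double r, double zion, double rloc,
                       double c1, double c2, double c3, double c4)
    {
        const double pi = 3.14159265358979323846;
        double x  = r / rloc;
        double x2 = x * x;
        double coulomb = (r > 1e-12)
                       ? -zion * erf(r / (sqrt(2.0) * rloc)) / r
                       : -zion * sqrt(2.0 / pi) / rloc;   /* r -> 0 limit */
        return coulomb
             + exp(-0.5 * x2) * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4)));
    }

    int main(void)
    {
        /* the potential is evaluated on the fly at every grid point; nothing is
           precomputed or stored beyond the handful of analytic parameters       */
        for (int i = 1; i <= 8; i++) {
            double r = 0.25 * i;
            printf("r = %4.2f   vloc = % .6f\n",
                   r, vloc(r, 4.0, 0.44, -6.0, 0.9, 0.0, 0.0));
        }
        return 0;
    }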
The kinetic energy of each orbital is calculated in Fourier space, where the kinetic energy operator is a diagonal matrix, so FFTs are again involved in this part. Full advantage is taken of the fact that the orbitals are real (leading to a real-to-complex FFT) and of the fact that in the initial stages of the FFT most of the Fourier coefficients are zero. The sparsity of the Fourier coefficients leads to shorter loops, thereby reducing the flop rate by approximately 10 Mflops. Nevertheless this proved to be worthwhile, since it reduced the elapsed time of this part of the calculation by roughly one third.
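For completeness, in the plane wave basis the kinetic energy of an orbital $\phi_i$ with Fourier coefficients $c_i(\mathbf{G})$ is simply (in atomic units)

    T_i = \frac{1}{2} \sum_{\mathbf{G}} |\mathbf{G}|^2 \, |c_i(\mathbf{G})|^2 ,

and since the orbitals are real, $c_i(-\mathbf{G}) = c_i(\mathbf{G})^{*}$, so only half of the coefficients need to be stored and transformed.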
Several new techniques are utilized for the orthogonalization of the orbitals. For a symmetric Löwdin orthogonalization, one needs the overlap matrix $S$ between the occupied orbitals. Optimal load balancing can be obtained only if each processor calculates some fraction of each element of this overlap matrix. This computation requires a data structure that is entirely different from the basic orbital parallel data structure. We have developed a data transformation consisting of two data transpositions. One of them is an on-processor transposition, whereas the second one is a global inter-processor transposition handled by a single call to the all-to-all MPI routine. The sequence of these two transformations reorders the data in the required way. The calculation of the matrix $S^{-1/2}$
is done by a fixed point iteration, allowing
full parallelization of this part as well. This orthogonalization task is
the only computation in the code scaling cubically.
By making use of the sparsity of the overlap matrix
in the case of large systems,
the scaling of this part could actually easily be reduced.
However, because the orthogonalization is fully parallelized,
its contribution to the CPU time is very small even for large systems.
For the 64 Silicon atom system considered
here it contributes only 3 percent to the total CPU time, while for a 256 atom system,
its contribution rises to 10 percent. Reducing the scaling would introduce additional
overhead for these system sizes and effectively slow down the program.
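The text does not specify which fixed point iteration is used for $S^{-1/2}$; one common choice, which converges quickly when the orbitals are already nearly orthogonal (so that $S$ is close to the identity), is the Newton-Schulz-type iteration $X_{k+1} = \frac{1}{2} X_k (3I - S X_k^2)$ with $X_0 = I$, sketched below with a naive matrix product purely for illustration. Since the iteration consists only of matrix multiplications, it parallelizes well, in line with the statement above.

    #include <stdio.h>

    #define N 3   /* number of occupied orbitals in this toy example */

    /* C = A * B for N x N matrices (naive triple loop, illustration only) */
    static void matmul(double A[N][N], double B[N][N], double C[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++) s += A[i][k] * B[k][j];
                C[i][j] = s;
            }
    }

    /* Fixed point iteration X_{k+1} = 0.5 * X_k * (3I - S X_k X_k), X_0 = I,
       converging to S^(-1/2) when S is close to the identity, i.e. when the
       orbitals are already nearly orthogonal.                                 */
    static void inv_sqrt(double S[N][N], double X[N][N], int iters)
    {
        double T1[N][N], T2[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) X[i][j] = (i == j) ? 1.0 : 0.0;

        for (int it = 0; it < iters; it++) {
            matmul(X, X, T1);                         /* T1 = X^2            */
            matmul(S, T1, T2);                        /* T2 = S X^2          */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    T2[i][j] = ((i == j) ? 3.0 : 0.0) - T2[i][j];
            matmul(X, T2, T1);                        /* T1 = X (3I - S X^2) */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) X[i][j] = 0.5 * T1[i][j];
        }
    }

    int main(void)
    {
        /* nearly orthogonal orbitals: overlap matrix close to the identity */
        double S[N][N] = {{1.00, 0.05, 0.02},
                          {0.05, 1.00, 0.03},
                          {0.02, 0.03, 1.00}};
        double X[N][N], T[N][N], C[N][N];
        inv_sqrt(S, X, 8);
        matmul(X, S, T); matmul(T, X, C);   /* check: X S X should be close to I */
        for (int i = 0; i < N; i++)
            printf("%8.5f %8.5f %8.5f\n", C[i][0], C[i][1], C[i][2]);
        return 0;
    }

In the actual Löwdin step, the orthogonalized orbitals are then obtained as linear combinations of the original ones with the rows of $S^{-1/2}$ as coefficients.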
Fast Fourier Transforms have a large ratio of loads and stores to floating point operations. On RISC processors, FFT performance is usually limited by
the speed at which data can be fed into the CPU. We have therefore
implemented a data access pattern [6]
that leads to mainly stride one data access. In
addition, this access pattern creates long inner loops, an important
optimization consideration for the R8000 processor
which has a rather deep pipeline. The 4 MByte-large secondary cache of
the R8000 architecture boosts performance significantly compared to other RISC processors
without this feature,
since it can hold all the data needed for moderately large 3-D FFT
transforms. In addition, we use an FFT kernel which has an optimal balance
of multiplications and additions [7]
and therefore runs virtually at peak performance in cases where the data is available from the first level cache.
In addition to these algorithmically efficient
concepts, an optimal implementation is needed. Software pipelining is used in order to
get good performance on this processor's deep pipeline. Since none of the
compiler options we tried gave satisfactory results, all the FFT kernels were
software pipelined by hand.
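As an illustration of the stride-one access idea (a simplified sketch, not necessarily the exact scheme of reference [6]): for the transform along the slowest-varying dimension of a 3-D array, a whole row of pencils is first gathered into a small contiguous work buffer. The copy loops are then stride one in their inner index, the buffer fits in cache, and a 1-D FFT kernel working on this buffer gets a long, stride-one inner loop over the pencils.

    #include <stdio.h>

    enum { NX = 16, NY = 16, NZ = 16, NB = NX };  /* NB pencils handled together */

    static double grid[NZ][NY][NX];   /* x is the fastest (stride-one) index     */
    static double buf[NZ][NB];        /* contiguous work buffer for one xz-plane */

    /* Sketch of the access pattern: the z-transforms of a row of NB pencils are
       done together.  Gathering them into the small contiguous buffer makes the
       inner loops below stride one and would give a 1-D FFT kernel a long inner
       loop of length NB over the pencils.                                       */
    static void transform_z_row(int iy)
    {
        for (int iz = 0; iz < NZ; iz++)           /* gather: stride one in ix    */
            for (int ix = 0; ix < NB; ix++)
                buf[iz][ix] = grid[iz][iy][ix];

        /* ... a 1-D FFT of length NZ would be applied here to all NB pencils at
           once, with the pencil index ix as the stride-one inner loop ...       */

        for (int iz = 0; iz < NZ; iz++)           /* scatter back, again stride one */
            for (int ix = 0; ix < NB; ix++)
                grid[iz][iy][ix] = buf[iz][ix];
    }

    int main(void)
    {
        for (int iy = 0; iy < NY; iy++)
            transform_z_row(iy);
        printf("processed %d rows of %d pencils each\n", NY, NB);
        return 0;
    }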
The real-to-complex 3-D FFT of size
for the electrostatic potential part runs at 140 Mflops and is nearly two times faster
than the corresponding FFT from the SGI library.
The exchange correlation functional is given by the rational (Padé) form

    \epsilon_{xc}(r_s) = - \frac{a_0 + a_1 r_s + a_2 r_s^2 + a_3 r_s^3}{b_1 r_s + b_2 r_s^2 + b_3 r_s^3 + b_4 r_s^4} ,

where $r_s = (3/(4\pi\rho))^{1/3}$ and the coefficients $a_i$ and $b_i$ are given in reference [5]. The calculation of $r_s$ from the charge density $\rho$ is numerically much more expensive than the evaluation of the rational function. Unfortunately, the calculation of $r_s$ can not be avoided, since only a rational function of $r_s$ gives a correct asymptotic behaviour of $\epsilon_{xc}$ at low densities. To speed up this part of the calculation, we compute $r_s$ by a Newton iteration of the equation $r_s^3 = 3/(4\pi\rho)$.
We take advantage of the fact that the exchange correlation
potential is not needed to full machine precision. We also make use of
the smooth variation of the charge density on the grid, which ensures
that the solution of the Newton
iteration on one grid point is a very good initial guess for the Newton
iteration at the next grid point. The divisions in this part of the program are accelerated by replacing them with the reciprocal instruction, which runs with a repeat rate of only 11 clock cycles. Using all these tricks, the calculation of $r_s$ is sped up by a factor of 3 compared to a straightforward evaluation of $(3/(4\pi\rho))^{1/3}$.
Since the divisions in this kernel are a significant part
of the program, we count one division as 22 floating point operations.
This corresponds to the maximum number of operations which one
floating point unit can do in 11 cycles. Let us also point out that this throughput for the reciprocal instruction on the SGI is really impressive. On an IBM RS6000,
a division takes 15 clock cycles corresponding to 60 floating point operations while
on a DEC Alpha EV4 processor a division takes 61 cycles corresponding to 61
floating point operations. We are attaining a speed
of 230 Mflops in this section of the code.
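A sketch of the $r_s$ kernel described above; the loop structure, the number of Newton steps and the variable names are illustrative assumptions, not taken from the actual code.

    #include <math.h>
    #include <stdio.h>

    #define NPTS 1000

    /* r_s = (3/(4*pi*rho))^(1/3) is obtained from a Newton iteration for the
       equation r_s^3 = 3/(4*pi*rho).  Because the charge density varies smoothly
       from one grid point to the next, the converged r_s of the previous point is
       an excellent starting guess, and because the exchange correlation potential
       is not needed to full machine precision, a fixed, small number of Newton
       steps suffices.                                                            */
    int main(void)
    {
        double rho[NPTS], rs[NPTS];
        const double pi = 3.14159265358979323846;

        /* a smoothly varying dummy charge density */
        for (int i = 0; i < NPTS; i++)
            rho[i] = 0.02 + 0.01 * sin(2.0 * pi * i / NPTS);

        double x = cbrt(3.0 / (4.0 * pi * rho[0]));   /* accurate start, point 0 */
        for (int i = 0; i < NPTS; i++) {
            double c = 3.0 / (4.0 * pi * rho[i]);
            /* two Newton steps for x^3 = c, warm-started from the previous point;
               the divisions would map onto the hardware reciprocal instruction   */
            for (int it = 0; it < 2; it++)
                x = (2.0 * x + c / (x * x)) / 3.0;
            rs[i] = x;
        }

        printf("rs[0]    = %.12f (exact %.12f)\n",
               rs[0], cbrt(3.0 / (4.0 * pi * rho[0])));
        printf("rs[last] = %.12f (exact %.12f)\n",
               rs[NPTS - 1], cbrt(3.0 / (4.0 * pi * rho[NPTS - 1])));
        return 0;
    }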
As mentioned above, the monopole correction requires adding a neutralizing Gaussian charge density to the total charge density and subtracting its analytically known potential from the solution obtained by the FFT techniques. A direct evaluation of a Gaussian and an error function on all the grid points is extremely expensive. We therefore approximate both of them by a Chebychev expansion of order 50, which gives close to machine precision accuracy for the range of arguments encountered in the computational problem. We schedule the calculation of both special functions such that only one evaluation of the Chebychev recursion is needed to evaluate both of them.
For the determination of the centre of mass of each localized orbital, we multiply the orbital charge density by a smooth cutoff function, which is also related to an error function. This function is likewise expanded in a Chebychev polynomial of degree 50. In order to avoid instruction dependencies in the Chebychev recursion, we carefully hand-optimised this part of the program, which runs at 180 Mflops per processor.
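The following sketch shows the idea of sharing one Chebychev recursion between the two special functions; the coefficient generation and the mapping of the argument onto the expansion interval are simplified here, and all names are illustrative.

    #include <math.h>
    #include <stdio.h>

    #define DEG 50        /* degree of the Chebychev expansion, as in the text */

    /* Chebychev coefficients of f on [-1,1], computed at the Chebychev nodes.
       (In the real code the argument is first mapped from the physically
       relevant interval onto [-1,1]; that mapping is omitted here.)           */
    static void cheb_coeffs(double (*f)(double), double c[DEG + 1])
    {
        const double pi = 3.14159265358979323846;
        int n = DEG + 1;
        for (int j = 0; j <= DEG; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++) {
                double t = pi * (k + 0.5) / n;
                s += f(cos(t)) * cos(j * t);
            }
            c[j] = 2.0 * s / n;
        }
    }

    static double f_gauss(double x) { return exp(-x * x); }
    static double f_erf(double x)   { return erf(x); }

    int main(void)
    {
        double cg[DEG + 1], ce[DEG + 1];
        cheb_coeffs(f_gauss, cg);
        cheb_coeffs(f_erf,   ce);

        double x = 0.37;                /* sample point in [-1,1] */

        /* A single Chebychev recursion T_{k+1} = 2x T_k - T_{k-1} serves both
           expansions: the two sums differ only in their coefficient sets.      */
        double tkm1 = 1.0, tk = x;
        double sg = 0.5 * cg[0] + cg[1] * x;
        double se = 0.5 * ce[0] + ce[1] * x;
        for (int k = 2; k <= DEG; k++) {
            double tkp1 = 2.0 * x * tk - tkm1;
            sg += cg[k] * tkp1;
            se += ce[k] * tkp1;
            tkm1 = tk;
            tk = tkp1;
        }

        printf("gaussian: %.15f (libm %.15f)\n", sg, f_gauss(x));
        printf("erf     : %.15f (libm %.15f)\n", se, f_erf(x));
        return 0;
    }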
The program was originally developed for massively parallel computers, where communication is usually the bottleneck. Keeping communication at a minimum was therefore the leading rationale at all stages of the development. As a consequence, communication is virtually negligible on this 6 processor machine and we utilised MPI for portability reasons, even though it is certainly not optimal for a shared memory machine of this type.
The program can be applied to both molecules and periodic solids. For the benchmark run, we chose the calculation of the electronic structure of a vacancy in a 64 atom Silicon system, shown in Figure 2. The position of the missing atom is denoted by the little red ball. One can see the Jahn-Teller distortion of the lattice atoms around the vacancy and one of the new bonds that is formed between two atoms brought together by this distortion. This bond is shown by the blue electron cloud.

To determine the number of operations of our benchmark, we ran a serial version of the program on a Cray YMP under the hardware performance monitor. We thus obtained the number of additions (138 G), multiplications (146 G) and divisions (4.56 G). Special function calls (exponentials etc.) were not taken into account. We also did not take into account that a small fraction of the calculations is duplicated in the parallel version. The number of operations obtained this way is therefore a lower bound on the real number of operations. For the time measurement we took the elapsed time of the parallel 6-way run (424 sec). Accounting for the divisions as described above, the computational speed was 0.903 Gflops. It is worth mentioning that the program runs at roughly 40 Mflops on the Cray, due to the software pipelining of the FFTs and the recursive nature of several subroutines, which destroy vectorization.

The price of our target machine is $144,100 (see Appendix). For the test run we used a 2-way memory interleaving configuration on an 8 processor Power Challenge and utilized only 6 of its processors. Technical staff from SGI confirmed that this configuration is absolutely identical to the machine whose price is detailed below. The MIPS R8000 processor, which equips both machines, has a peak performance of 360 Mflops and a 4 MByte secondary cache.
This then results in a performance/price ratio of 6.3 Gflops per million dollars.
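Putting the quoted numbers together: counting each division as 22 operations gives

    (138 + 146 + 22 \times 4.56)\ \mathrm{Gflop} \approx 384\ \mathrm{Gflop} ,

which over the 424 seconds of the 6-way run corresponds to roughly 0.9 Gflops, and $0.903\ \mathrm{Gflops} / 0.1441$ million dollars indeed gives the quoted 6.3 Gflops per million dollars.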
We thank Mike Teter and Cyrus Umrigar for their help with exchange correlation potentials. The support of the Cornell Theory Centre and the access to their high performance parallel computers was essential during the whole development of the program. We also thank Rosario Caltabiano and Robert Montone of SGI for their continuous support on the SGI machine. We thank Mark Scannapieco of the Cornell Theory Centre for his competent system administration work on the SGI.