Adolfy Hoisie
Stefan Goedecker
Jürg Hutter
Density Functional Theory (DFT) is one of the most popular methods for electronic structure calculations. It is at the same time highly accurate and faster than traditional quantum chemistry methods such as Hartree-Fock (HF) theory. Recently, much interest has focused on improving the standard Local Density Approximation (LDA) of density functional theory. Gradient-corrected DFT is one such scheme. Very encouraging results have been reported that demonstrate accuracy comparable with the best quantum chemistry methods, such as Coupled Cluster methods. Gradient-corrected schemes cancel most of the non-physical self-interactions of the LDA approximation, which are responsible for most of its deficiencies with respect to accuracy.
Another approach to improving upon LDA is to introduce the self-interaction corrections [1] (SIC) directly. This leads to a numerically more complicated scheme, since each orbital now feels a different potential. It is therefore no longer possible to view the electronic structure problem as an eigenvalue problem in an external potential that has to satisfy a self-consistency condition. Using a preconditioned gradient minimization approach [2] with a DIIS convergence accelerator [3] (sketched below), we are able to find the electronic ground state in a number of iterations not much larger than the number needed in ordinary LDA theory. Plane waves are used as the basis set in this calculation. This necessitates pseudopotentials to eliminate the highly localized electronic core states, which are impossible to describe with a reasonable number of plane waves.
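For orientation, a sketch of the standard Pulay-type DIIS extrapolation (the accelerator of reference [3] may differ in details): the new trial orbital is written as a linear combination of the $m$ previous trial orbitals,

    \tilde{\phi} = \sum_{i=1}^{m} c_i \phi_i , \qquad \sum_{i=1}^{m} c_i = 1 ,

with the coefficients chosen to minimize the norm of the extrapolated preconditioned residual $\sum_i c_i r_i$. This amounts to the small linear system

    \sum_{j=1}^{m} \langle r_i , r_j \rangle \, c_j = \lambda \qquad (i = 1,\dots,m),

together with the constraint, where $\lambda$ is a Lagrange multiplier. The dimension of this system equals the number of stored iterates, so solving it adds a negligible cost per iteration.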
An electronic structure program of this type consists of a fairly large number of subtasks performing different calculations that vary widely with respect to their computational needs (e.g., memory bandwidth or floating point performance). Several new algorithmic techniques were developed that reformulate these computational subtasks in such a way as to obtain very high performance on modern RISC architectures. Details will be given in the next section. In this section we discuss the general outline of the main computational tasks.
The program is based on the orbital parallel paradigm [4], i.e., each processor calculates one or several localized orbitals. Physical interactions between the orbitals lead to inter-processor communication. The two basic interactions involved are the electrostatic repulsion between the orbitals and the requirement that they be mutually orthogonal to satisfy the Pauli exclusion principle. Since the calculation of one localized orbital and its potential takes most of the CPU time, this parallelization scheme leads to rather modest communication requirements and high parallel performance. Its good scalability on an IBM SP2 is shown in Figure 1 for the case of a 64 atom Silicon system.
Figure 1: Speedup of this electronic structure code for a 64 atom Silicon system on an IBM SP2. The solid line is the measured speedup, the dashed line the ideal speedup.
The electrostatic potential acting on a given orbital is obtained by Fast Fourier
Transform (FFT) techniques.
The charge density giving rise to this potential is obtained by summing
over the charge densities of all the other orbitals in the system. In practice,
this is accomplished by first performing a global inter-processor reduction sum over all the orbitals and then subtracting the charge density of the orbital on which the potential is acting. If one uses FFT techniques to solve Poisson's equation, the total charge (i.e. electronic plus ionic charge) has to be zero. Due to the electrostatic self-interaction
correction, this is not the case. Therefore, one has to add a
localized neutralizing charge distribution which is analytically representable and whose electrostatic potential is known analytically as well.
The simplest charge-potential pair satisfying these requirements is a Gaussian charge distribution, which gives rise to a potential of the form $\operatorname{erf}(r/r_c)/r$, where $\operatorname{erf}$ is the error function and $r_c$ is the width of the Gaussian. The centre of the Gaussian charge distribution is chosen to coincide with the centre of the localized orbital in order to preserve crystal momentum in periodic structures.
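A minimal sketch of the density-accumulation step just described, assuming one orbital per process and a flattened real-space grid (function and variable names are illustrative, not taken from the actual code):

    #include <mpi.h>
    #include <stdio.h>

    /* Each process owns the real-space charge density of one localized orbital.
       The density acting on that orbital is the sum over all *other* orbitals,
       so a global reduction is performed first and the orbital's own density is
       subtracted afterwards (self-interaction correction).                      */
    static void density_seen_by_orbital(const double *rho_own, double *rho_seen,
                                        int n, MPI_Comm comm)
    {
        MPI_Allreduce(rho_own, rho_seen, n, MPI_DOUBLE, MPI_SUM, comm);
        for (int i = 0; i < n; i++)
            rho_seen[i] -= rho_own[i];
        /* the neutralizing Gaussian charge centred on this orbital would be
           added here before the FFT-based solution of Poisson's equation       */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        enum { N = 8 };                       /* tiny grid, for illustration only */
        double rho_own[N], rho_seen[N];
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < N; i++)
            rho_own[i] = rank + 1.0;          /* dummy orbital density            */
        density_seen_by_orbital(rho_own, rho_seen, N, MPI_COMM_WORLD);
        printf("rank %d: rho_seen[0] = %g\n", rank, rho_seen[0]);
        MPI_Finalize();
        return 0;
    }

A single reduction per iteration is the only collective operation needed for the electrostatic part, which is why the communication requirements of the orbital parallel scheme remain modest.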
The calculation of the exchange correlation potential in the SIC-LDA scheme again requires the total electronic density, as well as the individual orbital densities for evaluating the exchange correlation self-interaction correction. Most of the published exchange correlation potential forms contain many special functions, such as logarithms and square roots, that are very slow to calculate. We used a new rational functional form that is an order of magnitude faster to evaluate than these other forms.
Traditionally, pseudopotentials are applied in Fourier space rather than in real space. This leads, first, to large cubically scaling terms (which prevent one from calculating very large atomic systems) and, second, to large memory requirements. In our approach, we have implemented a new pseudopotential with optimal properties for a real space implementation [5]. The computational load due to this form of pseudopotential scales only quadratically with the size of the molecular system. Also, since it has a very simple analytical form, it can be evaluated on the fly (i.e., it is recalculated every time it is needed) and therefore has very small memory requirements. This is essential in bringing the memory requirements down to a very modest 80 Mbytes for our test run. Since the recalculation of this pseudopotential leads to short loops, involving predominantly multiplications, this kernel runs at roughly 60 Mflops. It is the slowest part of our program. However, taking into account the large memory saving, it proved to be a useful tradeoff.
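To illustrate what evaluation on the fly means here: the local part of a pseudopotential of the general shape used in reference [5] (an erf-screened Coulomb tail plus a short-ranged Gaussian-polynomial term) is simply recomputed inside the grid loop instead of being stored on the grid. The parameter values below are placeholders for illustration, not published pseudopotential coefficients.

    #include <math.h>
    #include <stdio.h>

    /* Local pseudopotential of the general analytic form used in [5]: an
       erf-screened Coulomb tail plus a localized Gaussian-polynomial part.
       zion, rloc and c1..c4 are placeholder parameters, for illustration only. */
    static double vloc(double r, double zion, double rloc,
                       double c1, double c2, double c3, double c4)
    {
        const double pi = 3.14159265358979323846;
        double x  = r / rloc;
        double x2 = x * x;
        double coulomb = (r > 1e-12)
                       ? -zion * erf(r / (sqrt(2.0) * rloc)) / r
                       : -zion * sqrt(2.0 / pi) / rloc;   /* r -> 0 limit */
        return coulomb
             + exp(-0.5 * x2) * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4)));
    }

    int main(void)
    {
        /* the potential is evaluated on the fly at every grid point; nothing is
           precomputed or stored beyond the handful of analytic parameters       */
        for (int i = 1; i <= 8; i++) {
            double r = 0.25 * i;
            printf("r = %4.2f   vloc = % .6f\n",
                   r, vloc(r, 4.0, 0.44, -6.0, 0.9, 0.0, 0.0));
        }
        return 0;
    }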
The kinetic energy of each orbital is calculated in Fourier space, where the kinetic energy operator is a diagonal matrix, so FFTs are again involved in this part. Full advantage is taken of the fact that the orbitals are real (leading to a real-to-complex FFT) and of the fact that in the initial stages of the FFT most of the Fourier coefficients are zero. The sparsity of the Fourier coefficients leads to shorter loops, thereby reducing the flop rate by approximately 10 Mflops. Nevertheless this proved to be worthwhile, since it reduced the elapsed time of this part of the calculation by roughly one third.
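For completeness, in the plane wave basis the kinetic energy of an orbital $\phi_i$ with Fourier coefficients $c_i(\mathbf{G})$ is simply (in atomic units)

    T_i = \frac{1}{2} \sum_{\mathbf{G}} |\mathbf{G}|^2 \, |c_i(\mathbf{G})|^2 ,

and since the orbitals are real, $c_i(-\mathbf{G}) = c_i(\mathbf{G})^{*}$, so only half of the coefficients need to be stored and transformed.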
Several new techniques are utilized for the orthogonalization of the orbitals. For a symmetric Löwdin orthogonalization, one needs the overlap matrix $S$ between the occupied orbitals. Optimal load balancing can be obtained only if each processor calculates some fraction of each element of this overlap matrix. This computation requires a data structure that is entirely different from the basic orbital parallel data structure. We have developed a data transformation consisting of two data transpositions. One of them is an on-processor transposition, whereas the second one is a global inter-processor transposition handled by a single call to the all-to-all MPI routine. The sequence of these two transformations reorders the data in the required way. The calculation of the matrix $S^{-1/2}$
is done by a fixed point iteration, allowing
full parallelization of this part as well. This orthogonalization task is
the only computation in the code scaling cubically.
By making use of the sparsity of the overlap matrix
in the case of large systems,
the scaling of this part could actually easily be reduced.
However, because the orthogonalization is fully parallelized,
its contribution to the CPU time is very small even for large systems.
For the 64 Silicon atom system considered
here it contributes only 3 percent to the total CPU time, while for a 256 atom system,
its contribution rises to 10 percent. Reducing the scaling would introduce additional
overhead for these system sizes and effectively slow down the program.
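The text does not specify which fixed point iteration is used for $S^{-1/2}$; one common choice, which converges quickly when the orbitals are already nearly orthogonal (so that $S$ is close to the identity), is the Newton-Schulz-type iteration $X_{k+1} = \frac{1}{2} X_k (3I - S X_k^2)$ with $X_0 = I$, sketched below with a naive matrix product purely for illustration. Since the iteration consists only of matrix multiplications, it parallelizes well, in line with the statement above.

    #include <stdio.h>

    #define N 3   /* number of occupied orbitals in this toy example */

    /* C = A * B for N x N matrices (naive triple loop, illustration only) */
    static void matmul(double A[N][N], double B[N][N], double C[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++) s += A[i][k] * B[k][j];
                C[i][j] = s;
            }
    }

    /* Fixed point iteration X_{k+1} = 0.5 * X_k * (3I - S X_k X_k), X_0 = I,
       converging to S^(-1/2) when S is close to the identity, i.e. when the
       orbitals are already nearly orthogonal.                                 */
    static void inv_sqrt(double S[N][N], double X[N][N], int iters)
    {
        double T1[N][N], T2[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) X[i][j] = (i == j) ? 1.0 : 0.0;

        for (int it = 0; it < iters; it++) {
            matmul(X, X, T1);                         /* T1 = X^2            */
            matmul(S, T1, T2);                        /* T2 = S X^2          */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    T2[i][j] = ((i == j) ? 3.0 : 0.0) - T2[i][j];
            matmul(X, T2, T1);                        /* T1 = X (3I - S X^2) */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) X[i][j] = 0.5 * T1[i][j];
        }
    }

    int main(void)
    {
        /* nearly orthogonal orbitals: overlap matrix close to the identity */
        double S[N][N] = {{1.00, 0.05, 0.02},
                          {0.05, 1.00, 0.03},
                          {0.02, 0.03, 1.00}};
        double X[N][N], T[N][N], C[N][N];
        inv_sqrt(S, X, 8);
        matmul(X, S, T); matmul(T, X, C);   /* check: X S X should be close to I */
        for (int i = 0; i < N; i++)
            printf("%8.5f %8.5f %8.5f\n", C[i][0], C[i][1], C[i][2]);
        return 0;
    }

In the actual Löwdin step, the orthogonalized orbitals are then obtained as linear combinations of the original ones with the rows of $S^{-1/2}$ as coefficients.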
Fast Fourier Transforms have a large ratio of loads and stores to floating point operations. On RISC processors, FFT performance is usually limited by
the speed at which data can be fed into the CPU. We have therefore
implemented a data access pattern [6]
that leads to mainly stride one data access. In
addition, this access pattern creates long inner loops, an important
optimization consideration for the R8000 processor
which has a rather deep pipeline. The 4 MByte-large secondary cache of
the R8000 architecture boosts performance significantly compared to other RISC processors
without this feature,
since it can hold all the data needed for moderately large 3-D FFT
transforms. In addition, we use an FFT kernel which has an optimal balance
of multiplications and additions [7]
and therefore runs virtually at peak performance in cases where the data is available from the first level cache.
In addition to these algorithmically efficient
concepts, an optimal implementation is needed. Software pipelining is used in order to
get good performance on this processor's deep pipeline. Since none of the
compiler options we tried gave satisfactory results, all the FFT kernels were
software pipelined by hand.
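As an illustration of the stride-one access idea (a simplified sketch, not necessarily the exact scheme of reference [6]): for the transform along the slowest-varying dimension of a 3-D array, a whole row of pencils is first gathered into a small contiguous work buffer. The copy loops are then stride one in their inner index, the buffer fits in cache, and a 1-D FFT kernel working on this buffer gets a long, stride-one inner loop over the pencils.

    #include <stdio.h>

    enum { NX = 16, NY = 16, NZ = 16, NB = NX };  /* NB pencils handled together */

    static double grid[NZ][NY][NX];   /* x is the fastest (stride-one) index     */
    static double buf[NZ][NB];        /* contiguous work buffer for one xz-plane */

    /* Sketch of the access pattern: the z-transforms of a row of NB pencils are
       done together.  Gathering them into the small contiguous buffer makes the
       inner loops below stride one and would give a 1-D FFT kernel a long inner
       loop of length NB over the pencils.                                       */
    static void transform_z_row(int iy)
    {
        for (int iz = 0; iz < NZ; iz++)           /* gather: stride one in ix    */
            for (int ix = 0; ix < NB; ix++)
                buf[iz][ix] = grid[iz][iy][ix];

        /* ... a 1-D FFT of length NZ would be applied here to all NB pencils at
           once, with the pencil index ix as the stride-one inner loop ...       */

        for (int iz = 0; iz < NZ; iz++)           /* scatter back, again stride one */
            for (int ix = 0; ix < NB; ix++)
                grid[iz][iy][ix] = buf[iz][ix];
    }

    int main(void)
    {
        for (int iy = 0; iy < NY; iy++)
            transform_z_row(iy);
        printf("processed %d rows of %d pencils each\n", NY, NB);
        return 0;
    }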
The real-to-complex 3-D FFT of size
for the electrostatic potential part runs at 140 Mflops and is nearly two times faster
than the corresponding FFT from the SGI library.
The exchange correlation functional is given by the rational (Padé) form

    \epsilon_{xc}(r_s) = - \frac{a_0 + a_1 r_s + a_2 r_s^2 + a_3 r_s^3}{b_1 r_s + b_2 r_s^2 + b_3 r_s^3 + b_4 r_s^4} ,

where $r_s = (3/(4\pi\rho))^{1/3}$ and the coefficients $a_i$ and $b_i$ are given in reference [5]. The calculation of $r_s$ from the charge density $\rho$ is numerically much more expensive than the evaluation of the rational function. Unfortunately, the calculation of $r_s$ can not be avoided, since only a rational function of $r_s$ gives a correct asymptotic behaviour of $\epsilon_{xc}$ at low densities. To speed up this part of the calculation, we compute $r_s$ by a Newton iteration of the equation $r_s^3 = 3/(4\pi\rho)$.
We take advantage of the fact that the exchange correlation
potential is not needed to full machine precision. We also make use of
the smooth variation of the charge density on the grid, which ensures
that the solution of the Newton
iteration on one grid point is a very good initial guess for the Newton
iteration at the next grid point. The divisions in this part of the program are accelerated by replacing them with the reciprocal instruction, which runs with a repeat rate of only 11 clock cycles. Using all these tricks, the calculation of $r_s$ is sped up by a factor of 3 compared to a straightforward evaluation of $(3/(4\pi\rho))^{1/3}$.
Since the divisions in this kernel are a significant part
of the program, we count one division as 22 floating point operations.
This corresponds to the maximum number of operations which one
floating point unit can do in 11 cycles. Let us also point out that this throughput for the reciprocal instruction on the SGI is really impressive. On an IBM RS6000,
a division takes 15 clock cycles corresponding to 60 floating point operations while
on a DEC Alpha EV4 processor a division takes 61 cycles corresponding to 61
floating point operations. We are attaining a speed
of 230 Mflops in this section of the code.
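A sketch of the $r_s$ kernel described above; the loop structure, the number of Newton steps and the variable names are illustrative assumptions, not taken from the actual code.

    #include <math.h>
    #include <stdio.h>

    #define NPTS 1000

    /* r_s = (3/(4*pi*rho))^(1/3) is obtained from a Newton iteration for the
       equation r_s^3 = 3/(4*pi*rho).  Because the charge density varies smoothly
       from one grid point to the next, the converged r_s of the previous point is
       an excellent starting guess, and because the exchange correlation potential
       is not needed to full machine precision, a fixed, small number of Newton
       steps suffices.                                                            */
    int main(void)
    {
        double rho[NPTS], rs[NPTS];
        const double pi = 3.14159265358979323846;

        /* a smoothly varying dummy charge density */
        for (int i = 0; i < NPTS; i++)
            rho[i] = 0.02 + 0.01 * sin(2.0 * pi * i / NPTS);

        double x = cbrt(3.0 / (4.0 * pi * rho[0]));   /* accurate start, point 0 */
        for (int i = 0; i < NPTS; i++) {
            double c = 3.0 / (4.0 * pi * rho[i]);
            /* two Newton steps for x^3 = c, warm-started from the previous point;
               the divisions would map onto the hardware reciprocal instruction   */
            for (int it = 0; it < 2; it++)
                x = (2.0 * x + c / (x * x)) / 3.0;
            rs[i] = x;
        }

        printf("rs[0]    = %.12f (exact %.12f)\n",
               rs[0], cbrt(3.0 / (4.0 * pi * rho[0])));
        printf("rs[last] = %.12f (exact %.12f)\n",
               rs[NPTS - 1], cbrt(3.0 / (4.0 * pi * rho[NPTS - 1])));
        return 0;
    }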
As mentioned above, the monopole correction requires adding a neutralizing Gaussian charge density to the total charge density and subtracting its analytically known potential from the solution obtained by the FFT techniques. A direct evaluation of a Gaussian and an error function on all the grid points is extremely expensive. We therefore approximate both of them by a Chebychev expansion of order 50, which gives close to machine precision accuracy for the range of arguments encountered in the computational problem. We schedule the calculation of both special functions such that only one evaluation of the Chebychev recursion is needed to evaluate both of them.
For the determination of the centre of mass of each localized orbital, we multiply the orbital charge density by a smooth cutoff function, which is also related to an error function. This function is likewise expanded in a Chebychev polynomial of degree 50. In order to avoid instruction dependencies in the Chebychev recursion, we carefully hand-optimised this part of the program, which runs at 180 Mflops per processor.
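The following sketch shows the idea of sharing one Chebychev recursion between the two special functions; the coefficient generation and the mapping of the argument onto the expansion interval are simplified here, and all names are illustrative.

    #include <math.h>
    #include <stdio.h>

    #define DEG 50        /* degree of the Chebychev expansion, as in the text */

    /* Chebychev coefficients of f on [-1,1], computed at the Chebychev nodes.
       (In the real code the argument is first mapped from the physically
       relevant interval onto [-1,1]; that mapping is omitted here.)           */
    static void cheb_coeffs(double (*f)(double), double c[DEG + 1])
    {
        const double pi = 3.14159265358979323846;
        int n = DEG + 1;
        for (int j = 0; j <= DEG; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++) {
                double t = pi * (k + 0.5) / n;
                s += f(cos(t)) * cos(j * t);
            }
            c[j] = 2.0 * s / n;
        }
    }

    static double f_gauss(double x) { return exp(-x * x); }
    static double f_erf(double x)   { return erf(x); }

    int main(void)
    {
        double cg[DEG + 1], ce[DEG + 1];
        cheb_coeffs(f_gauss, cg);
        cheb_coeffs(f_erf,   ce);

        double x = 0.37;                /* sample point in [-1,1] */

        /* A single Chebychev recursion T_{k+1} = 2x T_k - T_{k-1} serves both
           expansions: the two sums differ only in their coefficient sets.      */
        double tkm1 = 1.0, tk = x;
        double sg = 0.5 * cg[0] + cg[1] * x;
        double se = 0.5 * ce[0] + ce[1] * x;
        for (int k = 2; k <= DEG; k++) {
            double tkp1 = 2.0 * x * tk - tkm1;
            sg += cg[k] * tkp1;
            se += ce[k] * tkp1;
            tkm1 = tk;
            tk = tkp1;
        }

        printf("gaussian: %.15f (libm %.15f)\n", sg, f_gauss(x));
        printf("erf     : %.15f (libm %.15f)\n", se, f_erf(x));
        return 0;
    }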
The program was originally developed for massively parallel computers, where communication is usually the bottleneck. Keeping communication at a minimum was therefore the leading rationale at all stages of the development. As a consequence, communication is virtually negligible on this 6 processor machine and we utilised MPI for portability reasons, even though it is certainly not optimal for a shared memory machine of this type.
The program can be applied to both molecules and periodic solids. For the benchmark run, we chose the calculation of the electronic structure of a vacancy in a 64 atom Silicon system, shown in Figure 2. The position of the missing atom is denoted by the little red ball. One can see the Jahn-Teller distortion of the lattice atoms around the vacancy and one of the new bonds that is formed between two atoms brought together by this distortion. This bond is shown by the blue electron cloud.

To determine the number of operations of our benchmark, we ran a serial version of the program on a Cray YMP under the hardware performance monitor. We thus obtained the number of additions (138 G), multiplications (146 G) and divisions (4.56 G). Special function calls (exponentials etc.) were not taken into account. We also did not take into account that a small fraction of the calculations is duplicated in the parallel version. The number of operations obtained this way is therefore a lower bound on the real number of operations. For the time measurement we took the elapsed time of the parallel 6-way run (424 sec). Accounting for the divisions as described above, the computational speed was 0.903 Gflops. It is worth mentioning that the program runs at roughly 40 Mflops on the Cray, due to the software pipelining of the FFTs and the recursive nature of several subroutines, which destroy vectorization.

The price of our target machine is $144,100 (see Appendix). For the test run we used a 2-way memory interleaving configuration on an 8 processor Power Challenge and utilized only 6 of its processors. Technical staff from SGI confirmed that this configuration is absolutely identical to the machine whose price is detailed below. The MIPS R8000 processor, which equips both machines, has a peak performance of 360 Mflops and a 4 MByte secondary cache.
This then results in a performance/price ratio of 6.3 Gflops per million dollars.
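Putting the quoted numbers together: counting each division as 22 operations gives

    (138 + 146 + 22 \times 4.56)\ \mathrm{Gflop} \approx 384\ \mathrm{Gflop} ,

which over the 424 seconds of the 6-way run corresponds to roughly 0.9 Gflops, and $0.903\ \mathrm{Gflops} / 0.1441$ million dollars indeed gives the quoted 6.3 Gflops per million dollars.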
We thank Mike Teter and Cyrus Umrigar for their help with exchange correlation potentials. The support of the Cornell Theory Centre and the access to their high performance parallel computers was essential during the whole development of the program. We also thank Rosario Caltabiano and Robert Montone of SGI for their continuous support on the SGI machine. We thank Mark Scannapieco of the Cornell Theory Centre for his competent system administration work on the SGI.