Miscellanea


I hope the info in this page will be helpful to the reader...

Tips on calculation efficiency | Various Benchmarks | Troubleshooting

Tips on calculation efficiency

Frequency calculations: Memory | nCPU
Geometry optimizations: The resend trick | Memory | Different levels of theory
Single points: nCPU

Frequency calculations

Amount of RAM

A crucial parameter for frequency calculations is the amount of memory assigned to the job through the %mem keyword. In a frequency calculation Gaussian solves the so-called coupled-perturbed Hartree-Fock (CPHF) equations, computing a set of integrals and storing the results in a (big) matrix.

The values in this matrix can be stored on disk, kept in memory, or recomputed as needed. Nowadays CPUs are so fast, and disk I/O operations so slow in comparison, that the Direct algorithm (meaning "recalculate as needed", the default in Gaussian) is undoubtedly the fastest in almost every case. Even so, the Direct algorithm stores as large a part of the CPHF matrix as it can in RAM, because retrieving a result from memory is faster than recomputing it (unlike retrieving it from the hard disk).

It follows from the previous paragraph that the more memory is allocated to a freq job, the larger the chunk of the CPHF matrix that can be stored there, and the smaller the part that has to be recomputed (whenever its results are needed later in the calculation) because it did not fit in memory.

It is common wisdom that whenever a job requests a large amount of memory, the computer retaliates by assigning it less CPU time (damn vengeful computers!), so requesting a lot of memory "just in case" is not necessarily the best thing to do. There is an interesting tool (at Orpheus) called freqmem (run freqmem with no arguments for usage instructions) that gives an estimate of the optimum memory allocation, in megawords; multiply by 8 to get MB (e.g., 100 MW corresponds to 800 MB). By the way, in my ~/MyTools/ directory I have a script called mem_freq.pl which, when called with the name of a Gaussian output file as its argument, reads the number of atoms and basis functions, calls freqmem, and outputs the number of MB of %mem required for a freq calculation on that molecule with that basis set (assuming Direct integration, RHF and spd functions; for the Conventional integration method, UHF, and/or f or higher functions, call freqmem directly). Using this script is a convenient way of finding out how much memory to allocate for the freq calculation that follows an opt job just finished, for example.

This value is the minimum amount of memory in which the whole CPHF integral matrix fits. Recall that the default value for %mem in Gaussian, if nothing else is specified, is a pitiful 48 MB; that is roughly 10% of what some of our PCs can provide!
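To make this concrete, here is what the input for one of these freq jobs could look like, with %mem set explicitly instead of relying on the default. This is only a sketch: the geometry is approximate and the 800MB figure is a placeholder, to be replaced by whatever freqmem (plus a margin) suggests for the molecule at hand.

    %chk=ccl4_freq.chk
    %mem=800MB
    #p B3LYP/6-311+G* Freq

    CCl4 frequency calculation with an explicit %mem (instead of the 48 MB default)

    0 1
    C
    Cl 1 1.77
    Cl 1 1.77 2 109.4712
    Cl 1 1.77 2 109.4712 3 120.0
    Cl 1 1.77 2 109.4712 3 -120.0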

The output of freqmem, unfortunately, does not seem to be the best %mem assignment to make. Some results follow, which will hopefully clarify the subject. Clicking on the pictures leads to double-size versions of them.

These figures display the results of frequency calculations on CnCl2n+2 moieties at the B3LYP/6-311+G* level of theory, run on node tx41 of Orpheus (a 1.7 GHz P4 with 256 MB of RAM). The green vertical lines mark the memory required by each of them according to the aforementioned freqmem command, while the red line marks the physical memory limit of the computer (256 MB).

Two variables are displayed against the value given to %mem in the input (in MB). The blue dots correspond to the number of derivatives done at once; if the allocated memory is not enough to store all the derivatives, more than one pass will be necessary to complete the CPHF matrix, and this will have a negative impact on performance.

On the other hand, the green crosses mark the percent time difference between the given %mem and %mem=50MB, which is a good approximation to the 48 MB default. The more negative the value, the faster the job finished (by the way, wall times have been used, not CPU times).

[Figures: %mem benchmark plots for CCl4, C2Cl6, C3Cl8, C4Cl10, C5Cl12 and C6Cl14]

Brief conclusions

The bigger the job, the more critical the effect of a larger memory allocation on performance. For small jobs the default is fine, and going to the high end is very counterproductive. For larger calculations allocating too much RAM is also detrimental, but much less so than allocating too little memory.

It also looks like allocating just the amount of memory reported by freqmem is not enough for optimum performance; a bit extra seems to be necessary (just don't ask me how much more).

Only God knows why %mem=70 is so bad for C3Cl8, as is %mem=80 for C4Cl10.

t vs. nCPU

A frequency calculation has been performed on [Cp2ZrCH3]+ at the B3LYP/LanL2DZ level of theory (90 active electrons, 287 basis functions, 528 primitives). The calculation was carried out on 1, 2 and 4 nodes of the vlarge queue at Orpheus, and on 1, 2 and 4 CPUs of a single node at Arina. The results are shown in the following figure:

[Figure: t vs. nCPU for the [Cp2ZrCH3]+ frequency calculation]

Conclusions:

The frequency calculation speed scales well with the number of CPUs, at least up to 4 processors, provided that %mem is set to a high enough value (so that the actual number of CPUs used does not decrease because of lack of memory).
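For reference, the number of CPUs is requested through the Link 0 section of the input. A sketch of what one of the single-node runs could look like follows; the file name and the 1600MB value are placeholders, the keyword spelling is the one used in recent Gaussian versions (%nproc in older ones), the job assumes a previous opt was run with the same %chk, and the multi-node runs at Orpheus would use Linda parallelism (%nproclinda) instead.

    %chk=cp2zrch3.chk
    %nprocshared=4
    %mem=1600MB
    #p B3LYP/LanL2DZ Freq geom=check guess=read

    [Cp2ZrCH3]+ frequency calculation on 4 CPUs, reading geometry and guess from the opt checkpoint

    1 1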

Geometry optimizations

The resend trick

In the first step of a geometry optimization the wavefunction corresponding to the starting geometry is calculated; from it the forces acting upon each atom are computed (together with an estimate of the Hessian), and the atoms are subsequently displaced in the direction of the forces obtained.

It is a common phenomenon to have an opt job in which the forces have long since converged, but which does not fully converge because the displacements are still too large.

I don't really understand how it works, but the displacement assigned to each atom is not exactly proportional to the force acting upon it; it is also affected by some kind of "history" of the previous steps (I don't know whether this is true only for GDIIS or always). In any case, we can eliminate the "history" part by killing the job and resubmitting it with the usual guess=read and geom=check keywords. As easy as that. I have had many a job converge in just one step after being resent this way.

It is also known that the wavefunction, although fully converged at each opt step, is calculated using the converged wavefunction of the previous step as its initial guess. We can force a "full" restart of the wavefunction (that is, take the default Huckel guess or whatever as the starting point) simply by leaving out guess=read when following the procedure outlined in the paragraph above. This, perhaps a bit surprisingly, can also lead to faster convergence in some cases (and, of course, it is a complete waste of time 90% of the time, but...).
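In practice, the resend amounts to preparing an input like the one below, pointing to the checkpoint file of the killed job (the file name, %mem and level of theory are placeholders; the route should otherwise match the original job). Leaving out guess=read gives the "full wavefunction restart" variant just described.

    %chk=myopt.chk
    %mem=400MB
    #p B3LYP/6-311+G* Opt=GDIIS geom=check guess=read

    Optimization resent: geometry and SCF guess taken from the killed job's checkpoint

    0 1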

The effect of %mem

Calculations similar to those performed for the CnCl2n+2 frequencies have been carried out for proline n-mer GDIIS optimizations. In this case one can conclude that the value of %mem has absolutely no effect whatsoever, except in the case of very small molecules (where probably all the previous steps are taken into account for the DIIS).

[Figures: %mem benchmark plots for the proline n-mer GDIIS optimizations]

Different methods applied to an example

I have studied the following couple of molecules:

the [Cp2ZrCH3]+ cation
the [Cp2ZrCH3]+ / [CH3B(C6F5)3]- ion pair

I have optimized their geometries and calculated the energies with different methods, as outlined in the following table:

Reference B3LYP/SKBJ//B3LYP/SKBJ
test-1 B3LYP/SKBJ//HF/SKBJ
test-2 B3LYP/SKBJ//B3LYP/SKBJ with C6F5 rings at STO-3G
test-3 B3LYP/SKBJ//B3LYP/SKBJ with C6F5 rings Frozen
test-4 B3LYP/SKBJ//HF/SKBJ with C6F5 rings Frozen
test-5 B3LYP/SKBJ//B3LYP/SKBJ without opt=(gdiis)
test-6 B3LYP/SKBJ//HF/SKBJ without opt=(gdiis)
test-7 B3LYP/SKBJ//B3LYP/SKBJ with internal C6F5 ring distances Frozen
test-8 B3LYP/SKBJ//HF/SKBJ with internal C6F5 ring distances Frozen

The first thing to mention is that I have used opt=(gdiis) in all calculations, except where noted otherwise. This keyword seems quite useful, at least in this system. In fact, in the case of test-5 the optimization of the cation went from converging in 44 steps/18 hours (Reference) to not converging in 150 steps/57 hours.

The second thing is that using STO-3G for some atoms caused convergence problems and a noticeable increase in computation time. It is clear that the SKBJ pseudopotential is more time-effective than even the smallest all-electron basis set, and probably not really less accurate.

Thirdly, freezing some internal coordinates can lead to oscillatory convergence, as in the case of test-7, where the optimization came close to convergence normally but then could not fully converge in 102 steps/452 hours.
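Just to show the syntax, freezing an internal distance (as in tests 7-8) is done through opt=modredundant. The following is a made-up ethane example with the C-C bond frozen, only meant to illustrate the input format; the actual tests of course used the SKBJ pseudopotential and the zirconocene species above.

    %mem=400MB
    #p B3LYP/6-31G* Opt=(GDIIS,ModRedundant)

    Ethane with the C1-C2 distance frozen (F) during the GDIIS optimization

    0 1
    C   0.000000   0.000000   0.765000
    C   0.000000   0.000000  -0.765000
    H   1.019000   0.000000   1.164000
    H  -0.509500   0.882400   1.164000
    H  -0.509500  -0.882400   1.164000
    H  -1.019000   0.000000  -1.164000
    H   0.509500   0.882400  -1.164000
    H   0.509500  -0.882400  -1.164000

    B 1 2 F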

Results for tests 1, 3-4 and 6-8 are presented below:

                Reference   test-1   test-3   test-4   test-6   test-7   test-8
Complex
  t (min)          7154.5   3288.2   3079.8    570.2   4194.6  27107.2   3151.0
  Ncyc                 27       25       12        2       33      102       24
  t ratio               -     0.46     0.43     0.08     0.59     3.79     0.44
  tcyc ratio            -     0.50     0.97     1.08     0.48     1.00     0.50
Cation
  t (min)          1080.2    357.8   1080.2    357.8   2001.8   1080.2    357.8
  Ncyc                 44       49       44       49       54       44       49
  t ratio               -     0.33     1.00     0.33     1.85     1.00     0.33
  tcyc ratio            -     0.51     1.00     0.46     0.47     1.00     0.46

Conclusions:

  1. Given that Gaussian has no pruned integration grid for Zr, the numerical integration of the DFT functional is excruciatingly slow, leading to an astonishing time saving when only HF integrals are computed. A subsequent B3LYP single point (see the input sketch after this list) turns out to be a very good approximation for energy differences (not shown).
  2. Freezing some of the optimization variables (tests 3 and 4) leads to an additional time saving, as could have been predicted. Anyway, even if freezing the Cartesian coordinates gives a noticeable saving, doing the same with internal variables (interatomic distances) is not really effective (tests 7-8).
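For completeness, the X//Y recipe of tests 1, 4 and 6 (optimize cheaply, then compute a better single-point energy) can be written as a single two-step input with --Link1--. This is only a sketch with a generic small molecule and basis set, not the actual SKBJ/GDIIS inputs used above.

    %chk=water.chk
    #p HF/6-31G* Opt

    Step 1: geometry optimization at the cheap (HF) level

    0 1
    O   0.000000   0.000000   0.117300
    H   0.000000   0.757200  -0.469200
    H   0.000000  -0.757200  -0.469200

    --Link1--
    %chk=water.chk
    #p B3LYP/6-31G* geom=check guess=read

    Step 2: B3LYP single point at the HF geometry (B3LYP/6-31G*//HF/6-31G*)

    0 1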

Single Point Calculations

%mem vs. nCPU

Some single point calculations have been carried out at Arina on the same system, with four different basis sets of 691, 863, 1254 and 2469 basis functions, respectively. Each calculation has been run on 1, 2 and 4 CPUs (of the same node), and the minimum %mem needed for the job to finish correctly while actually running on all the requested CPUs has been collected and plotted in the following graph:

[Figure: minimum %mem vs. number of CPUs for the four basis set sizes]

It turns out that the required amount of memory is nearly linear in the number of CPUs running the job:

%mem = A nCPU + B

The values of A (in MB per CPU) and B (in MB) for the equation above, together with the linear regression coefficient R, are summarized below for each number of basis functions (NB):

  NB        A        B        R
  691      20       11.4     0.99419
  863      35       16.4     0.99718
 1254      65       25       1.00000
 2469     227.5     66       0.99891
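As an illustration of how to read this: for the 1254-basis-function job the fit predicts that running on 4 CPUs needs roughly %mem = 65·4 + 25 = 285 MB, against about 90 MB on a single CPU.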