## Tips on calculation efficiency


I hope the info on this page will be helpful to the reader...


### Frequency calculations

#### Amount of RAM memory

A **crucial** parameter for frequency calculations is the amount of memory assigned to the job through the `%mem` keyword. For a frequency calculation, Gaussian tries to solve the so-called *CPHF equations*, computing some integrals and storing the results in a (big) matrix.

The values in such a matrix can be stored on disk, kept in memory, or recomputed as needed. Nowadays CPUs are so fast, and disk I/O operations so slow in comparison, that a **Direct** algorithm (meaning recalculate as needed; the default in Gaussian) is undoubtedly the fastest in almost any case. Even so, the **Direct** algorithm still stores as big a part of the CPHF matrix as it can in RAM, because retrieving a result from memory is faster than recomputing it (unlike retrieving it from the hard disk).

It follows from the previous paragraph that the larger the amount of memory allocated for a freq job, the bigger the chunk of the CPHF matrix stored there, and the smaller the part that has to be recomputed (whenever its results are needed later in the calculation) because it didn't fit in memory.
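For reference, `%mem` goes in the Link 0 section of the input file. A minimal sketch of a frequency input with an explicit memory request (the filename, level of theory and geometry are placeholders of mine, not values from the text):

```
%chk=molecule.chk
%mem=256MB
#p B3LYP/6-311+G* freq

Frequency job with an explicit memory request

0 1
[geometry goes here]
```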

It is common wisdom that whenever a job requests a large amount of memory, the computer *retaliates* by assigning less CPU time to it (damn vengeful computers!), so requesting a lot of memory *just in case* is not necessarily the best thing to do. There is an interesting tool (at *Orpheus*) called `freqmem` (run `freqmem` with no arguments for usage instructions), which gives an estimate of the optimum memory allocation (in megawords; multiply by 8 to get MB). By the way, in my `~/MyTools/` directory I have a *script* called `mem_freq.pl` which, when called with the name of a Gaussian output file as argument, reads the number of atoms and basis functions, calls `freqmem`, and outputs the number of MB of `%mem` required for a freq calculation on that molecule with that basis set (with defaults of **Direct**, **RHF** and **spd** functions; for the **Conventional** method of integration, **UHF** and/or **f** or higher functions, call `freqmem` directly). Using this *script* is a comfortable way of finding out how much memory to allocate for the freq calculation corresponding to an opt job just finished, for example.
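The megaword-to-MB conversion mentioned above is just a factor of 8, since a Gaussian word is 8 bytes. A trivial Python sketch of it (the function name is mine, not part of the actual `mem_freq.pl`):

```python
def megawords_to_mb(megawords: float) -> float:
    """Convert a freqmem estimate in megawords to megabytes (1 word = 8 bytes)."""
    return 8.0 * megawords

# e.g. a freqmem estimate of 32 MW would mean requesting about %mem=256MB
```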

This value is the *minimum* memory such that the **whole** CPHF integral matrix fits in it. Recall that the **default** value for `%mem` in Gaussian, if nothing else is specified, is a pitiful **48MB**. This is roughly 10% of the potential of some of our PCs!

The output of `freqmem`, unfortunately, does not seem to be the best `%mem` assignment to make. Some results follow, which will hopefully clarify the subject.

These figures display the results of running frequency calculations on C_{n}Cl_{2n+2} moieties, at the B3LYP/6-311+G* level of theory, run on node **tx41** of *Orpheus* (a 1.7GHz P4 with 256MB of RAM). The green vertical lines mark the memory required by each of them according to the aforementioned `freqmem` command, while the red line marks the physical memory limit of the computer (256MB).

Two variables are displayed against the value given to `%mem` in the input (in MB). The blue dots correspond to the number of derivatives done *at once*; if the allocated memory is not enough to store all the derivatives, more than one pass will be necessary to complete the CPHF matrix, and this has a negative impact on performance. On the other hand, the green crosses mark the percent time *difference* between the given `%mem` and `%mem=50MB`, which is a good approximation to the default **48MB**. The more negative the value, the *faster* the job finished (by the way, walltimes have been used, not CPU times).
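In other words, assuming the plotted quantity is the usual relative difference (my reading of the text, not a formula it gives explicitly), it would be computed as:

```python
def percent_time_diff(walltime: float, walltime_ref: float) -> float:
    """Percent walltime difference of a run vs. the %mem=50MB reference run.

    Negative values mean the job finished faster than the reference.
    """
    return 100.0 * (walltime - walltime_ref) / walltime_ref
```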

#### Brief conclusions

The bigger the job, the more critical the effect of a larger memory allocation on performance. For small jobs the default is fine, and going to the high end is *very* counterproductive. For larger calculations, allocating too much RAM is also negative, indeed, but **much** less so than allocating too little.

It also looks like allocating just the amount of memory output by `freqmem` is not enough for optimum performance; an extra bit seems to be necessary (just don't ask me **how much** more).

Only God knows why `%mem=70` is **so** bad for C_{3}Cl_{8}, as is `%mem=80` for C_{4}Cl_{10}.

### t vs. nCPU

A frequency calculation has been performed on [Cp_{2}ZrCH_{3}]^{+} at the B3LYP/LanL2DZ level of theory (90 active electrons, 287 basis functions, 528 primitives). The calculation was carried out on 1, 2 and 4 nodes of the *vlarge* queue at *Orpheus*, and on 1, 2 and 4 CPUs of a single node at *Arina*. The results are given in the following table:

**Conclusions:**

The frequency calculation speed scales well with the number of CPUs, at least up to 4 processors, provided that `%mem` is set to a high enough value (so that the real number of CPUs used does not decrease because of lack of memory).
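A sketch of the Link 0 section for such a parallel frequency job (keyword spellings as in recent Gaussian versions, where shared-memory parallelism is requested with `%nprocshared`; older versions use `%nproc`. All values and the filename are illustrative):

```
%nprocshared=4
%mem=2GB
%chk=cp2zrch3.chk
#p B3LYP/LanL2DZ freq

[Cp2ZrCH3]+ frequency job on 4 CPUs of one node

1 1
[geometry goes here]
```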

### Geometry optimizations

In the first step of a geometry optimization, the wavefunction corresponding to the starting geometry is calculated and the Hessian is derived from it, so that the forces acting upon each atom are computed; the atoms are then displaced in the direction of those forces.

#### The *resend* trick

It is a common phenomenon to have an opt job in which the forces converged long ago, but which does not fully converge because the values of the displacements are still too large.

I don't really understand how it works, but the displacement assigned to each atom is not exactly proportional to the force acting upon it; it also depends on some kind of "history" of previous steps (I don't know whether this is only true for GDIIS or always). Anyway, we can eliminate the "history" part by killing the job and sending it again, with the usual `guess=read` and `geom=check` keywords. So easy. I have had many a job converge in just one step after being resent this way.

It is also known that the wavefunction, although fully converged at *each* opt step, is calculated starting, as a guess, from the converged wavefunction of the previous step. We can force a "full" restart of the wavefunction (that is, take as a guess the default Hückel wavefunction or whatever) simply by **not** using `guess=read` when following the procedure outlined in the paragraph above. This, maybe a bit surprisingly, can also lead to faster convergence in some cases (and, of course, is a complete waste of time 90% of the time, but...).
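Put together, a resent job would look something like this sketch (the filename and route line are placeholders of mine; with `geom=check` only the charge/multiplicity line is needed in the molecule section):

```
%chk=myjob.chk
%mem=256MB
#p B3LYP/6-311+G* opt=(gdiis) guess=read geom=check

Resent optimization; drop guess=read to also restart the wavefunction guess

0 1

```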

#### The effect of `%mem`

Calculations similar to those performed for the C_{n}Cl_{2n+2} frequencies have been carried out for proline n-mer GDIIS optimizations. In this case one can conclude that the value of `%mem` has **absolutely** no effect whatsoever, except in the case of very small molecules (where probably **all** the previous steps are taken into account for the DIIS).

#### Different methods applied to an example

I have studied the following couple of molecules:

- [Cp_{2}ZrCH_{3}]^{+}
- [Cp_{2}ZrCH_{3}]^{+} / [CH_{3}B(C_{6}F_{5})_{3}]^{-}

I have optimized their geometries and calculated the energies with different methods, as outlined in the following table:

| Label | Method |
|---|---|
| Reference | B3LYP/SKBJ//B3LYP/SKBJ |
| test-1 | B3LYP/SKBJ//HF/SKBJ |
| test-2 | B3LYP/SKBJ//B3LYP/SKBJ with C_{6}F_{5} rings at STO-3G |
| test-3 | B3LYP/SKBJ//B3LYP/SKBJ with C_{6}F_{5} rings frozen |
| test-4 | B3LYP/SKBJ//HF/SKBJ with C_{6}F_{5} rings frozen |
| test-5 | B3LYP/SKBJ//B3LYP/SKBJ without `opt=(gdiis)` |
| test-6 | B3LYP/SKBJ//HF/SKBJ without `opt=(gdiis)` |
| test-7 | B3LYP/SKBJ//B3LYP/SKBJ with internal C_{6}F_{5} ring distances frozen |
| test-8 | B3LYP/SKBJ//HF/SKBJ with internal C_{6}F_{5} ring distances frozen |

The first thing to mention is that I have used `opt=(gdiis)` in all calculations, except where noted otherwise. This keyword seems quite useful, at least for this system. In fact, in the case of test-5, the optimization of the cation went from converging in 44 steps/18 hours (Reference) to not converging in 150 steps/57 hours.

The second thing is that using STO-3G for some atoms caused convergence problems, and a noticeable *increase* in computation time. It is clear that the SKBJ pseudopotential is more time-effective than even the smallest all-electron basis set, and probably not really less accurate.

Thirdly, freezing some internal coordinates can lead to oscillatory convergence, as in the case of test-7, which reached near-convergence normally but then could not fully converge in 102 steps/452 hours.

Results for tests 1, 3-4 and 6-8 are presented:

| | Reference | test-1 | test-3 | test-4 | test-6 | test-7 | test-8 |
|---|---|---|---|---|---|---|---|
| **Complex** | | | | | | | |
| t (min) | 7154.5 | 3288.2 | 3079.8 | 570.2 | 4194.6 | 27107.2 | 3151.0 |
| Ncyc | 27 | 25 | 12 | 2 | 33 | 102 | 24 |
| t ratio | - | 0.46 | 0.43 | 0.08 | 0.59 | 3.79 | 0.44 |
| tcyc ratio | - | 0.50 | 0.97 | 1.08 | 0.48 | 1.00 | 0.50 |
| **Cation** | | | | | | | |
| t (min) | 1080.2 | 357.8 | 1080.2 | 357.8 | 2001.8 | 1080.2 | 357.8 |
| Ncyc | 44 | 49 | 44 | 49 | 54 | 44 | 49 |
| t ratio | - | 0.33 | 1.00 | 0.33 | 1.85 | 1.00 | 0.33 |
| tcyc ratio | - | 0.51 | 1.00 | 0.46 | 0.47 | 1.00 | 0.46 |

**Conclusions:**

- Since Gaussian has no pruned grid for Zr, the integration over space of the DFT functional parts is excruciatingly slow, leading to an astonishing time saving when only HF integrals are computed. A subsequent B3LYP *single point* turns out to be a very good approximation for energy differences (not shown).
- Freezing some of the optimization variables (tests 3 and 4) leads to an additional time saving, as could have been predicted. Anyway, even if freezing the cartesian coordinates gives a noticeable saving, doing so with internal variables (interatomic distances) is not really effective (tests 7-8).
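The B3LYP//HF recipe from the first conclusion could be sketched as a two-step Gaussian job. This is a hedged sketch, not the author's actual input: the SKBJ basis/ECP would in practice be supplied through `Gen`/`Pseudo=Read` (or the corresponding built-in CEP basis names), and filenames, geometry and basis definitions are placeholders:

```
%chk=complex.chk
#p HF/Gen Pseudo=Read opt=(gdiis)

Step 1: HF geometry optimization (SKBJ basis and ECPs read below)

0 1
[geometry goes here]

[SKBJ basis set and pseudopotential definitions]

--Link1--
%chk=complex.chk
#p B3LYP/Gen Pseudo=Read geom=check guess=read

Step 2: B3LYP single point at the HF geometry

0 1

[SKBJ basis set and pseudopotential definitions]
```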

### Single Point Calculations

#### `%mem` vs. nCPU

Some *single point* calculations have been carried out at *Arina* on the same system, but with four different basis sets (691, 863, 1254 and 2469 basis functions, respectively). Each calculation has been run on 1, 2 and 4 CPUs (of the same node), and the minimum `%mem` needed for the job to finish correctly while actually running on *all* the CPUs has been collected and pictured in the following graph:

It turns out that the required amount of memory is nearly linear in the number of CPUs running the job:

`%mem` = **A** · nCPU + **B**

The values of **A** and **B** for the equation above, and the linear regression coefficient R, are summarized below:

| N_{B} | A | B | R |
|---|---|---|---|
| 691 | 20 | 11.4 | 0.99419 |
| 863 | 35 | 16.4 | 0.99718 |
| 1254 | 65 | 25 | 1.00000 |
| 2469 | 227.5 | 66 | 0.99891 |
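As a sanity check of such a fit, here is a minimal pure-Python least-squares sketch. It uses the N_{B} = 1254 row: since R = 1.00000 there, the three measured points must lie exactly on the fitted line, so the points below are reconstructed from A and B themselves (an assumption of mine, not the raw data):

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx      # slope A (MB per CPU)
    b = my - a * mx    # intercept B (MB)
    return a, b

# Points reconstructed from the N_B = 1254 row (A = 65, B = 25, R = 1)
ncpu = [1, 2, 4]
mem_mb = [25 + 65 * n for n in ncpu]   # 90, 155, 285 MB
A, B = linear_fit(ncpu, mem_mb)
```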