Notes about training DL models

Here are a few lessons learned about tuning DL models:

  • We should train with the largest batch size the GPU RAM allows (see the batch-size probe sketch after this list)
  • Decaying the learning rate (LR) tends to lead to better-converged models
  • This also holds for Adam (see the LR-schedule sketch after this list)
  • For DataLoader, set num_workers to four times the number of GPUs and enable pin_memory (see the DataLoader sketch after this list).
  • A rule of thumb: LR should be doubled when doubling the batch size
  • Increasing the LR too much makes loss-surface curvature effects important: training becomes unstable and the loss balloons.
  • If the LR is too big, the large changes in parameters can cause the model to forget earlier batches (forgetfulness). To see how big a problem this is, evaluate the fixed model on earlier batches and watch how the loss deteriorates (see the sketch after this list).
  • Gradient accumulation does not extend the runtime and can effectively increase the batch size (see the sketch after this list).
  • For Kaggle competitions, K-fold CV gives a much tighter correlation between CV and LB scores than a single train/test split (see the sketch after this list).
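
Batch-size probe: a minimal sketch, assuming a CUDA device is available and using a hypothetical helper (find_max_batch_size) with a toy model; it doubles the batch size until a forward+backward pass runs out of GPU memory. It ignores optimizer-state memory, so leave some headroom in practice.

    import torch
    from torch import nn

    def find_max_batch_size(model, input_shape, device="cuda", start=8, limit=4096):
        # Hypothetical helper: double the batch size until a forward+backward
        # pass no longer fits in GPU memory, then return the last size that did.
        model = model.to(device)
        batch_size = start
        while batch_size <= limit:
            try:
                x = torch.randn(batch_size, *input_shape, device=device)
                model(x).sum().backward()            # probe forward + backward memory
                model.zero_grad(set_to_none=True)
                batch_size *= 2
            except RuntimeError:                     # CUDA OOM surfaces as a RuntimeError
                torch.cuda.empty_cache()
                break
        return batch_size // 2

    # e.g. find_max_batch_size(nn.Linear(4096, 10), input_shape=(4096,))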
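
LR schedule: a minimal sketch of LR decay with Adam in PyTorch, using cosine annealing as one possible schedule; the toy model, data, and base LR are illustrative assumptions. The first lines also apply the doubling rule of thumb: LR is scaled with the batch size.

    import torch
    from torch import nn, optim

    model = nn.Linear(128, 10)                       # toy model for illustration
    x, y = torch.randn(512, 128), torch.randint(0, 10, (512,))

    base_lr, base_bs, batch_size = 3e-4, 256, 512
    lr = base_lr * batch_size / base_bs              # double the batch size -> double the LR

    optimizer = optim.Adam(model.parameters(), lr=lr)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

    for epoch in range(20):
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()                             # decay the LR once per epoch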
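
DataLoader settings: a sketch of the num_workers / pin_memory advice; the dataset here is a random in-memory placeholder, and 4 workers per GPU is the rule of thumb above, not a universal constant.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(10_000, 128),         # placeholder data
                            torch.randint(0, 10, (10_000,)))

    num_gpus = max(torch.cuda.device_count(), 1)
    loader = DataLoader(
        dataset,
        batch_size=256,
        shuffle=True,
        num_workers=4 * num_gpus,     # rule of thumb: 4 workers per GPU
        pin_memory=True,              # page-locked host memory speeds up transfers to GPU
    )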
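
Forgetfulness check: a sketch of one way to do the evaluation on earlier batches; keep a small cache of batches seen earlier in training and periodically measure the current model's loss on them. The model and cached batches below are hypothetical placeholders.

    import torch
    from torch import nn

    model = nn.Linear(128, 10)                       # stand-in for the real model
    earlier_batches = [(torch.randn(64, 128), torch.randint(0, 10, (64,)))
                       for _ in range(5)]            # batches cached from earlier in training

    @torch.no_grad()
    def loss_on_earlier_batches(model, batches):
        # Evaluate the current (fixed) model on batches it trained on earlier;
        # a rising value over time suggests the LR is causing forgetfulness.
        model.eval()
        losses = [nn.functional.cross_entropy(model(x), y).item() for x, y in batches]
        model.train()
        return sum(losses) / len(losses)

    print(loss_on_earlier_batches(model, earlier_batches))   # call periodically during training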
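
Gradient accumulation: a sketch in which each micro-batch loss is divided by the number of accumulation steps and the optimizer steps only once per accumulated "big" batch. The model, data, and step count are illustrative.

    import torch
    from torch import nn, optim
    from torch.utils.data import DataLoader, TensorDataset

    model = nn.Linear(128, 10)                       # toy model for illustration
    optimizer = optim.Adam(model.parameters(), lr=3e-4)
    loader = DataLoader(TensorDataset(torch.randn(1024, 128),
                                      torch.randint(0, 10, (1024,))),
                        batch_size=64)

    accum_steps = 4                                  # effective batch size = 64 * 4 = 256
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = nn.functional.cross_entropy(model(x), y)
        (loss / accum_steps).backward()              # gradients add up across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()                         # one update per accumulated batch
            optimizer.zero_grad()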
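
K-fold CV: a sketch with scikit-learn; the random data and the logistic-regression model are placeholders for whatever the competition actually uses. The mean of the fold scores is the CV estimate to compare against the leaderboard.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    X = np.random.randn(1000, 20)                    # placeholder features and labels
    y = np.random.randint(0, 2, 1000)

    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    fold_scores = []
    for train_idx, valid_idx in kf.split(X):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        fold_scores.append(model.score(X[valid_idx], y[valid_idx]))

    cv_score = float(np.mean(fold_scores))           # compare this to the leaderboard score
    print(f"CV accuracy: {cv_score:.3f}")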

6 Apr 2022