Notes about training DL models

Here are a few lessons learned about tuning DL models:

  • We should train with the largest batch size the GPU RAM allows (see the batch-size probe sketch after this list)
  • Decaying the learning rate (LR) tends to lead to better-converged models
  • This also holds for Adam (see the LR-schedule sketch after this list)
  • For DataLoader, set num_workers to four times the number of GPUs and enable pin_memory (see the DataLoader sketch after this list).
  • A rule of thumb: LR should be doubled when doubling the batch size
  • Increasing the LR too much makes loss-surface curvature effects important: training becomes unstable and the loss balloons.
  • If the LR is too big, the large changes in parameters can cause the model to forget earlier batches (forgetfulness). To see how big a problem this is, evaluate the fixed model on earlier batches and watch how the loss deteriorates (see the sketch after this list).
  • Gradient accumulation does not extend the runtime and can effectively increase the batch size (see the sketch after this list).
  • For Kaggle competitions, K-fold CV gives a much tighter correlation between CV and LB scores than a single train/test split (see the sketch after this list).
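
Batch-size probe: a minimal sketch, assuming a CUDA device is available and using a hypothetical helper (find_max_batch_size) with a toy model; it doubles the batch size until a forward+backward pass runs out of GPU memory. It ignores optimizer-state memory, so leave some headroom in practice.

    import torch
    from torch import nn

    def find_max_batch_size(model, input_shape, device="cuda", start=8, limit=4096):
        # Hypothetical helper: double the batch size until a forward+backward
        # pass no longer fits in GPU memory, then return the last size that did.
        model = model.to(device)
        batch_size = start
        while batch_size <= limit:
            try:
                x = torch.randn(batch_size, *input_shape, device=device)
                model(x).sum().backward()            # probe forward + backward memory
                model.zero_grad(set_to_none=True)
                batch_size *= 2
            except RuntimeError:                     # CUDA OOM surfaces as a RuntimeError
                torch.cuda.empty_cache()
                break
        return batch_size // 2

    # e.g. find_max_batch_size(nn.Linear(4096, 10), input_shape=(4096,))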
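
LR schedule: a minimal sketch of LR decay with Adam in PyTorch, using cosine annealing as one possible schedule; the toy model, data, and base LR are illustrative assumptions. The first lines also apply the doubling rule of thumb: LR is scaled with the batch size.

    import torch
    from torch import nn, optim

    model = nn.Linear(128, 10)                       # toy model for illustration
    x, y = torch.randn(512, 128), torch.randint(0, 10, (512,))

    base_lr, base_bs, batch_size = 3e-4, 256, 512
    lr = base_lr * batch_size / base_bs              # double the batch size -> double the LR

    optimizer = optim.Adam(model.parameters(), lr=lr)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

    for epoch in range(20):
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()                             # decay the LR once per epoch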
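
DataLoader settings: a sketch of the num_workers / pin_memory advice; the dataset here is a random in-memory placeholder, and 4 workers per GPU is the rule of thumb above, not a universal constant.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(10_000, 128),         # placeholder data
                            torch.randint(0, 10, (10_000,)))

    num_gpus = max(torch.cuda.device_count(), 1)
    loader = DataLoader(
        dataset,
        batch_size=256,
        shuffle=True,
        num_workers=4 * num_gpus,     # rule of thumb: 4 workers per GPU
        pin_memory=True,              # page-locked host memory speeds up transfers to GPU
    )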
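
Forgetfulness check: a sketch of one way to do the evaluation on earlier batches; keep a small cache of batches seen earlier in training and periodically measure the current model's loss on them. The model and cached batches below are hypothetical placeholders.

    import torch
    from torch import nn

    model = nn.Linear(128, 10)                       # stand-in for the real model
    earlier_batches = [(torch.randn(64, 128), torch.randint(0, 10, (64,)))
                       for _ in range(5)]            # batches cached from earlier in training

    @torch.no_grad()
    def loss_on_earlier_batches(model, batches):
        # Evaluate the current (fixed) model on batches it trained on earlier;
        # a rising value over time suggests the LR is causing forgetfulness.
        model.eval()
        losses = [nn.functional.cross_entropy(model(x), y).item() for x, y in batches]
        model.train()
        return sum(losses) / len(losses)

    print(loss_on_earlier_batches(model, earlier_batches))   # call periodically during training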
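
Gradient accumulation: a sketch in which each micro-batch loss is divided by the number of accumulation steps and the optimizer steps only once per accumulated "big" batch. The model, data, and step count are illustrative.

    import torch
    from torch import nn, optim
    from torch.utils.data import DataLoader, TensorDataset

    model = nn.Linear(128, 10)                       # toy model for illustration
    optimizer = optim.Adam(model.parameters(), lr=3e-4)
    loader = DataLoader(TensorDataset(torch.randn(1024, 128),
                                      torch.randint(0, 10, (1024,))),
                        batch_size=64)

    accum_steps = 4                                  # effective batch size = 64 * 4 = 256
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = nn.functional.cross_entropy(model(x), y)
        (loss / accum_steps).backward()              # gradients add up across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()                         # one update per accumulated batch
            optimizer.zero_grad()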
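
K-fold CV: a sketch with scikit-learn; the random data and the logistic-regression model are placeholders for whatever the competition actually uses. The mean of the fold scores is the CV estimate to compare against the leaderboard.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    X = np.random.randn(1000, 20)                    # placeholder features and labels
    y = np.random.randint(0, 2, 1000)

    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    fold_scores = []
    for train_idx, valid_idx in kf.split(X):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        fold_scores.append(model.score(X[valid_idx], y[valid_idx]))

    cv_score = float(np.mean(fold_scores))           # compare this to the leaderboard score
    print(f"CV accuracy: {cv_score:.3f}")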

6 Apr 2022