Here are a few lessons learned about tuning DL models:
- We should train with the maximal batch size the GPU RAM allows
- Decaying the learning rate (LR) tends to lead to better-converged models
- This is also true for Adam
- For DataLoader, set the number of workers to four times the number of GPUs and pin
memory (see the DataLoader sketch after this list)
- A rule of thumb: the LR should be doubled when doubling the batch size (see the
LR-scaling sketch after this list)
- Increasing the LR too much makes loss-surface curvature effects important: training
becomes unstable and the loss balloons.
- If the LR is too big, the large parameter updates can cause the model to forget
earlier batches (forgetfulness). To see how big a problem this is, evaluate a fixed
model on earlier batches and watch how the loss deteriorates (see the
forgetfulness-check sketch after this list).
- Gradient accumulation can effectively increase the batch size without extending
the runtime (see the accumulation sketch after this list).
- For Kaggle competitions, K-fold CV leads to a much tighter correlation between CV
and LB scores than a single train/test split (see the K-fold sketch after this list).
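
A minimal sketch of the DataLoader settings above; the dataset and batch size are
placeholders for your own data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

num_gpus = torch.cuda.device_count() or 1   # fall back to 1 on a CPU-only box
train_dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))  # dummy data

train_loader = DataLoader(
    train_dataset,
    batch_size=256,            # as large as GPU RAM allows
    shuffle=True,
    num_workers=4 * num_gpus,  # rule of thumb: four workers per GPU
    pin_memory=True,           # speeds up host-to-GPU transfers
)
```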
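A sketch of the LR-scaling rule combined with LR decay; the base LR, base batch
size, stand-in model, and cosine schedule are illustrative choices, not
prescriptions:

```python
import torch

model = torch.nn.Linear(16, 2)                 # stand-in model
base_lr, base_batch_size = 1e-3, 128           # assumed reference values
batch_size = 256                               # doubled batch size
lr = base_lr * batch_size / base_batch_size    # linear scaling: doubled LR

optimizer = torch.optim.Adam(model.parameters(), lr=lr)
# Decay the LR over the run; cosine annealing is one common choice.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one epoch of training goes here ...
    scheduler.step()
```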
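A sketch of the forgetfulness check: stash a few earlier batches and re-evaluate the
frozen model on them as training goes on. The batch cache and the cross-entropy loss
are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_on_earlier_batches(model, earlier_batches):
    """Average loss of a fixed model on batches it saw earlier in training."""
    model.eval()
    losses = [F.cross_entropy(model(x), y).item() for x, y in earlier_batches]
    model.train()
    return sum(losses) / len(losses)

# Usage sketch: keep a few (x, y) batches from early in the run in `earlier_batches`,
# then call this every N steps; a steadily rising value suggests the LR is too big.
```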
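A sketch of gradient accumulation: gradients are summed over `accum_steps`
micro-batches before each optimizer step, giving an effective batch size of
`accum_steps` times the micro-batch size. The model, optimizer, and dummy loader are
placeholders:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 2)                             # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
train_loader = [(torch.randn(64, 16), torch.randint(0, 2, (64,))) for _ in range(8)]  # dummy micro-batches

accum_steps = 4            # effective batch size = accum_steps * micro-batch size
optimizer.zero_grad()
for step, (x, y) in enumerate(train_loader):
    loss = F.cross_entropy(model(x), y)
    (loss / accum_steps).backward()   # scale so the summed gradient matches one big batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```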
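A sketch of K-fold CV with scikit-learn's `KFold`; the arrays are dummy data and
`train_and_score` is a hypothetical helper standing in for your own training loop:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.random.randn(1000, 16)            # dummy features
y = np.random.randint(0, 2, size=1000)   # dummy labels

def train_and_score(X_tr, y_tr, X_va, y_va):
    # Hypothetical helper: fit the model on the train split and
    # return a validation metric; replaced here with a placeholder.
    return 0.0

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, valid_idx in kf.split(X):
    fold_scores.append(train_and_score(X[train_idx], y[train_idx], X[valid_idx], y[valid_idx]))

print("CV score:", np.mean(fold_scores))  # compare this against the LB
```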
6 Apr 2022