
When training neural networks, one hyperparameter is the size of the mini-batch. Common choices are 32, 64, and 128 elements per mini-batch.

Are there any rules/guidelines on how big a mini-batch should be? Or any publications which investigate the effect on the training?


2 Answers


In "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima" there are a couple of interesting statements:

It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize [...]

large-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.
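To make the quoted point about gradient noise concrete, here is a small numerical sketch of my own (not from the paper), using NumPy on a toy least-squares problem: the spread of the mini-batch gradient estimate shrinks as the batch size grows, which is the "inherent noise" referred to above.

```python
# Toy illustration: variance of the mini-batch gradient estimate vs. batch size.
# All data and numbers here are made up for the example.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=10_000)
y = 3.0 * X + rng.normal(scale=0.5, size=X.shape)
w = 0.0  # parameter value at which we estimate the gradient

def minibatch_grad(batch_size):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    return np.mean(2 * (w * xb - yb) * xb)  # d/dw of mean squared error

for bs in (8, 32, 128, 512):
    grads = [minibatch_grad(bs) for _ in range(1000)]
    print(f"batch size {bs:4d}: gradient std = {np.std(grads):.3f}")
```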

From my master's thesis: the choice of the mini-batch size influences the following (a rough experiment sketch follows the list):

  • Training time until convergence: there seems to be a sweet spot. With a very small batch size (e.g. 8), convergence takes longer; with a huge batch size it is also slower than the optimum.
  • Training time per epoch: larger batches make each epoch faster, because the hardware is used more efficiently.
  • Resulting model quality: smaller batches tend to give better quality, presumably due to better generalization (though this is not settled).
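As a rough illustration of these trade-offs, here is an experiment sketch assuming PyTorch; the model, data, and numbers are toy placeholders. It loops over batch sizes and records wall-clock time for a few epochs and the final training loss, so you can see the per-epoch speedup of larger batches and compare how far each run gets.

```python
# Toy batch-size sweep: timing and final loss for several batch sizes.
import time
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(8192, 50)
y = torch.randint(0, 10, (8192,))
data = TensorDataset(X, y)

for batch_size in (8, 32, 128, 512, 2048):
    torch.manual_seed(0)  # same initialisation for every run
    model = nn.Sequential(nn.Linear(50, 64), nn.ReLU(), nn.Linear(64, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)

    start = time.time()
    for _ in range(5):                      # a few epochs
        for xb, yb in loader:
            opt.zero_grad()
            loss = nn.functional.cross_entropy(model(xb), yb)
            loss.backward()
            opt.step()
    print(f"batch {batch_size:5d}: {time.time() - start:5.1f}s, final loss {loss.item():.3f}")
```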

It is important to note hyper-parameter interactions: batch size may interact with other hyper-parameters, most notably the learning rate. In some experiments this interaction can make it hard to isolate the effect of batch size alone on model quality. Another strong interaction is with early stopping for regularisation.
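A commonly cited heuristic for the batch-size/learning-rate interaction is the linear scaling rule (see Goyal et al., "Accurate, Large Minibatch SGD"): when the batch size is multiplied by k, multiply the learning rate by k as well. The helper below is a hypothetical illustration of that rule, not part of any library, and the reference values are made up.

```python
# Linear scaling rule: scale the learning rate proportionally to the batch size.
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Return base_lr scaled linearly by batch_size / base_batch_size."""
    return base_lr * batch_size / base_batch_size

# Example: a reference setup of lr=0.1 at batch size 256, retuned for batch size 1024.
print(scaled_learning_rate(0.1, 256, 1024))  # -> 0.4
```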


Martin Thoma

Based upon Andrew Ng's Deep Learning Specialisation Course 2, here are a few things to keep in mind (a minimal sketch follows the list):

  1. Use mini-batch gradient descent if you have a large training set; for a small training set, plain batch gradient descent is fine.
  2. Mini-batch sizes are often chosen as a power of 2, e.g., 16, 32, 64, 128, 256, etc.
  3. When choosing the size for mini-batch gradient descent, make sure that the mini-batch fits in CPU/GPU memory.
  4. 32 is generally a good default choice.
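Putting these points together, here is a minimal NumPy sketch of mini-batch gradient descent with a power-of-2 batch size; the toy data, model, and numbers are illustrative, not from the course.

```python
# Mini-batch gradient descent on a toy logistic-regression problem, batch size 32.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(4096, 10))
true_w = rng.normal(size=10)
y = (X @ true_w + 0.1 * rng.normal(size=4096) > 0).astype(float)

w = np.zeros(10)
lr, batch_size, epochs = 0.5, 32, 20          # batch size: a power of 2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(epochs):
    order = rng.permutation(len(X))           # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = xb.T @ (sigmoid(xb @ w) - yb) / len(idx)   # logistic-loss gradient
        w -= lr * grad

accuracy = np.mean((sigmoid(X @ w) > 0.5) == y)
print(f"training accuracy: {accuracy:.3f}")
```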

To learn more, you can read: A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size