Batch Normalization

Image from https://medium.com/@ilango100/batch-normalization-speed-up-neural-network-training-245e39a62f85

In 2015, a very effective technique called Batch Normalization was proposed to address the vanishing/exploding gradients problems. It learns two parameters per layer to find the optimal scale and mean of that layer's inputs. The Batch Normalization operation is placed just before the activation function of each layer (see the Keras sketch at the end of this section). It zero-centers and normalizes the inputs over the current mini-batch (this is why it is called batch normalization), and then scales and shifts the results using the two parameters ($latex \gamma$ and $latex \beta$). The model is trained to find the optimal values for these scaling and shifting parameters. Let’s take a look at the following equations:

$latex \mu = \frac{1}{m} \sum_{i=1}^{m} \textbf{x}^{(i)}$

$latex \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (\textbf{x}^{(i)} - \mu)^2$

$latex \textbf{x}^{(i)}_{new} = \frac{\textbf{x}^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$

$latex \textbf{z}^{(i)} = \gamma \textbf{x}^{(i)}_{new} + \beta$

These equations show how batch normalization works. $latex \mu$ is the mean over the mini-batch of $latex m$ training instances $latex \textbf{x}^{(i)}$, and $latex \sigma^2$ is the variance over the same mini-batch (so $latex \sigma$ is the standard deviation). $latex \textbf{x}^{(i)}_{new}$ is the zero-centered and normalized input $latex \textbf{x}^{(i)}$. $latex \epsilon$ is a tiny number added to avoid division by zero; it is typically set to $latex 10^{-5}$. Finally, $latex \gamma$ and $latex \beta$ are the trainable scaling and shifting parameters, respectively, and $latex \textbf{z}^{(i)}$ is the output of the Batch Normalization operation.
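
Here is a minimal NumPy sketch of these four equations applied to a single mini-batch during training. The function name and the toy mini-batch are mine, just for illustration:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Apply the four BN equations to one mini-batch.

    x     : mini-batch of shape (m, n_features)
    gamma : per-feature scale parameter, shape (n_features,)
    beta  : per-feature shift parameter, shape (n_features,)
    """
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance (sigma^2)
    x_new = (x - mu) / np.sqrt(var + eps)    # zero-center and normalize
    z = gamma * x_new + beta                 # scale and shift
    return z, mu, var

# Example: a mini-batch of 4 instances with 3 features each
x = np.random.randn(4, 3)
gamma, beta = np.ones(3), np.zeros(3)        # typical initialization
z, mu, var = batch_norm_train(x, gamma, beta)
print(z.mean(axis=0), z.std(axis=0))         # roughly 0 and 1 per feature
```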

Here is a question: how can batch normalization be used when testing your model? At test time there may be no mini-batch at all (you may need to predict for a single instance), so the mini-batch mean and standard deviation are not available. In this case, an estimate of the whole training set’s mean and standard deviation is used instead; most implementations build this estimate during training by keeping a moving average of the mini-batch means and variances.
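
As a rough sketch of the test-time behavior (assuming the running statistics were collected during training; the momentum value in the comment is an illustrative choice, and frameworks such as Keras handle this bookkeeping automatically):

```python
import numpy as np

def batch_norm_infer(x, gamma, beta, running_mu, running_var, eps=1e-5):
    """Test-time BN: normalize with statistics estimated during training,
    not with the (possibly single-instance) batch at hand."""
    x_new = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_new + beta

# During training, the running statistics are typically updated after
# every mini-batch with an exponential moving average, e.g.:
#   running_mu  = 0.99 * running_mu  + 0.01 * mu
#   running_var = 0.99 * running_var + 0.01 * var
```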

Now let’s take a look at the pros and cons of batch normalization.

  • Pros
    • Reduces the vanishing gradients problem
    • Less sensitive to the weight initialization
    • Able to use much larger learning rates to speed up the learning process
    • Acts like a regularizer
  • Cons
    • Slower predictions due to the extra computations at each layer
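
To connect this back to the layer placement described at the beginning of this section, here is a minimal Keras sketch (the layer sizes and the 28×28 input shape are illustrative assumptions, not part of the text): each BatchNormalization layer sits just before its activation function, and at prediction time Keras automatically switches to the moving averages it tracked during training.

```python
from tensorflow import keras

# Illustrative model: sizes and input shape are assumptions, not from the text.
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, use_bias=False),   # BN provides its own shift (beta)
    keras.layers.BatchNormalization(),         # placed just before the activation
    keras.layers.Activation("relu"),
    keras.layers.Dense(100, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd")
```

Since each BatchNormalization layer already adds a per-feature shift parameter $latex \beta$, the preceding Dense layers can drop their own bias terms (use_bias=False) without losing expressiveness.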