4 Approaches to Deal with Vanishing/Exploding Gradients Problems
One of the common problems in deep learning is the vanishing gradients problem or the exploding gradients problem, which makes it hard to train the lower layers. In this post, we will look at what the vanishing and exploding gradients problems are and at approaches that alleviate them.
The backpropagation algorithm propagates gradients from the output layer back to the input layer, updating the parameters along the way to minimize the cost function. In neural networks with many layers, the gradients often become smaller and smaller as backpropagation proceeds down to the lower layers. This is called the vanishing gradients problem: the parameters in the lower layers receive little or no update, so those layers are never trained well. The opposite problem is called the exploding gradients problem: the gradients grow bigger and bigger during backpropagation, the parameters in the lower layers receive updates that are far too large, and the model diverges.
Why does the vanishing gradients problem happen?
Before around 2010, it was popular to use the sigmoid function as the activation function and to initialize the parameters randomly from a normal distribution with a mean of 0 and a standard deviation of 1. Unfortunately, with this combination the variance of the outputs of each layer is much greater than the variance of its inputs, so the activations in the upper layers saturate at 0 or 1. At these saturated values the gradient of the sigmoid is close to 0, so backpropagation starts with almost no gradient to propagate, and the parameters in the lower layers are barely updated.
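The short NumPy sketch below illustrates this effect on a toy setup I made up for demonstration (a 5-layer net with 100 units per layer and N(0, 1) weights): after each sigmoid layer, most activations end up pinned near 0 or 1, where the sigmoid's gradient is nearly zero.

```python
# Minimal sketch: forward pass through sigmoid layers with N(0, 1) weights.
# The network sizes here are arbitrary choices, not from the original post.
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=(1000, 100))  # a batch of random inputs
for layer in range(5):
    w = rng.normal(loc=0.0, scale=1.0, size=(100, 100))  # N(0, 1) init
    x = sigmoid(x @ w)
    # fraction of activations saturated near 0 or 1 (sigmoid gradient ~ 0 there)
    saturated = np.mean((x < 0.01) | (x > 0.99))
    print(f"layer {layer + 1}: {saturated:.0%} of activations saturated")
```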
Approaches to handling the vanishing/exploding gradients problems
1) Initialization
We can alleviate this problem by making the variance of the outputs of each layer close to the variance of its inputs. Several effective weight initialization schemes have been proposed for this purpose and have shown good performance. The most popular are Xavier (Glorot) initialization and He initialization, as in the sketch below.
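As a rough illustration, here is a minimal tf.keras sketch (the layer sizes are arbitrary choices) showing how these initializers can be requested when building a layer. Xavier/Glorot initialization is the usual pairing with sigmoid or tanh activations, while He initialization was designed for the ReLU family.

```python
# Minimal sketch: choosing weight initializers in tf.keras.
import tensorflow as tf

model = tf.keras.Sequential([
    # Xavier/Glorot initialization, typically paired with tanh or sigmoid
    tf.keras.layers.Dense(300, activation="tanh",
                          kernel_initializer="glorot_uniform"),
    # He initialization, designed for ReLU-family activations
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```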
2) Non-saturating Activation Functions
Using a non-saturating activation function is another good way to alleviate this problem. The options include the leaky ReLU, the randomized leaky ReLU (RReLU), the parametric leaky ReLU (PReLU), the ELU, and more.
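Below is a minimal tf.keras sketch (again with arbitrary layer sizes) showing two of these options: ELU can be requested by name on a Dense layer, while leaky ReLU is typically added as its own layer after a linear Dense layer.

```python
# Minimal sketch: non-saturating activations in tf.keras.
import tensorflow as tf

model = tf.keras.Sequential([
    # ELU requested directly by name
    tf.keras.layers.Dense(300, activation="elu",
                          kernel_initializer="he_normal"),
    # leaky ReLU applied as a separate layer (slope is configurable)
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```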
3) Batch Normalization
In 2015, a very effective technique called Batch Normalization was proposed to address these problems. It lets the model learn the optimal scale and mean of the inputs to each layer. A Batch Normalization layer is placed just before the activation function of each layer: it zero-centers and normalizes the inputs, and then scales and shifts the result using two new parameters per layer. The model is trained to find the optimal values of these scaling and shifting parameters.
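The following minimal tf.keras sketch places a BatchNormalization layer before each activation, as described above; the layer sizes are arbitrary, and dropping the Dense bias is an optional simplification since Batch Normalization adds its own shift parameter.

```python
# Minimal sketch: Batch Normalization placed before each activation.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(300, use_bias=False),  # BN provides its own shift
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(100, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```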
4) Gradient Clipping
Dealing with the exploding gradients problem is much easier than dealing with the vanishing gradients problem: we can alleviate it simply by clipping the gradients during backpropagation whenever they exceed a threshold. This technique is called Gradient Clipping.
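As a minimal tf.keras sketch, gradient clipping can be requested through the optimizer; the threshold of 1.0 and the tiny model here are illustrative choices, not recommendations.

```python
# Minimal sketch: gradient clipping via the optimizer in tf.keras.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])

# clip each gradient component to the range [-1.0, 1.0]
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)
# or clip the whole gradient vector by its L2 norm instead:
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer)
```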