Gradient Descent

Gradient Descent is an optimization algorithm. The general idea is to tweak the parameters $latex \theta$ iteratively in order to minimize a cost function, using the gradient of the cost function with respect to the parameters. First, the parameters are initialized with random values or by initializers, and then we update them gradually, one step at a time. Each step decreases the cost function until it converges to a minimum.

The learning rate is one of the most important hyperparameters in Gradient Descent. If we set the learning rate too small, the algorithm will take a long time to converge to a minimum. On the other hand, if the learning rate is too high, the algorithm might overshoot the minimum and fail to converge at all. To find a good learning rate, we can use grid search or randomized search. There are three main variants of Gradient Descent: Batch Gradient Descent, Mini-batch Gradient Descent, and Stochastic Gradient Descent. We can choose an appropriate variant based on the amount of data and the trade-off between the accuracy of a parameter update and the time it takes to perform it.
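To make the effect of the learning rate concrete, here is a minimal sketch on the one-dimensional quadratic cost $latex J(\theta) = \theta^2$; the candidate learning rates are illustrative choices, not tuned values.

```python
# Minimal sketch: the effect of the learning rate on Gradient Descent
# for the 1-D quadratic cost J(theta) = theta^2, whose gradient is 2 * theta.
# The candidate learning rates below are illustrative, not tuned values.
for eta in (0.001, 0.1, 1.1):
    theta = 5.0  # arbitrary starting point
    for _ in range(100):
        theta -= eta * 2 * theta  # theta = theta - eta * grad J(theta)
    print(f"eta={eta}: theta after 100 steps = {theta:.4g}")
# eta=0.001 converges very slowly, eta=0.1 converges quickly,
# and eta=1.1 overshoots the minimum and diverges.
```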

Batch Gradient Descent

Batch Gradient Descent computes the gradient of the cost function with respect to the parameters $latex \theta$ over the entire set of observations as follows:

$latex \theta = \theta - \eta \cdot \nabla_\theta J(\theta)$

where $latex \eta$ is the learning rate, and $latex J(\theta)$ is the cost function. This approach requires looking at all observations to perform just one update. Batch Gradient Descent is guaranteed to converge to the global minimum for a convex error surface and to a local minimum for a non-convex error surface. However, its main disadvantage is that it is very slow on large datasets and is therefore not suitable for online models.
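As a concrete sketch, the NumPy snippet below implements Batch Gradient Descent for linear regression with a mean squared error cost; the function name and the synthetic data are assumptions made for this illustration.

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_steps=1000):
    """Performs one parameter update per pass over the entire dataset."""
    m, d = X.shape
    theta = np.random.default_rng(42).standard_normal(d)  # random initialization
    for _ in range(n_steps):
        # Gradient of the MSE cost J(theta) over all m observations.
        gradient = (2 / m) * X.T @ (X @ theta - y)
        theta -= eta * gradient  # theta = theta - eta * grad J(theta)
    return theta

# Synthetic data: y = 4 + 3x plus noise, with a bias column prepended.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
X = np.column_stack([np.ones(100), x])
y = 4 + 3 * x + rng.normal(scale=0.1, size=100)
print(batch_gradient_descent(X, y))  # approximately [4, 3]
```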

Stochastic Gradient Descent

Stochastic Gradient Descent updates the parameters for each observation $latex x^{(i)}$ and label $latex y^{(i)}$ as follows:

$latex \theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}; y^{(i)})$

This approach usually works much faster than Batch Gradient Descent because it does not need all the observations to perform an update. This also means that Stochastic Gradient Descent allows us to update a model online. Knowledge graphs and user behavior datasets are generally huge and dynamic, so Stochastic Gradient Descent can be a good fit for these kinds of data. However, it performs frequent updates with high variance, which causes the loss function to fluctuate as shown in Figure 1, whereas Batch Gradient Descent tends to converge to the minimum smoothly. Indeed, SGD can fluctuate around the minimum and might never settle exactly on it. On the other hand, this fluctuation enables Stochastic Gradient Descent to jump to new and potentially better local or global minima. We can improve SGD by decreasing the learning rate $latex \eta$ over time.

Figure 1. Fluctuations in the objective function as Stochastic Gradient Descent takes gradient steps (Source: Wikipedia)
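A minimal sketch of Stochastic Gradient Descent for the same linear regression setting follows; the simple $latex 1 / (1 + decay \cdot epoch)$ schedule is one common way to decrease $latex \eta$ over time, chosen here for illustration.

```python
import numpy as np

def stochastic_gradient_descent(X, y, eta=0.05, decay=0.01, n_epochs=50):
    """Performs one parameter update per observation, in shuffled order."""
    m, d = X.shape
    rng = np.random.default_rng(0)
    theta = rng.standard_normal(d)  # random initialization
    for epoch in range(n_epochs):
        eta_t = eta / (1 + decay * epoch)  # decrease the learning rate over time
        for i in rng.permutation(m):  # visit observations in random order
            # Gradient of the per-observation cost J(theta; x_i, y_i).
            gradient = 2 * X[i] * (X[i] @ theta - y[i])
            theta -= eta_t * gradient
    return theta
```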

Mini-batch Gradient Descent

Mini-batch Gradient Descent performs an update for every mini-batch of $latex n$ observations as follows:

$latex \theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$

At each step, Mini-batch Gradient Descent computes the gradients on a random set of $latex n$ observations, called a mini-batch, instead of computing them over all observations or over just one observation. This approach reduces the variance of the parameter updates and can gain a performance boost in a distributed computing environment. Common mini-batch sizes range between $latex 50$ and $latex 256$, but can vary across applications.
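The sketch below follows the same pattern as the earlier snippets but updates the parameters on random mini-batches; the batch size of 64 is an arbitrary choice for illustration.

```python
import numpy as np

def minibatch_gradient_descent(X, y, eta=0.05, batch_size=64, n_epochs=50):
    """Performs one parameter update per random mini-batch of observations."""
    m, d = X.shape
    rng = np.random.default_rng(0)
    theta = rng.standard_normal(d)  # random initialization
    for _ in range(n_epochs):
        idx = rng.permutation(m)  # draw new random mini-batches each epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of the cost over this mini-batch only.
            gradient = (2 / len(batch)) * X[batch].T @ (X[batch] @ theta - y[batch])
            theta -= eta * gradient
    return theta
```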

AdaGrad

AdaGrad adapts the learning rate $latex \eta$ to the parameters $latex \theta$ by performing larger updates for infrequent parameters and smaller updates for frequent parameters. This makes the approach well suited to sparse data such as knowledge graphs, which are very sparse in terms of connections between entities. AdaGrad updates every parameter $latex \theta_i$ at every time step $latex t$ as follows:

$latex \theta_{t+1,i} = \theta_{t,i} - \eta \cdot \nabla_\theta J(\theta_{t,i})$

and then, this approach modifies the learning rate $latex \eta$ at each time step $latex t$ based on the past gradients computed for $latex \theta_i$ as follows:

$latex \theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}} \cdot \nabla_{\theta} J(\theta_{t,i})$

where $latex G_t \in \mathbb{R}^{d \times d}$ is a diagonal matrix, and $latex G_{t,i}$ is its $latex i$-th diagonal element, which is the sum of the squares of the gradients with respect to $latex \theta_i$ up to time step $latex t$. $latex \epsilon$ is a smoothing term that avoids division by zero.

Now we can vectorize the update for all parameters $latex \theta$ by using an element-wise multiplication $latex \odot$ as follows:

$latex \theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{G_{t} + \epsilon}} \odot \nabla_{\theta} J(\theta)$
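A minimal sketch of this vectorized update is shown below; `grad_fn` is a placeholder for any function that returns $latex \nabla_{\theta} J(\theta)$, and the default values of $latex \eta$ and $latex \epsilon$ are illustrative choices.

```python
import numpy as np

def adagrad(grad_fn, theta0, eta=0.5, eps=1e-8, n_steps=1000):
    """Adapts a per-parameter learning rate via accumulated squared gradients."""
    theta = np.asarray(theta0, dtype=float).copy()
    G = np.zeros_like(theta)  # diagonal of G_t: running sum of squared gradients
    for _ in range(n_steps):
        g = grad_fn(theta)
        G += g ** 2  # this sum only grows, so the effective rate only shrinks
        theta -= eta / np.sqrt(G + eps) * g  # element-wise, as in the equation
    return theta

# Example: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta.
print(adagrad(lambda th: 2 * th, [3.0, -2.0]))  # both components approach 0
```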

The main benefit of AdaGrad is that we do not need to tune the learning rate manually. However, there are also disadvantages. One of them is that the accumulated sum in $latex G_t$ keeps growing during training, which causes the learning rate to shrink to a vanishingly small value.

AdaDelta

AdaDelta is an extended version of AdaGrad that addresses its weakness of a continually decreasing learning rate. Instead of accumulating all past squared gradients as AdaGrad does, this algorithm restricts the accumulation to a window of past squared gradients.
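In practice, the window is implemented as an exponentially decaying average of past squared gradients. The sketch below follows the standard AdaDelta formulation, which also keeps a decaying average of past squared updates so that no global learning rate $latex \eta$ is needed; the decay rate $latex \rho = 0.95$ is a typical choice rather than a value from the text.

```python
import numpy as np

def adadelta(grad_fn, theta0, rho=0.95, eps=1e-6, n_steps=1000):
    """Replaces AdaGrad's ever-growing sum with decaying averages."""
    theta = np.asarray(theta0, dtype=float).copy()
    avg_sq_grad = np.zeros_like(theta)    # E[g^2]: window of past squared gradients
    avg_sq_update = np.zeros_like(theta)  # E[delta^2]: window of past squared updates
    for _ in range(n_steps):
        g = grad_fn(theta)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g ** 2
        # Scale by the ratio of RMS(past updates) to RMS(past gradients);
        # unlike AdaGrad, the denominator can shrink again as gradients fade.
        delta = -np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * g
        avg_sq_update = rho * avg_sq_update + (1 - rho) * delta ** 2
        theta += delta
    return theta
```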