The Normal Equation

To train a linear regression model (i.e., $latex \hat{y} = \theta^T \cdot \textbf{x}$), we can consider two different approaches:

  • Using a “closed-form” solution that directly computes the parameters $latex \theta$ that minimize the cost function.
  • Using an iterative optimization approach, called gradient descent, that gradually tweaks the parameters to minimize the cost function.

The closed-form solution is also called the Normal Equation.

$latex \theta = (\textbf{X}^T \cdot \textbf{X})^{-1} \cdot \textbf{X}^T \cdot \textbf{y}$

where $latex \textbf{X}$ is an $latex m \times (n+1)$ matrix that contains the training instances (the extra column holds the bias term $latex x_0 = 1$ for every instance), $latex m$ is the number of training instances, $latex n$ is the number of features, and $latex \textbf{y}$ is the vector of target values.
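
Here is a minimal NumPy sketch of the Normal Equation on synthetic data; the data, the variable names, and the true parameters (4 and 3) are illustrative choices, not anything prescribed by the equation itself.

```python
import numpy as np

# Illustrative synthetic data: y = 4 + 3*x + Gaussian noise
m = 100                                    # number of training instances
X = 2 * np.random.rand(m, 1)               # a single feature (n = 1)
y = 4 + 3 * X + np.random.randn(m, 1)      # target values

# Prepend x0 = 1 to every instance so X_b is m x (n+1)
X_b = np.c_[np.ones((m, 1)), X]

# Normal Equation: theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta)  # should be close to [[4.], [3.]]
```

In practice, the pseudoinverse (np.linalg.pinv) or a least-squares solver (np.linalg.lstsq) is numerically safer than an explicit inverse, and it still works when $latex \textbf{X}^T \cdot \textbf{X}$ is singular.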

Both approaches, the closed-form solution and the iterative one, yield similar parameters, so at first glance the Normal Equation seems to make a more complex algorithm unnecessary. Its main drawback is computational complexity, roughly $latex O(n^{2.4})$ to $latex O(n^3)$ depending on the implementation: inverting $latex \textbf{X}^T \cdot \textbf{X}$, an $latex (n+1) \times (n+1)$ matrix, becomes expensive when the number of features $latex n$ is large. In addition, the whole training set must fit in memory to compute the inversion, which makes the method hard to scale out to a parallel computing system. For a large number of features or many training instances, an iterative approach such as gradient descent is the better choice.
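
For comparison, here is a minimal sketch of batch gradient descent on the same kind of synthetic data; the learning rate and the iteration count are arbitrary illustrative values, not tuned settings.

```python
import numpy as np

# Same illustrative synthetic data as above: y = 4 + 3*x + noise
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]            # prepend the bias term x0 = 1

eta = 0.1                                  # learning rate (illustrative)
n_iterations = 1000
theta = np.random.randn(2, 1)              # random initialization

for _ in range(n_iterations):
    # Gradient of the MSE cost: (2/m) * X^T (X*theta - y)
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients

print(theta)  # converges toward the Normal Equation solution
```

Each iteration touches the data only through matrix-vector products, so the cost per step scales linearly with the number of features, which is why gradient descent remains practical where the Normal Equation does not.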