L1 and L2 as Regularization for a Linear Model
Regularization is an effective way to reduce overfitting. A linear model can be regularized by penalizing its weights: simply speaking, regularization keeps the weights small so that the model cannot fit the training set perfectly. This is done by adding a regularization term to the cost function. In this post, we will look at two widely used regularizations: L1 regularization (which gives Lasso Regression) and L2 regularization (which gives Ridge Regression).
- L1 Regularization
- It uses the L1-norm of the weights as the regularization term.
- Cost function (Mean Squared Error in this case) + Regularization term: $latex J(w) = \mathrm{MSE}(w) + \lambda \sum_{i=1}^{n} |w_i|$
- L2 Regularization
- The squared L2-norm of the weights is added to the cost function.
- Cost function (Mean Squared Error in this case) + Regularization term: $latex J(w) = \mathrm{MSE}(w) + \lambda \sum_{i=1}^{n} w_i^{2}$
$latex \lambda$ is the hyperparameter that controls how strongly the model is regularized. If $latex \lambda$ is zero, the model is just plain linear regression. If $latex \lambda$ is very large, all weights end up close to zero and the model underfits.
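To make this concrete, here is a minimal scikit-learn sketch (the data and values are illustrative, not from the discussion above); scikit-learn's `alpha` parameter plays the role of $latex \lambda$:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy data: y depends on the first feature only; the second is irrelevant.
rng = np.random.RandomState(42)
X = rng.randn(100, 2)
y = 3 * X[:, 0] + 0.1 * rng.randn(100)

# L1 regularization (Lasso); alpha corresponds to lambda in the cost function.
lasso = Lasso(alpha=0.1).fit(X, y)

# L2 regularization (Ridge).
ridge = Ridge(alpha=0.1).fit(X, y)

print("Lasso weights:", lasso.coef_)
print("Ridge weights:", ridge.coef_)
```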
- Comparisons between L1 and L2 as Regularization
- Closed-form solution
- Ridge Regression (L2) can be trained either by computing a closed-form equation or by using Gradient Descent, whereas Lasso (L1) has no closed-form solution. The computational cost of the closed-form equation is linear with regard to the number of training instances, so it can handle large training sets efficiently as long as they fit in memory. When the number of features is large, however, the closed-form solution becomes slow because it requires inverting an n x n matrix, where n is the number of features. A NumPy sketch of the closed-form Ridge solution is shown below.
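As a rough sketch (not part of the original discussion), the closed-form Ridge solution $latex w = (X^{T}X + \lambda I)^{-1} X^{T} y$ takes only a few lines of NumPy; this version ignores the bias term and regularizes every weight:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form Ridge solution: w = (X^T X + lam * I)^(-1) X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solving the linear system is cheaper and more stable than explicitly inverting A.
    return np.linalg.solve(A, X.T @ y)

# Example usage on synthetic data.
rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(50)
print(ridge_closed_form(X, y, lam=1.0))
```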
- Feature selection
- L1 regularization tends to drive the weights of the least important features to exactly zero. It therefore acts as a form of feature selection and produces a sparse model (i.e., only a few features have non-zero weights), as the sketch below illustrates.
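A small illustrative sketch (synthetic data, assumed values): with ten features, only two of which matter, Lasso typically zeroes out the irrelevant weights, while Ridge keeps them small but non-zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# 10 features, but only the first two actually influence the target.
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 4 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("Non-zero Lasso weights:", np.sum(lasso.coef_ != 0))  # typically 2
print("Non-zero Ridge weights:", np.sum(ridge.coef_ != 0))  # typically all 10
```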