Softmax Regression

In a previous post, we looked at logistic regression for binary classification. In this post, we will see how logistic regression can be generalized to multiple classes. This model is called Softmax regression, or Multinomial logistic regression. First of all, softmax regression computes a softmax score of an instance for each class k. The following equation shows the softmax score function:

$latex s_{k}(\mathbf{x}) = \left(\theta^{(k)}\right)^{T}\mathbf{x}$

As you can see in the above equation, each class k has its own parameter vector $latex \theta^{(k)}$. The dot product between an instance's feature vector and the parameter vector of a class gives the score for how strongly the instance belongs to that class.
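
To make the score computation concrete, here is a minimal NumPy sketch; the names and shapes (Theta, x, K, n) are illustrative assumptions, not part of the original model description:

```python
import numpy as np

# Hypothetical setup: K = 3 classes, n = 4 features. Theta stacks one
# parameter vector per class as its rows; x is a single instance.
K, n = 3, 4
rng = np.random.default_rng(0)
Theta = rng.normal(size=(K, n))  # Theta[k] is the parameter vector of class k
x = rng.normal(size=n)           # feature vector of one instance

scores = Theta @ x               # s_k(x) = Theta[k] . x for every class k
print(scores.shape)              # (3,) -- one softmax score per class
```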

Now, let’s see how the softmax function uses the scores to compute the probability that the instance x belongs to class k. The following equation shows the softmax function:

$latex \hat{p}_{k} = \sigma\left(\mathbf{s}(\mathbf{x})\right)_{k} = \dfrac{\exp\left(s_{k}(\mathbf{x})\right)}{\sum_{j=1}^{K}\exp\left(s_{j}(\mathbf{x})\right)}$

It computes the exponential of the score for class k, and then normalizes it by dividing by the sum of the exponentials over all the classes, where K is the total number of classes. Softmax regression computes this probability for every class and predicts the class with the highest probability.
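
The following sketch implements the softmax function in NumPy and uses it to pick the most probable class; the score values are made up for illustration:

```python
import numpy as np

def softmax(scores):
    # Shift by the max score before exponentiating for numerical stability;
    # softmax is invariant to adding a constant to every score.
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])  # hypothetical scores for K = 3 classes
probs = softmax(scores)             # probabilities; they sum to 1
predicted = int(np.argmax(probs))   # the class with the highest probability
print(probs, predicted)             # ~[0.659 0.242 0.099] 0
```

The max-subtraction trick avoids overflow when scores are large; it changes nothing mathematically because the constant cancels in the ratio.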

How do we train the model, and what is its cost function?

The main idea of training is the same as for logistic regression: the model should output a high probability for the target class and low probabilities for the other classes. The following cross-entropy cost function captures this objective, and softmax regression can be trained by minimizing it:

$latex J(\Theta) = -\dfrac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_{k}^{(i)} \log\left(\hat{p}_{k}^{(i)}\right)$

where m is the number of instances, and $latex y_{k}^{(i)}$ is 1 if k is the target class of the i-th instance and 0 otherwise.
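
As a sketch, the cost can be computed from one-hot targets and predicted probabilities as below; the function name, shapes, and the small eps guard against log(0) are assumptions of this example:

```python
import numpy as np

def cross_entropy_cost(Y, P, eps=1e-12):
    # Y: one-hot targets, shape (m, K); P: predicted probabilities, shape (m, K).
    m = Y.shape[0]
    return -np.sum(Y * np.log(P + eps)) / m

Y = np.array([[1, 0, 0],
              [0, 1, 0]])          # targets for m = 2 instances, K = 3 classes
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2]])    # the model's predicted probabilities
print(cross_entropy_cost(Y, P))   # (-log 0.7 - log 0.5) / 2, about 0.525
```

In practice, scikit-learn's LogisticRegression fits this multinomial model directly (for example with the lbfgs solver), so the minimization rarely needs to be hand-coded.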