Logistic Regression

You might guess from its name that this algorithm is only for regression problems. However, Logistic Regression can also be used for classification problems. Logistic Regression estimates the probability that an instance belongs to a particular class. For example, what is the estimated probability that this email is spam (classes: spam and non-spam)? What is the probability that this candidate will win an election (classes: win and lose)? What is the probability that this user will click a promotion link in an online shopping email (classes: click and not click)?

It is straightforward to see how the estimated probability is used for classification. If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class (called the positive class); otherwise, it predicts that the instance belongs to the other class (called the negative class).

How to estimate probabilities?

The Logistic Regression model looks like Linear Regression, except that the logistic function $latex \sigma(\cdot)$ is applied to the result of the linear regression. The following figure shows the logistic function:
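Written out, the model's estimated probability and the logistic (sigmoid) function take the standard form below; the symbols $latex \hat{p}$ and $latex \mathbf{x}$ are introduced here for clarity and may differ slightly from the notation in the figure:

$latex \hat{p} = \sigma\left(\theta^{T} \mathbf{x}\right)$, where $latex \sigma(t) = \dfrac{1}{1 + e^{-t}}$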

When the input value t of the function is positive, the output is greater than 0.5; when t is negative, the output is smaller than 0.5. The output curve of the logistic function is S-shaped. The logistic regression model makes predictions as follows:

If the estimated probability is smaller than 50%, the prediction is that the instance belongs to class 0 (called the negative class); otherwise, it belongs to class 1 (called the positive class). This makes the model a binary classifier.
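Written as a decision rule (standard form, using the $latex \hat{p}$ notation from above):

$latex \hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5 \\ 1 & \text{if } \hat{p} \geq 0.5 \end{cases}$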

How to train the model, and what is the cost function?

The logistic regression model is trained to estimate high probabilities (close to 1) for the instances that belong to the positive class and low probabilities (close to 0) for the instances that belong to the negative class by tweaking the parameters $latex \theta$. The following cost function (log loss) reflects this training idea:
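Stated as an equation over a training set of m instances, the log loss takes its standard form (reconstructed here from the description above, with $latex \hat{p}^{(i)}$ denoting the estimated probability for the i-th instance):

$latex J(\theta) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{p}^{(i)}\right) \right]$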

$latex y^{(i)}$ is the target class (0 or 1) of the i-th instance. When $latex y$ is 1 and the estimated probability is close to 0, the loss for that instance becomes large, and the average cost increases. Likewise, when $latex y$ is 0 and the estimated probability is close to 1, the loss becomes large and the average cost also increases. This cost function is convex, so there is no local minimum other than the global minimum, and you can optimize the parameters by finding that global minimum. The following equation gives the partial derivative of the cost function with respect to the j-th model parameter $latex \theta_{j}$:
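In its standard form, this partial derivative is:

$latex \dfrac{\partial}{\partial \theta_{j}} J(\theta) = \dfrac{1}{m} \sum_{i=1}^{m} \left( \sigma\left(\theta^{T} \mathbf{x}^{(i)}\right) - y^{(i)} \right) x_{j}^{(i)}$

To make the training loop concrete, here is a minimal NumPy sketch of batch gradient descent on the log loss, using the gradient above. The toy data, the learning rate eta, and the iteration count are illustrative assumptions, not taken from this post:

import numpy as np

# Toy data: one feature, binary labels (illustrative only)
X = np.array([[0.5], [1.5], [2.0], [3.0], [3.5], [4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# Add the bias term x0 = 1 to every instance
X_b = np.c_[np.ones((len(X), 1)), X]

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

eta = 0.1            # learning rate (assumed value)
n_iterations = 1000  # assumed value
m = len(X_b)
theta = np.zeros(X_b.shape[1])

for _ in range(n_iterations):
    p_hat = sigmoid(X_b @ theta)                  # estimated probabilities
    gradients = (1.0 / m) * X_b.T @ (p_hat - y)   # the partial derivatives above
    theta = theta - eta * gradients

# Predict class 1 when the estimated probability is at least 0.5
p_hat = sigmoid(X_b @ theta)
print("theta:", theta)
print("predictions:", (p_hat >= 0.5).astype(int))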