6 Approaches to Deal with Imbalanced Classes
Have you seen this problem in your dataset? Imbalanced classes are very common in classification problems. Imagine that you build a classification model and get 98% accuracy. Can you say your model is good because its accuracy is so high? What if around 98% of the data belongs to a single class? You could reach 98% accuracy without building any model at all, simply by predicting that every test case belongs to that class.
For example, in a cancer dataset, 98% of tumors might be benign and only 2% malignant. If all we wanted were high accuracy, we could simply declare that no patient has cancer. In a house rental dataset, only 2% of tenants might have been evicted; we could make seemingly good predictions just by saying that no tenant will be evicted.
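To see this accuracy trap concretely, here is a minimal sketch using scikit-learn's DummyClassifier on a synthetic 98/2 dataset (the data is made up for illustration):

```python
# A minimal illustration of the accuracy trap: a "model" that always
# predicts the majority class still scores ~98% accuracy.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic dataset where ~98% of instances belong to class 0.
X, y = make_classification(n_samples=1000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print(baseline.score(X_te, y_te))  # accuracy close to 0.98, with no learning at all
```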
The imbalanced classes problem occurs in multi-class classification as well as binary classification, and most of the techniques below can be used for both.
Approaches to handling imbalanced classes
1) Collecting more data
In many cases, collecting data can be quite expensive. However, that doesn't mean this option should be ignored. If you are able to get more examples of the minority classes, this approach can be very helpful in solving the class imbalance problem.
2) Choosing the right performance metrics
Selecting appropriate performance measures for your model is important to get more insight into it. In the case of imbalanced classes, the following metrics can be applied (a short sketch computing them follows the list):
- Precision: accuracy of the positive predictions
- TP / (TP + FP)
- Recall: ratio of positive instances that are correctly detected by the classifier
- TP / (TP + FN)
- F1 score: harmonic mean of precision and recall
- 2 x (precision x recall) / (precision + recall)
- AUC: area under the ROC curve, which plots the true positive rate against the false positive rate
- Cohen’s kappa: classification accuracy normalized by the imbalance of the classes
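As a quick illustration, the following sketch computes these metrics with scikit-learn; the toy labels and scores are made up for demonstration:

```python
# A minimal sketch of the metrics above, using scikit-learn.
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, cohen_kappa_score)

y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]                      # imbalanced ground truth
y_pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]                      # hard predictions
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.4]  # predicted probabilities

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score: ", f1_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))    # uses scores, not hard labels
print("Kappa:    ", cohen_kappa_score(y_true, y_pred))
```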
3) Resampling the training set
You can make your data more balanced by resampling the dataset. There are two main approaches, under-sampling and over-sampling, compared below (a short resampling sketch follows the list).
- Under-sampling
- Under-sampling decreases the number of instances of the majority class by deleting some of them.
- Consider this approach when you have a lot of data.
- Advantages
- Improved run-time of the model
- Reduced storage for the training set
- Disadvantages
- Information loss
- Over-sampling
- Over-sampling increases the number of instances of the minority class by adding copies of them.
- Consider this approach when you don’t have a lot of data.
- Advantages
- No information loss
- Often outperforms under-sampling
- Disadvantages
- Over-sampling can lead to model overfitting since it duplicates the instances of the minority class.
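To make the two approaches concrete, here is a minimal sketch using the imbalanced-learn package (an assumption on my part; the post does not name a library), on a synthetic 98/2 dataset:

```python
# A minimal resampling sketch, assuming the imbalanced-learn package.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Illustrative dataset where ~98% of instances belong to class 0.
X, y = make_classification(n_samples=1000, weights=[0.98, 0.02], random_state=0)
print("Original:      ", Counter(y))

# Over-sampling: duplicate minority instances up to the majority size.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print("Over-sampled:  ", Counter(y_over))

# Under-sampling: drop majority instances down to the minority size.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("Under-sampled: ", Counter(y_under))
```

Note that resampling should be applied to the training set only; the test set must keep the original class distribution so that evaluation stays honest.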
4) Synthetic minority over-sampling
Generating synthetic samples is another way to avoid the overfitting problem that occurs when exact copies of minority instances are added. The most popular algorithm is SMOTE (Synthetic Minority Over-sampling Technique). It selects minority instances and creates new synthetic instances by interpolating between each one and its nearest minority-class neighbors (see the sketch after the list below).
- Advantages
- Helps avoid the overfitting caused by over-sampling with duplicates
- No information loss
- Disadvantages
- SMOTE works well for low-dimensional data, but it is not very effective for high-dimensional data
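Here is a minimal SMOTE sketch, again assuming the imbalanced-learn package; the synthetic data stands in for your real training split:

```python
# A minimal SMOTE sketch, assuming the imbalanced-learn package.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Illustrative 98/2 data; in practice, apply SMOTE to the training split only.
X_train, y_train = make_classification(n_samples=1000, weights=[0.98, 0.02],
                                       random_state=0)

# Each synthetic instance is interpolated between a minority instance and
# one of its k nearest minority-class neighbors.
smote = SMOTE(k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_res))
```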
5) Ensemble techniques
So far, we have looked at approaches that resample the original data. An alternative is to use ensemble methods such as bagging, boosting, and gradient boosting. For example, if the minority class has M instances, you can train N models, each on all M minority instances plus a different randomly drawn set of M majority instances, so that every model is trained on balanced classes.
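A hand-rolled sketch of this idea follows; the function names, the use of decision trees, and majority voting are illustrative assumptions, not a prescribed recipe:

```python
# A minimal sketch of the balanced-ensemble idea: train N models, each on
# all M minority instances plus a fresh random sample of M majority
# instances, then combine them by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_balanced_ensemble(X, y, n_models=10, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    m = len(minority)
    models = []
    for _ in range(n_models):
        # Each model sees a balanced subset: all minority instances plus
        # m majority instances sampled without replacement.
        subset = np.concatenate([minority,
                                 rng.choice(majority, size=m, replace=False)])
        models.append(DecisionTreeClassifier(random_state=0).fit(X[subset], y[subset]))
    return models

def predict_balanced_ensemble(models, X):
    # Average the hard predictions and threshold at 0.5 (majority vote).
    votes = np.mean([model.predict(X) for model in models], axis=0)
    return (votes >= 0.5).astype(int)

X, y = make_classification(n_samples=1000, weights=[0.98, 0.02], random_state=0)
preds = predict_balanced_ensemble(fit_balanced_ensemble(X, y), X)
```

If you prefer a packaged version, imbalanced-learn provides a BalancedBaggingClassifier that implements a similar scheme.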
6) Penalizing models
By modifying the cost function, you can mitigate the imbalanced classes problem. You can penalize a model by adding an extra cost for misclassifying instances of the minority class, which pushes the model to pay more attention to them. For example, a penalized SVM can be used for this problem.
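As one concrete option, scikit-learn's SVC exposes a class_weight parameter that adds exactly this kind of penalty; the 1:10 ratio below is an illustrative assumption:

```python
# A minimal sketch of a penalized (cost-sensitive) SVM via class_weight.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Illustrative 98/2 data.
X, y = make_classification(n_samples=1000, weights=[0.98, 0.02], random_state=0)

# Misclassifying a minority (class 1) instance costs 10x as much as
# misclassifying a majority (class 0) instance.
clf = SVC(kernel="rbf", class_weight={0: 1, 1: 10}).fit(X, y)

# Alternatively, class_weight="balanced" sets weights inversely
# proportional to class frequencies.
```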
Conclusion
We have looked at several useful techniques for dealing with imbalanced classes. Of course, there are more approaches than the ones covered in this post, but those listed here are a good starting point. In many cases, you will have to be more creative than simply applying a single approach. Why don't you try different approaches and find the best one for your model and data? How about trying different algorithms? For example, decision trees often work well on imbalanced classes. Why don't you combine different approaches?
Good luck!