148 - 7 techniques to work with imbalanced data for machine learning in Python
Imbalanced data is part of life! With proper knowledge of the data set and a few techniques from this video, imbalanced data can be managed easily.
Prerequisites: Pick the right metrics, as overall accuracy does not tell you how accurately each individual class is predicted. Look at the confusion matrix and ROC AUC.
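To illustrate why overall accuracy can mislead, here is a minimal sketch using scikit-learn. The 95/5 toy labels and the "always predict the majority class" model are assumptions made up for illustration; they are not taken from the video.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical toy labels: 95 majority samples (0) and 5 minority samples (1).
y_true = np.array([0] * 95 + [1] * 5)

# A dummy "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)
# Its score for the positive class is the same constant for every sample.
y_score = np.zeros(len(y_true), dtype=float)

acc = accuracy_score(y_true, y_pred)    # looks high despite a useless model
cm = confusion_matrix(y_true, y_pred)   # rows = true class, columns = predicted class
auc = roc_auc_score(y_true, y_score)    # constant scores give no ranking ability

# Recall of the minority class: true positives / all actual positives.
minority_recall = cm[1, 1] / cm[1].sum()

print("accuracy:", acc)
print("confusion matrix:\n", cm)
print("ROC AUC:", auc)
print("minority recall:", minority_recall)
```

Accuracy comes out at 0.95, yet the second row of the confusion matrix shows that none of the 5 minority samples is detected, and the ROC AUC of 0.5 is no better than chance.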
Technique 0: Collect more data, if possible.
Technique 1: Pick decision-tree-based approaches, as they often work better than logistic regression or SVM on imbalanced data. Random Forest is a good algorithm to try, but beware of overfitting.
Technique 2: Up-sample the minority class.
Technique 3: Down-sample the majority class.
Technique 4: Combine over-sampling and under-sampling.
Technique 5: Use penalized learning algorithms that increase the cost of classification mistakes on the minority class.
Technique 6: Generate synthetic data (SMOTE, ADASYN)
Technique 7: Add appropriate weights to your deep learning model.
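Techniques 2 to 5 can be sketched with scikit-learn alone. The example below is a minimal illustration, not the code from the video: the 900/100 toy data set, the target size of 500 for the combined approach, and all variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.RandomState(42)

# Hypothetical toy data: 900 majority samples (class 0), 100 minority samples (class 1).
X_major = rng.normal(loc=0.0, scale=1.0, size=(900, 2))
X_minor = rng.normal(loc=2.0, scale=1.0, size=(100, 2))

# Technique 2: up-sample the minority class (with replacement) to the majority size.
X_minor_up = resample(X_minor, replace=True, n_samples=len(X_major), random_state=42)

# Technique 3: down-sample the majority class (without replacement) to the minority size.
X_major_down = resample(X_major, replace=False, n_samples=len(X_minor), random_state=42)

# Technique 4: combine both and meet at an intermediate size (500 is an arbitrary choice).
n_mid = 500
X_major_mid = resample(X_major, replace=False, n_samples=n_mid, random_state=42)
X_minor_mid = resample(X_minor, replace=True, n_samples=n_mid, random_state=42)

# Technique 5: keep the data as is and penalize minority mistakes more heavily.
X = np.vstack([X_major, X_minor])
y = np.array([0] * len(X_major) + [1] * len(X_minor))

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = {0: weights[0], 1: weights[1]}

# The same weighting applied inside a classifier.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

print("up-sampled minority:", X_minor_up.shape)
print("down-sampled majority:", X_major_down.shape)
print("class weights:", class_weight)
```

Resampling should be applied to the training split only, so that duplicated samples do not leak into the test set. For Technique 7, a dictionary like `class_weight` above can typically be passed to the `class_weight` argument of a Keras model's `fit` method. Synthetic samplers such as SMOTE and ADASYN (Technique 6) are provided by the imbalanced-learn package linked in the references.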
References:
https://imbalanced-learn.org/stable/over_sampling.html?highlight=smote
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
Code generated in the video can be downloaded from here: https://github.com/bnsreenu/python_for_microscopists