
Machine learning

 

Class Imbalance:

An imbalanced dataset is one in which the classes are not equally represented.

Machine learning models, especially those optimized for accuracy, may perform poorly when one class has far fewer examples than the other.

When this happens, the model may simply predict the majority class every time, which yields a high accuracy score but poor overall performance.
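To make this concrete, here is a toy illustration (the 90/10 split is invented for demonstration): a classifier that always predicts the majority class still scores 90% accuracy while never detecting a single minority example.

```python
import numpy as np

y_true = np.array([0] * 90 + [1] * 10)   # 90% majority, 10% minority
y_pred = np.zeros_like(y_true)           # always predict the majority class

accuracy = (y_pred == y_true).mean()     # 0.9, despite zero minority recall
```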

Imbalanced data can be addressed in several ways, including:

  • Undersampling: Removing samples from the majority class so that the dataset becomes more balanced.

  • Oversampling: Adding new synthetic or duplicated samples to the minority class to balance the dataset.

  • Class weight balancing: Giving the minority class a higher weight during model training, so that its misclassifications cost more.

  • Ensemble techniques: Using ensemble approaches such as bagging or boosting to build a more robust classifier.
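As an illustration of class weight balancing, scikit-learn estimators accept a class_weight parameter. The sketch below uses an invented 90/10 dataset; compute_class_weight() shows the per-class weights that class_weight="balanced" applies internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Invented imbalanced dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# "balanced" scales each class's loss contribution inversely to its
# frequency, so errors on the rare class cost more during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# The weights "balanced" would assign: the minority class gets the larger one.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
```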

 

 

Codeblock E.1. Class imbalance demonstration.

 

In this example, we first created an imbalanced dummy dataset, with 90% of samples in the majority class and 10% in the minority class, using Scikit-learn's make_classification() function.

The data was then split into training and test sets with the train_test_split() function.

We then fitted a logistic regression model to the original, imbalanced data and evaluated its performance on the test set with the classification_report() function.

To remedy the class imbalance, the majority class in the training data was then randomly undersampled using the RandomUnderSampler class from the imblearn library.

A new logistic regression model was then trained on the undersampled data, and its performance on the test set was again assessed with the classification_report() function.

Comparing the classification reports for the original imbalanced data and the undersampled data, we can see that the model trained on the undersampled data performs better on the minority class (higher recall and F1-score) at the expense of somewhat worse performance on the majority class (lower precision and F1-score).

 

Download the ipynb files used here.

 

 

---- Summary ----

You now know the basics of handling class imbalance:

  • Class imbalance refers to a situation where the number of observations in one class is significantly lower than the number of observations in another class.

  • Imbalanced data can lead to a biased model that predicts the majority class more accurately, while the minority class is predicted poorly.

  • There are several techniques to deal with class imbalance, including resampling methods (undersampling and oversampling), algorithmic methods (cost-sensitive learning and threshold-moving), and hybrid methods (a combination of resampling and algorithmic methods).

  • Undersampling involves randomly removing examples from the majority class, while oversampling involves replicating examples from the minority class.

  • Cost-sensitive learning adjusts the algorithm's misclassification costs to give more importance to the minority class.

  • Threshold-moving adjusts the probability threshold of the classification algorithm to increase its sensitivity to the minority class.

  • The appropriate method to use depends on the specific problem and dataset, and experimentation may be needed to find the most effective approach.

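The threshold-moving idea summarized above can be sketched as follows (the dataset and the lowered threshold value of 0.2 are invented for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Invented 90/10 imbalanced dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]            # probability of the minority class
preds_default = (proba >= 0.5).astype(int)    # standard threshold
preds_moved = (proba >= 0.2).astype(int)      # lowered threshold

# Lowering the threshold flags more samples as the minority class,
# raising recall at the cost of precision.
```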



________________________________________________________________________________________________________________________________

Copyright © 2022-2023. Anoop Johny. All Rights Reserved.