Curse of imbalanced data

Ayoub_Ali
7 min read · Dec 11, 2022

In classification tasks, machine learning models use the training data to learn the relationships and hidden patterns among the features of each class. For that to happen, a sufficient number of samples of each class is needed, but this is seldom the case in the real world. When a dataset contains many more samples of one class (or several classes) than of the others, the model has a hard time learning the task and generalizing from the data. In that case, the data is called imbalanced.

The class(es) with a much larger number of examples (samples) are called the Majority Class(es), whereas the class(es) with far fewer samples are called the Minority Class(es).

Why is it a curse?


For a machine learning model to classify a data point into its correct class, it needs to be trained on enough examples of that class, as well as of the other classes in the dataset, before it can make the right prediction. If the model is trained on many examples of one particular class but too few examples of the others, it tends to overfit to that class and performs poorly on unseen data.

Because imbalanced datasets are biased towards a particular class, they can cause the model to be overtrained on that class (the majority class); as a result, the model has difficulty generalizing to the other classes.

Additionally, imbalanced data can make it difficult for the model to learn the underlying rules and patterns in the data, due to a lack of sufficient examples, leading to poor performance on both the training data and unseen data. This can be particularly problematic for classification tasks, where accurate prediction of all classes is important.

Overall, imbalanced data can hinder the ability of a machine learning model to learn a task and make accurate predictions, making it a curse in the context of machine learning.

How does data become imbalanced?


Data can become imbalanced for a number of reasons.

One common reason for imbalanced data is the inherent nature of the phenomenon being studied. For example, if you are studying medical data and trying to identify patients with a rare disease, it is likely that the number of patients with the disease will be much smaller than the number of patients without the disease. This can create imbalanced data where the majority of examples in the dataset are healthy patients, and only a small number are patients with the rare disease.

Another reason for imbalanced data can be the way in which the data is collected. For example, if the data is only collected from a certain subset of the population, such as only from urban areas, then the data may not be representative of the overall population. This can lead to imbalanced data where certain classes are underrepresented.

Imbalanced data can also occur if the data is labeled incorrectly, or if the labels are not applied uniformly across all examples in the dataset. For example, if a dataset is supposed to contain equal numbers of examples from each class, but the labels are applied incorrectly, then the data will be imbalanced.

When is data considered to be Imbalanced?


Data is considered to be imbalanced when the number of examples belonging to one class significantly outnumbers the examples belonging to the other classes.

The degree of imbalance in a dataset can vary depending on the percentage of examples belonging to each class.

In general, a dataset is considered to be imbalanced if the ratio of examples belonging to the majority class to the examples belonging to the minority class is greater than 10:1. For example, if a dataset contains 100 examples belonging to the majority class and only 10 examples belonging to the minority class, then the dataset would be considered to be imbalanced.

(Figure source: developers.google.com)
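A quick way to check this in practice is to count the examples per class and compute their ratio. A minimal sketch, assuming the labels are in a pandas Series; the column name "target" and the toy counts are placeholders:

```python
import pandas as pd

def imbalance_ratio(labels: pd.Series) -> float:
    """Ratio of majority-class count to minority-class count."""
    counts = labels.value_counts()
    return counts.max() / counts.min()

# Toy label distribution: 100 majority samples vs. 10 minority samples.
y = pd.Series([0] * 100 + [1] * 10, name="target")
print(y.value_counts())    # 0 -> 100, 1 -> 10
print(imbalance_ratio(y))  # 10.0, i.e. right at the rough 10:1 threshold
```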

Ways to lift the curse


There are several techniques that can be used to deal with imbalanced datasets. Some common techniques include:

  1. Collecting more data: In some cases, imbalanced datasets can be addressed by simply collecting more data. This can help to balance the dataset and improve the performance of the machine learning model.
  2. Resampling the data: Another common technique for dealing with imbalanced data is to resample the dataset. This involves either oversampling the minority class to create more examples of that class, or undersampling the majority class to reduce the number of examples in that class. This can help to balance the dataset and improve the performance of the model.
  3. Using weighted algorithms: Some machine learning algorithms can be trained with class weights, so that the algorithm gives more importance to the minority class and less to the majority class. This can help to compensate for the imbalance and improve the performance of the model (see the sketch after this list).
  4. Using algorithms that are robust to imbalanced data: Some algorithms are specifically designed to be robust to imbalanced data. These algorithms can be used to train a machine learning model on an imbalanced dataset without the need for resampling or weighting.
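As an example of technique 3, most scikit-learn classifiers accept a class_weight parameter. A minimal sketch, assuming a binary problem; the synthetic dataset from make_classification and the 90/10 split are illustrative choices, not something from the article:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, deliberately imbalanced data: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=42)

# class_weight="balanced" reweights each class inversely to its frequency,
# so mistakes on the minority class cost more during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1_000)
clf.fit(X, y)
```

The same parameter is available on many other estimators, such as decision trees, random forests, and support vector machines.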

Oversampling (increase the number of samples)

Oversampling is a general term that refers to any method of increasing the representation of the minority class in a dataset. This includes simple upsampling as well as more sophisticated methods that synthesize new samples to bring the data into a more balanced state.


Oversampling the minority class involves creating new examples of that class, based on the original dataset, in order to balance the class distribution and improve the performance of the machine learning model.

Upsampling refers specifically to the process of randomly duplicating existing minority-class samples to increase their representation in the dataset. This is a simple and straightforward method that can be effective in some cases, but it can also introduce noise and bias into the dataset.
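A minimal sketch of plain upsampling by random duplication, assuming the data sits in a pandas DataFrame with a hypothetical label column named "target":

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "feature": range(110),
    "target":  [0] * 100 + [1] * 10,  # 100 majority rows vs. 10 minority rows
})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Duplicate minority rows (sampling with replacement) until the classes match.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["target"].value_counts())  # both classes now have 100 rows
```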

Types of oversampling

  1. Synthetic Minority Oversampling Technique (SMOTE): This is a more sophisticated method that creates new, synthetic samples of the minority class, rather than simply duplicating existing ones (see the sketch after this list).
  2. Adaptive Synthetic Sampling (ADASYN): This is a variant of SMOTE that adaptively generates more synthetic samples for the minority examples that are harder to learn, i.e., those surrounded mostly by majority-class neighbours.
  3. Borderline-SMOTE: This method is similar to SMOTE, but it focuses on creating synthetic samples along the borderline between the minority and majority classes.
  4. Informed Oversampling: This is a more recent method that uses classifier performance to guide the oversampling process, in order to create a more balanced and effective training dataset.
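The first three of these are available in the imbalanced-learn package (pip install imbalanced-learn). A minimal sketch on a synthetic dataset, which is an illustrative choice rather than something from the article:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# Each sampler synthesizes new minority-class points instead of duplicating rows.
for sampler in (SMOTE(random_state=42),
                ADASYN(random_state=42),
                BorderlineSMOTE(random_state=42)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, "after:", Counter(y_res))
```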

Undersampling (reduce the number of samples)

Undersampling is the process of reducing the size of a majority class in a dataset in order to balance the class distribution. This can help to improve the performance of a machine learning model by providing a more balanced and representative training dataset. However, undersampling can also discard potentially useful information, so it should be used carefully.


Downsampling and undersampling are two terms that are often used interchangeably in the context of machine learning, and both refer to reducing the size of the majority class in a dataset. However, there is a subtle difference between the two.

Downsampling usually refers specifically to randomly selecting a subset of the majority-class examples from the original dataset and training the model on that subset only, whereas undersampling is the broader term that also covers more informed selection methods, such as those described below.

Undersampling types

  • Random undersampling: It is done by randomly selecting a subset of the majority class examples in the original dataset and removing the rest. This might balance the dataset, yet it can introduce bias if the selected subset is not representative of the overall majority class.
  • Cluster-based undersampling: It starts by dividing the majority-class examples into clusters and then selecting a subset of the clusters (or representatives from each cluster) to create the undersampled dataset. This way, more of the structure of the original dataset is preserved, which reduces the risk of bias.
  • Tomek link undersampling: This method identifies pairs of examples, one from the majority class and one from the minority class, that are each other's nearest neighbours (Tomek links), using a distance measure. Typically, only the majority-class example of each pair is then removed, which cleans up the class boundary and results in a more balanced dataset (see the sketch after this list).
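Random and Tomek-link undersampling are also available in imbalanced-learn. A minimal sketch on the same kind of synthetic dataset used above (an illustrative choice):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# Random undersampling: keep a random subset of the majority class.
X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("random undersampling:", Counter(y_rus))

# Tomek links: drop the majority-class member of each cross-class nearest-
# neighbour pair, cleaning the boundary (the classes stay unequal afterwards).
X_tl, y_tl = TomekLinks().fit_resample(X, y)
print("tomek links:", Counter(y_tl))
```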

Overall, resampling (oversampling and undersampling) is a useful technique for dealing with imbalanced datasets. It can help to balance the dataset and improve the performance of the machine learning model, without the need for complex algorithms or specialized techniques. However, it is important to use resampling carefully, as it can introduce bias into the dataset if not done properly.

Overall, imbalanced data can be caused by a variety of factors, including the inherent nature of the phenomenon being studied and the way in which the data is collected and labeled. It is important to carefully consider the issue of imbalanced data when working with machine learning algorithms, as it can impact the performance of the algorithm and the accuracy of the results.
