General information
Performs data resampling to obtain a more balanced dataset.
An imbalanced dataset is one in which the target variable has significantly more observations in one class than in the others.
The problem is that models trained on imbalanced datasets often generalize poorly. If the algorithm receives significantly more examples from one class, it becomes biased towards that class and prone to overfitting the majority class. Simply by predicting the majority class, a model can achieve a deceptively good score while learning nothing about the minority class.
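As a quick illustration of this effect, here is a standalone Python sketch (not part of the brick): a "model" that always predicts the majority class scores about 99% accuracy on a 99:1 dataset while detecting no minority-class observations at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = rng.choice([0, 1], size=10_000, p=[0.99, 0.01])  # ~1% minority class
y_pred = np.zeros_like(y_true)                            # always predict '0'

print(accuracy_score(y_true, y_pred))  # ~0.99, despite learning nothing
print(recall_score(y_true, y_pred))    # 0.0 -- not a single '1' detected
```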
Description
Brick Locations
Bricks → Machine Learning → Balanced Sampling
Brick Parameters
- Target
The column of the dataset that contains the expected classes. Resampling is performed with respect to this column.
- Number of samples per class
Expected number of observations for each class in the resulting dataset.
Brick Inputs/Outputs
- Inputs
The brick takes a dataset to resample.
- Outputs
The brick produces the resampled dataset.
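For reference, the resampling the brick performs can be approximated in a few lines of pandas. This is a hedged sketch of the idea (per-class random sampling with replacement), not the brick's actual implementation; the helper name `balanced_sample` is ours.

```python
import pandas as pd

def balanced_sample(df: pd.DataFrame, target: str, n_per_class: int,
                    random_state: int = 42) -> pd.DataFrame:
    """Draw n_per_class rows for each class of `target`, with replacement."""
    return (
        df.groupby(target, group_keys=False)
          .apply(lambda g: g.sample(n=n_per_class, replace=True,
                                    random_state=random_state))
          .reset_index(drop=True)
    )
```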
Example of usage
Let’s apply the Balanced Sampling Brick to a Credit Card Fraud Detection dataset.
To get the class counts of the target variable, we can use the Aggregate Data Brick with the following settings:
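Outside the platform, the same counts can be obtained with plain pandas (the file name `creditcard.csv` is an assumption for illustration):

```python
import pandas as pd

df = pd.read_csv("creditcard.csv")   # hypothetical file name for the dataset
print(df["Class"].value_counts())    # class '0' vastly outnumbers class '1'
```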
The results show a significant imbalance towards the ‘0’ class, which is expected, since fraudulent transactions are far less frequent than normal ones.
We can use the Balanced Sampling Brick to resample the dataset with respect to the ‘Class’ column, setting ‘Number of samples per class’ to 5000.
As a result, we get a dataset of 10,000 rows, created by random sampling with replacement. This means that some records may appear more than once.
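For illustration, the equivalent operation using the `balanced_sample` sketch above, plus a quick check that sampling with replacement does repeat rows:

```python
resampled = balanced_sample(df, target="Class", n_per_class=5000)

print(len(resampled))                # 10000: 5000 rows per class
print(resampled.duplicated().sum())  # likely > 0: replacement repeats rows
```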
If we visualize the distribution of the ‘Class’ variable counts (Pivot Table Brick), we get the following dashboard:
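Outside the platform, the same check is a short pandas plot (assuming matplotlib is available):

```python
import matplotlib.pyplot as plt

resampled["Class"].value_counts().plot(kind="bar")
plt.title("Class distribution after balanced sampling")
plt.show()
```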
If we trained a binary classification model (Logistic Regression in this example) on the initial, highly imbalanced dataset, it would predict nearly every observation as the majority class ‘0’, achieving deceptively high scores on the training data but performing poorly on the test data.
However, if we balance the dataset before training, the model becomes much better at distinguishing the rare class ‘1’ and, consequently, generalizes better.
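To make the comparison concrete, here is a minimal sketch of that experiment with scikit-learn, reusing the hypothetical `balanced_sample` helper from above. The test split is taken before resampling, so the evaluation stays unbiased:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X = df.drop(columns="Class")
y = df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: trained on the imbalanced data as-is.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Balanced: resample only the training split, then train.
train = X_train.assign(Class=y_train)
balanced = balanced_sample(train, target="Class", n_per_class=5000)
model = LogisticRegression(max_iter=1000).fit(
    balanced.drop(columns="Class"), balanced["Class"]
)

# Compare performance on the untouched, still-imbalanced test split;
# recall on the rare class '1' is the number to watch.
print(classification_report(y_test, baseline.predict(X_test)))
print(classification_report(y_test, model.predict(X_test)))
```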