Using Crucio SMOTETOMEK for balancing data
This article is about a combination of two widely used algorithms for balancing imbalanced datasets: SMOTE, which oversamples the minority class, and Tomek Links, which undersamples the majority class.
I covered SMOTE in my previous article, so check that out first, because here I will focus more on the Tomek Links algorithm.
Let's start with an explanation: a Tomek link exists when two instances of opposite classes are each other's nearest neighbors. In other words, Tomek links are pairs of instances from different classes that lie very close together.
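To make the definition concrete, here is a minimal pure-Python sketch of finding Tomek links. It is only an illustration under my own assumptions (Euclidean distance, a brute-force neighbor search, and the hypothetical helper names `nearest_neighbor` and `find_tomek_links`); it is not how Crucio implements the search internally.

```python
import math

def nearest_neighbor(i, points):
    """Return the index of the point closest to points[i], excluding itself."""
    best_j, best_d = None, math.inf
    for j, p in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], p)  # Euclidean distance (assumption)
        if d < best_d:
            best_j, best_d = j, d
    return best_j

def find_tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbors and belong to different classes."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    links = []
    for i, j in enumerate(nn):
        if i < j and nn[j] == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

# Toy data: only the first two points are close together AND of opposite classes.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0), (5.0, 5.0), (5.1, 5.0)]
labels = [0, 1, 0, 0, 1, 1]

links = find_tomek_links(points, labels)
print(links)
```

Pairs such as the third and fourth points are also mutual nearest neighbors, but they share a class, so they do not form a Tomek link.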
We will use Tomek links to remove examples from the majority class that are too close to examples from the minority class.
Then we simply apply SMOTE to oversample minority class examples.
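The two steps can be sketched in plain Python as follows. This is a simplified illustration, not Crucio's actual code: the function names `drop_majority_from_links` and `smote_oversample` are hypothetical, the Tomek links are passed in as an already computed list of index pairs, and the SMOTE step uses the basic "interpolate between a minority point and one of its k nearest minority neighbors" idea.

```python
import math
import random

def drop_majority_from_links(points, labels, links, majority_label):
    """Remove the majority-class member of every Tomek link."""
    to_drop = {i for pair in links for i in pair if labels[i] == majority_label}
    kept = [(p, l) for i, (p, l) in enumerate(zip(points, labels)) if i not in to_drop]
    return [p for p, _ in kept], [l for _, l in kept]

def smote_oversample(minority, n_new, k=5, seed=45):
    """Create n_new synthetic points, each lying on the segment between a
    random minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Toy data: class 0 is the majority, class 1 the minority.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0),
          (2.0, 2.0), (2.2, 2.1), (5.0, 5.0), (5.1, 5.0)]
labels = [0, 1, 0, 0, 0, 0, 1, 1]
links = [(0, 1)]  # assumed to be precomputed: points 0 and 1 form a Tomek link

majority_label, minority_label = 0, 1

# Step 1: undersample by dropping majority points that sit in a Tomek link.
points, labels = drop_majority_from_links(points, labels, links, majority_label)

# Step 2: oversample the minority class until both classes have the same size.
minority = [p for p, l in zip(points, labels) if l == minority_label]
n_new = labels.count(majority_label) - len(minority)
synthetic = smote_oversample(minority, n_new, k=2)

points += synthetic
labels += [minority_label] * len(synthetic)
print(labels.count(majority_label), labels.count(minority_label))
```

Because every synthetic point is an interpolation between two real minority points, the new examples stay inside the region already occupied by the minority class.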
Using SMOTETOMEK from Crucio
If you haven’t installed Crucio yet, just type the following in the command line:
pip install crucio
Now we import the algorithm and use it:
from crucio import SMOTETOMEK
# df is a pandas DataFrame; 'target' is the name of its class column
tomek = SMOTETOMEK()
new_df = tomek.balance(df, 'target')
The SMOTETOMEK() constructor accepts the following arguments:
- k (int > 0, default = 5) : The number of nearest neighbors from which SMOTE will sample data points.
- seed (int, default = 45): The number used to initialize the random number generator.
- binary_columns (list, default = None): The list of binary columns from the data set, so sampled data is approximated to the nearest binary value.
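To illustrate what the binary_columns argument is for, here is a small sketch of how an interpolated sample could be snapped back to valid binary values. The helper `interpolate_row`, the column names, and the 0.5 threshold are my own assumptions for illustration; the source only says that values are approximated to the nearest binary value, so Crucio's exact rounding rule may differ.

```python
def interpolate_row(base, neighbor, gap, binary_columns=None):
    """Interpolate between two rows (dicts of column -> number).
    Columns listed in binary_columns are snapped to 0 or 1."""
    binary_columns = binary_columns or []
    row = {}
    for col in base:
        value = base[col] + gap * (neighbor[col] - base[col])
        if col in binary_columns:
            # Nearest binary value; treating exactly 0.5 as 1 is an assumption.
            value = 1 if value >= 0.5 else 0
        row[col] = value
    return row

# Hypothetical rows: one numeric feature and one binary flag.
base = {"attack": 100.0, "is_flying": 0}
neighbor = {"attack": 120.0, "is_flying": 1}

row = interpolate_row(base, neighbor, gap=0.75, binary_columns=["is_flying"])
print(row)
```

Without the snapping step, the flag would come out as 0.75, which is not a meaningful value for a binary column.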
The balance() method takes two parameters: a pandas DataFrame and the name of the target column.
Example:
I took the same dataset as in the previous article, the one about Legendary Pokémon.
Classifying our Pokémon without any balancing gave us an accuracy of 95.8%,
which is pretty good, but we can try to increase it with the SMOTETOMEK algorithm:
new_df = tomek.balance(df, 'Legendary')
Now we get an even better accuracy: 99%. Yes, this is a small dataset, but the technique can work really well in specific situations, so I recommend remembering it and trying it the next time you see an imbalanced dataset.
And here is a little plot demonstrating how new examples were sampled
Conclusion:
SMOTETOMEK is an interesting technique that combines both undersampling (using Tomek Links) and oversampling (using SMOTE), and this combination can bring you great results if used wisely.
Thank you for reading!