Using Kydavra ANOVASelector for feature selection
Very often we can look at classification as a problem of finding differences between 2 groups. Before machine learning, statisticians were doing this quite a lot, mostly using metrics such as mean, variance, and standard deviation. However, it was a time-consuming process and couldn’t be done with many groups. Then came the famous statistician Sir Ronald Aylmer Fisher, who proposed a method named analysis of variance (ANOVA for short). We at Sigmoid think that a simple thing can be made simpler, so we added ANOVASelector to kydavra.
Using ANOVASelector from the Kydavra library
For those who are here mostly for the solution to their problem, here are the commands and the code.
To install kydavra, run the following in the command line:
pip install kydavra
After you have cleaned the data (NaN imputation, outlier elimination, and so on), you can apply the selector. ANOVASelector supports both classification and regression tasks, so there are 2 different paths. It takes 2 arguments:
- significance_level (default = 0.05) : the significance level used to select features by p-value.
- classification (default = True) : if set to True, ANOVASelector works as a classification selector; if set to False, it works as a regression selector.
For classification:
from kydavra import ANOVASelector
anova = ANOVASelector()
selected_cols = anova.select(df, 'target')
If we test ANOVASelector on the Pima Indians Diabetes Database, it removes the ‘SkinThickness’ column, and we get somewhat better accuracy on some chunks:
before - [0.75539568 0.75362319 0.74637681 0.84057971 0.78985507]
after - [0.75539568 0.73188406 0.78985507 0.81884058 0.80434783]
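A before/after comparison of this kind could be produced along the following lines. This is only a sketch under stated assumptions: it assumes the five numbers are per-chunk accuracy scores from 5-fold cross-validation, it uses scikit-learn's LogisticRegression as a stand-in model (the model behind the numbers above is not stated), and it runs on synthetic data rather than the Pima dataset. The selected_cols list is a hand-written placeholder for whatever anova.select(df, 'target') returns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 8 features, of which the first 4 are informative.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    n_redundant=0, shuffle=False, random_state=0,
)
cols = [f"f{i}" for i in range(8)]
df = pd.DataFrame(X, columns=cols)
df["target"] = y

# Placeholder for the list returned by anova.select(df, 'target').
selected_cols = cols[:4]

# Stand-in model (assumption); any scikit-learn estimator could be used here.
model = LogisticRegression(max_iter=1000)

# One accuracy score per chunk, with all columns and with the selected columns only.
before = cross_val_score(model, df[cols], df["target"], cv=5)
after = cross_val_score(model, df[selected_cols], df["target"], cv=5)

print("before -", before)
print("after -", after)
```

Comparing the two arrays chunk by chunk shows whether dropping the rejected columns helped, hurt, or made no difference.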
For regression:
from kydavra import ANOVASelector
anova = ANOVASelector(classification = False)
selected_cols = anova.select(df, 'target')
If we test ANOVASelector on the Brazilian houses to rent dataset, we don’t see a big improvement in the performance of the algorithm. However, it removed the ‘animal’ column and slightly reduced the MSE of the model:
before - 1.079789470573728
after - 1.0682555357780834
So how does it work?
Under the hood, this selector uses the p-values found during the ANOVA test. If you want to find out more about ANOVA, I highly recommend this article (linked under Useful links below).
I also explained p-values in the article about Kydavra PValueSelector; I highly recommend checking it out.
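To make the idea concrete, here is a minimal sketch of p-value based selection. It is not kydavra's actual source code: anova_select is a hypothetical helper, and it assumes the test is a per-feature F-test as provided by scikit-learn's f_classif (one-way ANOVA, for classification) and f_regression (univariate linear F-test, for regression). A column is kept when its p-value is below the significance level.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif, f_regression


def anova_select(df, target, significance_level=0.05, classification=True):
    """Hypothetical helper: keep the columns whose F-test p-value is below the threshold."""
    X = df.drop(columns=[target])
    y = df[target]
    # f_classif runs a one-way ANOVA per feature;
    # f_regression runs a univariate linear F-test per feature.
    test = f_classif if classification else f_regression
    _, p_values = test(X, y)
    return [col for col, p in zip(X.columns, p_values) if p < significance_level]


# Toy data (assumption): one column whose mean differs between classes, one pure-noise column.
rng = np.random.default_rng(0)
n = 200
label = np.repeat([0, 1], n // 2)
toy = pd.DataFrame({
    "informative": label * 2.0 + rng.normal(size=n),
    "noise": rng.normal(size=n),
    "target": label,
})

selected = anova_select(toy, "target")
print(selected)
```

The "informative" column gets a very small p-value because its class means are far apart relative to the spread inside each class. A noise column is usually rejected, although at a 0.05 significance level it will still slip through about 1 time in 20 by chance.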
If you have tried kydavra, we invite you to share your impressions by filling out this form.
Made with ❤ from Sigmoid.
Useful links:
- https://towardsdatascience.com/anova-analysis-of-variance-explained-b48fee6380af
- https://medium.com/towards-artificial-intelligence/find-features-that-really-explains-your-data-with-kydavra-pvalueselector-dbb5a1eda783
- https://en.wikipedia.org/wiki/Analysis_of_variance