Using Kydavra JensenShannonSelector for feature selection
What is JensenShannonSelector?
JensenShannonSelector is a feature selector in kydavra that chooses feature columns based on the Jensen-Shannon divergence, a measure of the similarity between two probability distributions.
How to calculate the Jensen-Shannon divergence?
It is based on the Kullback-Leibler divergence with the difference that it is symmetric and has a finite value. It can be calculated with the following formula:
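$$\mathrm{JSD}(P \,\|\, Q) = \frac{1}{2} D_{KL}(P \,\|\, M) + \frac{1}{2} D_{KL}(Q \,\|\, M), \qquad M = \frac{1}{2}(P + Q)$$

where P and Q are the two probability distributions being compared and M is their pointwise mean.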
The Kullback-Leibler divergence can be calculated the following way:
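$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

(for discrete distributions; in the continuous case the sum is replaced by an integral).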
Also, the Jensen-Shannon divergence is the square of the Jensen-Shannon distance, which is a true distance metric.
Using JensenShannonSelector
The JensenShannonSelector constructor has the following parameters:
- min_divergence (float, default: 0.1): the minimum accepted divergence with the target column
- max_divergence (float, default: 0.5): the maximum accepted divergence with the target column
The select method has the following parameters:
- dataframe (pd.DataFrame): the DataFrame on which to apply the selector
- target (str): the name of the column that the feature columns will be compared with
This method returns a list of the column names selected from the dataframe.
Example using JensenShannonSelector
First of all, you should install kydavra if you don’t have it yet:
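pip install kydavra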
Now, you can import the selector:
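from kydavra import JensenShannonSelector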
Import a dataset and create a dataframe out of it:
import pandas as pd
df = pd.read_csv('./heart.csv')
As our selector expects numerical features, let’s keep only the columns that have a numerical data type:
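One way to do this is with the pandas select_dtypes method:

df = df.select_dtypes(include='number')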
Let’s instantiate our selector and choose the features by comparing them with the 'target' column:
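# 0.1 and 0.5 are the default min_divergence and max_divergence bounds
jensen = JensenShannonSelector()
cols = jensen.select(df, 'target')
print(cols)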
The selector returns the list of column names of the features whose divergence from the chosen column lies between min_divergence and max_divergence.
With the heart.csv dataset in this example, the selector returns the following columns:
If we want to limit the divergence of the selected columns to be between 0.5 and 1 relative to the target column, we only need to pass different bounds to the constructor:
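jensen = JensenShannonSelector(min_divergence=0.5, max_divergence=1.0)
cols = jensen.select(df, 'target')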
Use case
Let’s create two classification models to predict the presence of heart disease in a patient. To one of the models we will apply our selector.
You can find the dataset here: https://www.kaggle.com/ronitf/heart-disease-uci.
Let’s import the dataset first:
df = pd.read_csv('./heart.csv')
X = df.drop(columns=['target'])
y = df['target']
Now let’s create the model without the selector:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = LogisticRegression().fit(X_train, y_train)
Now, let’s add some metrics to be able to compare the two models:
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

print('Without selector:')
print(f'accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}')
print(f'recall {recall_score(y_test, clf.predict(X_test)):.2f}')
print(f'AUC {roc_auc_score(y_test, clf.predict(X_test)):.2f}')
Fine. Let’s do the same thing for the second model. But now we will apply our selector. First, import it:
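from kydavra import JensenShannonSelector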
Let’s choose the columns that have a divergence between 0.1 and 0.5 relative to the target column.
jensen = JensenShannonSelector(min_divergence=0.1, max_divergence=0.5)
cols = jensen.select(df, 'target')
print(f'\nselected columns: {cols}')
Now, let’s select these columns from the DataFrame. The target column remains the same:
X = df[cols]
y = df['target']
Continue with creating the model and printing the metrics:
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf_with_selector = LogisticRegression().fit(X_train, y_train)
print('\nWith selector:')
print(f'accuracy: {accuracy_score(y_test, clf_with_selector.predict(X_test)):.2f}')
print(f'recall {recall_score(y_test, clf_with_selector.predict(X_test)):.2f}')
print(f'AUC {roc_auc_score(y_test, clf_with_selector.predict(X_test)):.2f}')
Here is what we get as an output:
accuracy: 0.75
recall 0.76
AUC 0.75
The results improved. Of course, this depends on the dataset, its preprocessing, the model and many other factors, but in general JensenShannonSelector may help improve the performance of a model.
Made with ❤ by Sigmoid.
Follow us on Facebook, Instagram and LinkedIn:
https://www.facebook.com/sigmoidAI
https://www.instagram.com/sigmo.ai/
https://www.linkedin.com/company/sigmoid/