Using Kydavra ItakuraSaitoSelector for feature selection
What is ItakuraSaitoSelector?
It is a selector based on the Itakura-Saito divergence, which measures the difference between an original spectrum and an approximation of it. The spectrum can be thought of as a continuous distribution.
Since in Machine Learning we usually deal with observable data, we will consider discrete (finite) distributions.
How is it calculated?
Itakura-Saito divergence is the Bregman divergence generated by the negative logarithm. It can be calculated with the following formula:

D(P || Q) = Σᵢ ( Pᵢ/Qᵢ − log(Pᵢ/Qᵢ) − 1 )

where P and Q are the two distributions being compared.
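To make the formula concrete, here is the discrete version implemented directly in Python (a minimal sketch for intuition, not kydavra's internal code):

```python
import math

def itakura_saito(p, q):
    """Itakura-Saito divergence between two discrete distributions.

    Each term compares the ratio p_i / q_i against its logarithm;
    every term is non-negative and is 0 only when p_i == q_i.
    """
    return sum(pi / qi - math.log(pi / qi) - 1 for pi, qi in zip(p, q))

# Identical distributions have zero divergence...
print(itakura_saito([0.5, 0.5], [0.5, 0.5]))  # → 0.0
# ...and, unlike a distance metric, the divergence is not symmetric:
print(itakura_saito([0.8, 0.2], [0.5, 0.5]))
print(itakura_saito([0.5, 0.5], [0.8, 0.2]))
```

Note the asymmetry: swapping P and Q changes the value, which is typical of Bregman divergences.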
Using ItakuraSaitoSelector
The ItakuraSaitoSelector constructor has the following parameters:
- EPS (float, default: 0.0001): a small value to add to the feature column in order to avoid division by zero.
- min_divergence (int, default: 0): the minimum accepted divergence with the target column.
- max_divergence (int, default: 10): the maximum accepted divergence with the target column.
The select method has the following parameters:
- dataframe (pd.DataFrame): the dataframe for which to apply the selector
- target (str): the name of the column that the feature columns will be compared with
This method returns a list of the column names selected from the dataframe.
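To make these parameters concrete, here is a rough stdlib-only mimic of the selection rule described above. This is a sketch of the documented behavior, not kydavra's actual implementation, which may normalize values or order the columns differently:

```python
import math

def is_divergence(p, q, eps=0.0001):
    # EPS keeps the ratio and the logarithm finite when a value is zero
    return sum((pi + eps) / (qi + eps)
               - math.log((pi + eps) / (qi + eps)) - 1
               for pi, qi in zip(p, q))

def select(data, target, eps=0.0001, min_divergence=0, max_divergence=10):
    """Keep the columns whose divergence with the target column
    falls inside [min_divergence, max_divergence]."""
    t = data[target]
    return [col for col, values in data.items()
            if col != target
            and min_divergence <= is_divergence(values, t, eps) <= max_divergence]

# Toy data: 'same' matches the target exactly (divergence 0),
# 'far' is wildly off (huge divergence), so only 'same' survives.
data = {
    'target': [1.0, 2.0, 3.0],
    'same':   [1.0, 2.0, 3.0],
    'far':    [1000.0, 2000.0, 3000.0],
}
print(select(data, 'target'))  # → ['same']
```

Widening or narrowing the [min_divergence, max_divergence] window is how you control how similar to the target a feature must be in order to survive.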
Example using ItakuraSaitoSelector
First of all, you should install kydavra if you don’t have it yet:

pip install kydavra
Now, you can import the selector:

from kydavra import ItakuraSaitoSelector

Import a dataset and create a dataframe:

import pandas as pd

df = pd.read_csv('./heart.csv')
Because our selector expects numerical features, let’s select the columns that have a numerical data type:
Let’s instantiate our selector and choose the features by comparing them with the ‘target’ column:
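These two steps might look like the sketch below. The small frame is a made-up stand-in for heart.csv, and the selector call at the end simply follows the constructor and select signatures documented above:

```python
import pandas as pd

# Made-up rows standing in for heart.csv; 'note' is a non-numeric
# column that the dtype filter should drop.
df = pd.DataFrame({
    'age': [63, 37, 41],
    'chol': [233, 250, 204],
    'note': ['a', 'b', 'c'],
    'target': [1, 1, 0],
})

numeric_df = df.select_dtypes(include='number')
print(list(numeric_df.columns))  # → ['age', 'chol', 'target']

# With the real library you would now run, e.g.:
# from kydavra import ItakuraSaitoSelector
# selector = ItakuraSaitoSelector()
# selected = selector.select(numeric_df, 'target')
```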
The selector returns the list of column names of the features whose divergence with the chosen column lies between min_divergence and max_divergence.
With the heart.csv dataset in this example, the selector returns the following columns:
If we limit the divergence of the selected columns to be between 1 and 3 relative to the target column, we get the following:

['age', 'trestbps', 'chol', 'thalach', 'slope', 'thal']
For divergence between 1 and 2 we get:
Use case
As we started with the heart disease dataset, let’s finish with it. We will create a classification model that predicts whether a given patient has heart disease or not.
You can find the dataset here: https://www.kaggle.com/ronitf/heart-disease-uci.
Let’s create two models: one trained on all the features, without feature selection, and one trained only on the features chosen by our selector.
Start with importing the dataset:

import pandas as pd

df = pd.read_csv('./heart.csv')
X = df.drop(columns=['target'])
y = df['target']
Now create the model without the selector:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = LogisticRegression().fit(X_train, y_train)
Now, let’s add some metrics to be able to compare the two models:

from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

print('Without selector:')
print(f'accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}')
print(f'recall {recall_score(y_test, clf.predict(X_test)):.2f}')
print(f'AUC {roc_auc_score(y_test, clf.predict(X_test)):.2f}')
Good. Now let’s do the same thing for the second model and apply the selector. First, import it:

from kydavra import ItakuraSaitoSelector

We will use the columns that have a divergence between 1 and 3 relative to the target column (which in this dataset is conveniently named ‘target’):

itakura = ItakuraSaitoSelector(min_divergence=1, max_divergence=3)
cols = itakura.select(df, 'target')
print(f'\nselected columns: {cols}')
Now, let’s select these columns from the DataFrame. The target column remains the same:

X = df[cols]
y = df['target']

Continue with creating the model and printing the metrics:

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf_with_selector = LogisticRegression().fit(X_train, y_train)
print('\nWith selector:')
print(f'accuracy: {accuracy_score(y_test, clf_with_selector.predict(X_test)):.2f}')
print(f'recall {recall_score(y_test, clf_with_selector.predict(X_test)):.2f}')
print(f'AUC {roc_auc_score(y_test, clf_with_selector.predict(X_test)):.2f}')
Here is what the program outputs:
Without selector:
accuracy: 0.79
recall 0.90
AUC 0.79

selected columns: ['age', 'trestbps', 'chol', 'thalach', 'slope', 'thal']

With selector:
accuracy: 0.82
recall 0.86
AUC 0.81
In the case of the heart disease dataset, which is small, removing columns means losing precious information that the model trains on. But if we had a bigger dataset with correlated feature columns, using ItakuraSaitoSelector could help improve the model.
Made with ❤ by Sigmoid.
Follow us on Facebook, Instagram and LinkedIn:
https://www.facebook.com/sigmoidAI
https://www.instagram.com/sigmo.ai/
https://www.linkedin.com/company/sigmoid/