Reduce data diversity with imperio LogTransformer
One of the biggest problems with data in Data Science is its distribution: it is almost never normal. This happens because we cannot have all the samples in the world in one data set. However, there are several methods that can bring a distribution closer to normal. Today we will take a look at the log transform.
How does Log transformation work?
Log transformation replaces each value x with log(x). It is closely related to the power transforms: it is the limiting case of the Box-Cox transform as its parameter λ approaches 0. The base of the logarithm is usually Euler's number e, but it can be changed. Applying a log transform to a variable has the following effects:
- It reduces the spread (the "diversity") of the variable, because large values are compressed much more than small ones.
- It brings a right-skewed data distribution nearer to a normal one.
Please note that the Box-Cox and Yeo-Johnson transforms usually do this better, because they fit a power parameter to the data. Moreover, Yeo-Johnson does not require the data to be strictly positive. The effect of the log transform is illustrated below.
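To make the two effects above concrete, here is a minimal sketch using plain NumPy rather than imperio. The synthetic log-normal sample and the `skewness` helper are our own assumptions for illustration only; they are not part of the library.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumption: a synthetic, right-skewed, strictly positive sample.
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

def skewness(values):
    """Sample skewness: the third standardized moment."""
    centered = values - values.mean()
    return float((centered ** 3).mean() / values.std() ** 3)

x_log = np.log(x)                    # natural logarithm (base e)
x_log10 = np.log(x) / np.log(10.0)   # change of base: log_b(x) = ln(x) / ln(b)

print("std before / after:", x.std(), x_log.std())
print("skewness before / after:", skewness(x), skewness(x_log))
```

On a sample like this, both the standard deviation and the skewness should drop after the transform. Changing the base only rescales the result by a constant, so it does not change the shape of the distribution.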
Using imperio LogTransformer:
All transformers from imperio follow the transformer API from scikit-learn, which makes them fully compatible with scikit-learn pipelines. If you haven't installed the library yet, you can do so with the following command:
pip install imperio
Now you can import the transformer, fit it and transform some data.
from imperio import LogTransformer
log = LogTransformer()
log.fit(X_train, y_train)
X_transformed = log.transform(X_test)
Also, you can fit and transform the data at the same time.
X_transformed = log.fit_transform(X_train, y_train)
As we said, it can easily be used in a scikit-learn pipeline.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from imperio import LogTransformer

# Chain the log transform and the classifier into a single estimator.
pipe = Pipeline(
[
('log', LogTransformer()),
('model', LogisticRegression())
])
Besides the scikit-learn API, imperio transformers have an additional function that allows the transformer to be applied to a pandas DataFrame.
new_df = log.apply(df, target = 'target', columns=['col1'])
The LogTransformer constructor has the following arguments:
- index (list, default = None): The list of indexes of the columns to apply the transformer on. If set to None it will be applied to all columns.
The apply function has the following arguments:
- df (pd.DataFrame): The pandas DataFrame on which the transformer should be applied.
- target (str): The name of the target column.
- columns (list, default = None): The list with the names of columns on which the transformers should be applied.
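To show what these arguments mean in practice, here is a rough stand-in written with plain pandas and NumPy. It is not imperio's implementation: `apply_log` is a hypothetical helper, and treating `columns=None` as "every column except the target" is our assumption.

```python
import numpy as np
import pandas as pd

def apply_log(df, target, columns=None):
    """Hypothetical stand-in: log-transform feature columns, leave the target untouched."""
    if columns is None:
        # Assumption: None means "every column except the target".
        columns = [c for c in df.columns if c != target]
    out = df.copy()  # do not modify the caller's DataFrame
    out[columns] = np.log(out[columns].astype(float))
    return out

df = pd.DataFrame({
    "col1": [1.0, 10.0, 100.0],
    "col2": [2.0, 4.0, 8.0],
    "target": [0, 1, 0],
})
new_df = apply_log(df, target="target", columns=["col1"])
print(new_df)
```

Only col1 is transformed here; col2 and the target column come back unchanged, and the original df is left as it was.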
Important note: Before using the log transform, please make sure that your data is strictly positive (no zeros or negative values) and that no step in the pipeline before it generates zero or negative values.
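One way to enforce this is to validate the data before transforming it. The sketch below uses plain NumPy; `check_strictly_positive` is a hypothetical helper of ours, not something imperio provides.

```python
import numpy as np

def check_strictly_positive(X):
    """Return X as a float array, or raise ValueError if any value is <= 0 or NaN."""
    arr = np.asarray(X, dtype=float)
    # NaN > 0 evaluates to False, so missing values are rejected as well.
    if not np.all(arr > 0):
        raise ValueError("Log transform requires strictly positive values.")
    return arr

X_ok = check_strictly_positive([[1.0, 10.0], [2.5, 100.0]])
X_logged = np.log(X_ok)  # safe: every entry is greater than zero
```

Running a check like this right before the transformer makes the failure explicit, instead of letting -inf or NaN values flow silently into the model.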
Made with ❤ by Sigmoid.