LogCpTransformer#

The LogCpTransformer() applies the transformation log(x + C), where C is a positive constant.

You can enter the positive quantity to add to the variable. Alternatively, the transformer will find the necessary quantity to make all values of the variable positive.

Example#

Let’s load the California housing dataset that comes with Scikit-learn and separate it into train and test sets.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

from feature_engine import transformation as vt

# Load dataset
X, y = fetch_california_housing( return_X_y=True, as_frame=True)

# Separate into train and test sets
X_train, X_test, y_train, y_test =  train_test_split(
    X, y, test_size=0.3, random_state=0)

Now we want to apply the logarithm to 2 of the variables in the dataset using the LogCpTransformer(). We want the transformer to detect automatically the quantity “C” that needs to be added to the variable:

# set up the variable transformer
tf = vt.LogCpTransformer(variables = ["MedInc", "HouseAge"], C="auto")

# fit the transformer
tf.fit(X_train)

With fit() the LogCpTransformer() learns the quantity “C” and stores it as an attribute. We can visualise the learned parameters as follows:

# learned constant C
tf.C_
{'MedInc': 1.4999, 'HouseAge': 2.0}

Applying the log of a variable plus a constant in this dataset does not make much sense because all variables are positive, that is why the constant values C for the former variables are possible.

We will carry on with the demo anyways.

We can now go ahead and transform the variables:

# transform the data
train_t= tf.transform(X_train)
test_t= tf.transform(X_test)

Then we can plot the original variable distribution:

# un-transformed variable
X_train["MedInc"].hist(bins=20)
plt.title("MedInc - original distribution")
plt.ylabel("Number of observations")
../../_images/logcpraw.png

And the distribution of the transformed variable:

# transformed variable
train_t["MedInc"].hist(bins=20)
plt.title("MedInc - transformed distribution")
plt.ylabel("Number of observations")
../../_images/logcptransform.png

More details#

You can find more details about the LogCpTransformer() here:

For more details about this and other feature engineering methods check out these resources: