BoxCoxTransformer

The BoxCoxTransformer() applies the BoxCox transformation to numerical variables.

The Box-Cox transform is given by:

y = (x**lmbda - 1) / lmbda,  for lmbda != 0
y = log(x),                  for lmbda == 0

The Box-Cox transformation implemented by this transformer is that of scipy.stats.

The Box-Cox transformation works only for strictly positive variables (>0). If the variable contains zero or negative values, the BoxCoxTransformer() will raise an error.

If the variable contains values <=0, you should try the YeoJohnsonTransformer() instead.
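The same behavior can be seen directly with scipy.stats.boxcox, which this transformer wraps, including the error on non-positive data. A small sketch with made-up values:

```python
import numpy as np
from scipy import stats

# Strictly positive sample data (hypothetical values)
x = np.array([1.0, 2.0, 5.0, 10.0, 50.0])

# With no lambda given, boxcox returns the transformed data and the
# lambda that maximizes the log-likelihood
x_t, lmbda = stats.boxcox(x)

# With a fixed lambda, the transform follows the formula above:
# lambda = 0 reduces to log(x)
x_fixed = stats.boxcox(x, lmbda=0)
print(np.allclose(x_fixed, np.log(x)))

# Non-positive values raise an error, as with BoxCoxTransformer()
try:
    stats.boxcox(np.array([-1.0, 0.0, 1.0]))
except ValueError as err:
    print("ValueError:", err)
```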

Example

Let’s load the house prices dataset and separate it into train and test sets (more details about the dataset here).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine import transformation as vt

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
            data.drop(['Id', 'SalePrice'], axis=1),
            data['SalePrice'], test_size=0.3, random_state=0)

Now we apply the Box-Cox transformation to the two indicated variables:

# set up the variable transformer
tf = vt.BoxCoxTransformer(variables = ['LotArea', 'GrLivArea'])

# fit the transformer
tf.fit(X_train)
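Since the transformer wraps scipy.stats, a rough sketch of what fit() computes for each variable (the maximum-likelihood lambda) can be reproduced with scipy directly. The data below is synthetic, standing in for LotArea and GrLivArea in case houseprice.csv is not at hand:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, strictly positive stand-ins for the two variables
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=0.5, size=200),
    "GrLivArea": rng.lognormal(mean=7.0, sigma=0.4, size=200),
})

# scipy.stats.boxcox returns (transformed data, optimal lambda);
# one lambda per variable mirrors what the transformer learns
lambdas = {col: stats.boxcox(df[col].to_numpy())[1] for col in df.columns}
print(lambdas)
```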

With fit(), the BoxCoxTransformer() learns the optimal lambda for each of the indicated variables. Now we can go ahead and transform the data:

# transform the data
train_t = tf.transform(X_train)
test_t = tf.transform(X_test)
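Because the transformation comes from scipy, it can also be undone with scipy.special.inv_boxcox given the learned lambda. A minimal round-trip check on made-up values:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Hypothetical strictly positive values
x = np.array([1.0, 5.0, 20.0, 100.0])

# Forward transform with the ML-optimal lambda
x_t, lmbda = stats.boxcox(x)

# inv_boxcox reverses the transform when given the same lambda
x_back = inv_boxcox(x_t, lmbda)
print(np.allclose(x_back, x))
```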

Next, we make a histogram of the original variable distribution:

# un-transformed variable
X_train['LotArea'].hist(bins=50)
[Image: histogram of the untransformed LotArea variable]

And now, we can explore the distribution of the variable after the transformation:

# transformed variable
train_t['LotArea'].hist(bins=50)

[Image: histogram of LotArea after the Box-Cox transformation]
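The histograms are not reproduced here, but the effect can also be checked numerically: the transformation pulls a right-skewed variable toward symmetry, so its skewness should shrink toward zero. A small sketch with synthetic lognormal data:

```python
import numpy as np
from scipy import stats

# Right-skewed, strictly positive synthetic data
rng = np.random.default_rng(42)
x = rng.lognormal(mean=9.0, sigma=0.5, size=1000)

skew_before = stats.skew(x)

# Box-Cox with the ML-optimal lambda
x_t, lmbda = stats.boxcox(x)
skew_after = stats.skew(x_t)

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```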

More details

You can find more details about the BoxCoxTransformer() here:

All notebooks can be found in a dedicated repository.