DecisionTreeEncoder

The DecisionTreeEncoder() replaces categories in the variable with the predictions of a decision tree.

The transformer first encodes the categorical variables into numerical variables using OrdinalEncoder(). The integers can be assigned to the categories arbitrarily, in the order in which the categories appear in the variable, or ordered by the mean value of the target per category. You control this behaviour with the parameter encoding_method. Because decision trees can pick up non-linear relationships, replacing categories with arbitrary numbers should be enough in practice.
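To make the two numbering schemes concrete, here is a minimal pandas sketch on a made-up toy frame. It only mimics what the two options are described as doing; the data, the variable names, and the ascending sort by target mean are assumptions for illustration, not the library's internal code.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "cabin": ["E", "M", "C", "M", "C", "M"],
    "survived": [1, 0, 1, 0, 0, 1],
})

# Arbitrary numbering: integers follow the order in which categories appear.
arbitrary = {cat: i for i, cat in enumerate(df["cabin"].unique())}

# Ordered numbering: integers follow the mean of the target per category
# (assumed here to be sorted in ascending order).
target_means = df.groupby("cabin")["survived"].mean().sort_values()
ordered = {cat: i for i, cat in enumerate(target_means.index)}

print(arbitrary)
print(ordered)
```

In this toy frame the two schemes give different integers to the same category, which is the only difference the decision tree sees downstream.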

Next, the transformer fits a decision tree that uses this numerical variable to predict the target. Finally, it replaces the original categorical variable with the predictions of the decision tree.
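The steps described so far can be reproduced by hand with pandas and scikit-learn. The sketch below is an approximation under stated assumptions: the toy data is made up, the tree depth is fixed at 2 instead of being tuned, and the encoded value is taken to be the predicted probability of class 1. It is not the encoder's actual implementation.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "colour": ["red", "red", "blue", "blue", "green", "green", "red", "blue"],
    "target": [1, 1, 0, 0, 1, 0, 1, 0],
})

# Step 1: arbitrary ordinal encoding, integers assigned in order of appearance.
mapping = {cat: i for i, cat in enumerate(df["colour"].unique())}
colour_num = df["colour"].map(mapping).to_frame()

# Step 2: fit a shallow decision tree on the single numerical variable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(colour_num, df["target"])

# Step 3: replace each category with the tree's prediction
# (assumed here to be the probability of class 1).
df["colour_encoded"] = tree.predict_proba(colour_num)[:, 1]

print(df[["colour", "colour_encoded"]].drop_duplicates())
```

Every row of a given category receives the same encoded value, because the tree only ever sees the integer that stands for that category.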

The motivation behind the DecisionTreeEncoder() is to create monotonic relationships between the categorical variables and the target.

Let’s look at an example using the Titanic Dataset.

First, let’s load the data and separate it into train and test:

from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.encoding import DecisionTreeEncoder

X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
    predictors_only=True,
    cabin="letter_only",
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train[['cabin', 'pclass', 'embarked']].head(10))

We will encode the following categorical variables:

    cabin  pclass embarked
        M       2        S
        M       2        S
        M       2        C
        M       3        Q
        M       3        Q
        M       3        Q
        E       1        C
        M       2        S
        C       1        C
        E       1        S

We set up the encoder to encode the variables above, using a grid search with 3-fold cross-validation to find the optimal depth of the decision tree (this is the default behaviour of the DecisionTreeEncoder()). In this example, we optimize the tree using the roc-auc metric.

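encoder = DecisionTreeEncoder(
    variables=['cabin', 'pclass', 'embarked'],
    regression=False,    # the target is binary, so fit classification trees
    scoring='roc_auc',   # metric used to select the best tree
    cv=3,                # 3-fold cross-validation
    random_state=0,
    ignore_format=True,  # also accept pclass, which is cast as numeric
)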

encoder.fit(X_train, y_train)
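For intuition, the depth search described above resembles a plain scikit-learn grid search over max_depth on one ordinal-encoded variable. The sketch below is a rough equivalent under assumptions: the synthetic data is made up, and the depth grid [1, 2, 3, 4] is chosen for illustration rather than read from the encoder.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical ordinal-encoded variable with 5 categories (integers 0 to 4).
x = rng.integers(0, 5, size=300).reshape(-1, 1)

# Hypothetical binary target whose rate depends non-linearly on the category.
rates = np.array([0.2, 0.8, 0.3, 0.7, 0.5])
y = (rng.random(300) < rates[x.ravel()]).astype(int)

# Grid search over the tree depth with 3-fold cross-validation and roc-auc.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4]},  # assumed grid
    scoring="roc_auc",
    cv=3,
)
search.fit(x, y)

# The refitted best tree provides the values that would replace the categories.
encoded = search.predict_proba(x)[:, 1]
print(search.best_params_)
```

Because the tree sees a single variable with 5 distinct values, the encoded column can contain at most 5 distinct predictions, and fewer when the selected tree groups categories into the same leaf.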

With fit(), the DecisionTreeEncoder() fits one decision tree per variable. Now we can go ahead and transform the categorical variables into numbers, using the predictions of these trees:

train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

print(train_t[['cabin', 'pclass', 'embarked']].head(10))

We can see the encoded variables below:

    cabin    pclass  embarked
 0.304843  0.436170  0.338957
 0.304843  0.436170  0.338957
 0.304843  0.436170  0.553073
 0.304843  0.259036  0.373494
 0.304843  0.259036  0.373494
 0.304843  0.259036  0.373494
 0.611650  0.617391  0.553073
 0.304843  0.436170  0.338957
 0.611650  0.617391  0.553073
 0.611650  0.617391  0.338957
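One way to read this output is to pair each original category with its encoded value. The sketch below does so for the ten rows shown in this example, retyped by hand into two small frames; the helper names (before, after, mappings) are made up for illustration.

```python
import pandas as pd

# The ten rows shown in this example, before and after encoding.
before = pd.DataFrame({
    "cabin": ["M", "M", "M", "M", "M", "M", "E", "M", "C", "E"],
    "pclass": [2, 2, 2, 3, 3, 3, 1, 2, 1, 1],
    "embarked": ["S", "S", "C", "Q", "Q", "Q", "C", "S", "C", "S"],
})
after = pd.DataFrame({
    "cabin": [0.304843] * 6 + [0.611650, 0.304843, 0.611650, 0.611650],
    "pclass": [0.436170] * 3 + [0.259036] * 3
    + [0.617391, 0.436170, 0.617391, 0.617391],
    "embarked": [0.338957, 0.338957, 0.553073, 0.373494, 0.373494,
                 0.373494, 0.553073, 0.338957, 0.553073, 0.338957],
})

mappings = {}
consistent = True
for col in before.columns:
    pairs = pd.DataFrame({"category": before[col], "encoded": after[col]})
    # Check that each category maps to exactly one encoded value.
    per_category = pairs.groupby("category")["encoded"].nunique()
    consistent = consistent and bool((per_category == 1).all())
    mappings[col] = (
        pairs.drop_duplicates().set_index("category")["encoded"].to_dict()
    )

print(consistent)
print(mappings)
```

In these rows, cabins E and C share the value 0.611650, which suggests the tree placed them in the same leaf. With the full data, the same pairing can be applied to X_train and train_t directly.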

Additional resources

In the following notebook, you can find more details on the DecisionTreeEncoder() functionality, along with example plots of the encoded variables:

Jupyter notebook

For more details about this and other feature engineering methods, check out these resources:

Feature Engineering for Machine Learning

Or read our book:

Python Feature Engineering Cookbook

Both our book and course are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.
