DecisionTreeCategoricalEncoder

The DecisionTreeCategoricalEncoder() replaces the categories of a variable with the predictions of a decision tree. The transformer first encodes the categorical variable into a numerical variable using ordinal encoding. The integers can be assigned to the categories in the order in which they appear in the variable, or ordered according to the mean value of the target per category. Next, the transformer fits a decision tree on this numerical variable to predict the target variable. Finally, the original categorical values are replaced by the predictions of the decision tree.

The DecisionTreeCategoricalEncoder() works only with categorical variables. A list of variables can be indicated, or alternatively, the encoder will automatically select all categorical variables in the train set.

Note that a separate decision tree is fit for each variable; this transformer does not combine variables.
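
Conceptually, the transformer behaves roughly like the sketch below for each variable. This is a hedged, illustrative re-implementation with plain scikit-learn, not feature_engine's actual code; the helper tree_encode() and its parameter values are made up for the example.

import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

def tree_encode(X, y, variable, random_state=0):
    # illustrative helper, not part of feature_engine
    # 1. ordinal encoding: number the categories as they appear in the variable
    mapping = {category: i for i, category in enumerate(X[variable].unique())}
    codes = X[variable].map(mapping).to_frame()
    # 2. fit a shallow decision tree on the codes to predict the target,
    #    tuning max_depth with cross-validation
    tree = GridSearchCV(
        DecisionTreeRegressor(random_state=random_state),
        param_grid={'max_depth': [1, 2, 3, 4]},
        cv=3,
        scoring='neg_mean_squared_error',
    )
    tree.fit(codes, y)
    # 3. replace the original categories with the predictions of the tree
    return pd.Series(tree.predict(codes), index=X.index, name=variable)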

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine import categorical_encoders as ce

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # mark missing values
    data = data.replace('?', np.nan)
    # keep only the first letter of the cabin
    data['cabin'] = data['cabin'].astype(str).str[0]
    # cast pclass to object so it is treated as a categorical variable
    data['pclass'] = data['pclass'].astype('O')
    # impute missing embarked values with 'C'
    data['embarked'].fillna('C', inplace=True)
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

X_train[['cabin', 'pclass', 'embarked']].head(10)
      cabin pclass embarked
501      n      2        S
588      n      2        S
402      n      2        C
1193     n      3        Q
686      n      3        Q
971      n      3        Q
117      E      1        C
540      n      2        S
294      C      1        C
261      E      1        S
# set up the encoder
encoder = ce.DecisionTreeCategoricalEncoder(
    variables=['cabin', 'pclass', 'embarked'], random_state=0)

# fit the encoder
encoder.fit(X_train, y_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

train_t[['cabin', 'pclass', 'embarked']].head(10)
     cabin    pclass  embarked
501   0.304843  0.307580  0.338957
588   0.304843  0.307580  0.338957
402   0.304843  0.307580  0.558011
1193  0.304843  0.307580  0.373494
686   0.304843  0.307580  0.373494
971   0.304843  0.307580  0.373494
117   0.649533  0.617391  0.558011
540   0.304843  0.307580  0.338957
294   0.649533  0.617391  0.558011
261   0.649533  0.617391  0.338957
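
Because a regression tree predicts, in each leaf, the mean of the target over the observations in that leaf, the encoded values above correspond to target means over a category, or over a group of categories that the tree merged together. A quick, hedged way to inspect this for pclass, reusing the objects created above:

# compare the encoded values of 'pclass' with the target mean per class
check = pd.concat(
    [X_train['pclass'], train_t['pclass'].rename('pclass_encoded'), y_train],
    axis=1,
)
print(check.groupby('pclass').mean())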

API Reference

class feature_engine.categorical_encoders.DecisionTreeCategoricalEncoder(encoding_method='arbitrary', cv=3, scoring='neg_mean_squared_error', param_grid={'max_depth': [1, 2, 3, 4]}, regression=True, random_state=None, variables=None)[source]

The DecisionTreeCategoricalEncoder() encodes categorical variables with predictions of a decision tree model.

The categorical variable is first encoded into integers with the OrdinalCategoricalEncoder(). The integers can be assigned to the categories arbitrarily or following the mean value of the target in each category.

Then a decision tree will be fit using the resulting numerical variable to predict the target variable. Finally, the original categorical variable values will be replaced by the predictions of the decision tree.

Parameters
  • encoding_method (str, default='arbitrary') –

    The categorical encoding method that will be used to encode the original categories to numerical values.

    'ordered': the categories are numbered in ascending order according to the target mean value per category.

    'arbitrary': categories are numbered arbitrarily.

  • cv (int, default=3) – Desired number of cross-validation folds to be used to fit the decision tree.

  • scoring (str, default='neg_mean_squared_error') – Desired metric to optimise the performance of the tree. It takes any of the scoring metrics from scikit-learn. See the DecisionTreeRegressor or DecisionTreeClassifier model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html

  • regression (boolean, default=True) – Indicates whether the encoder should train a regression or a classification decision tree.

  • param_grid (dictionary, default={'max_depth': [1,2,3,4]}) – The grid of parameters over which the decision tree is optimised during the grid search. The param_grid can contain any of the permitted parameters of Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier().

  • random_state (int, default=None) – The random_state to initialise the training of the decision tree. It is one of the parameters of Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility it is recommended to set the random_state to an integer.

  • variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and select all object type variables.
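
As a hedged illustration of these parameters, the encoder from the Titanic example above could instead be set up with ordered ordinal encoding and a classification tree. The scoring metric, grid and cv values below are arbitrary choices for the sketch, not recommendations:

encoder_clf = ce.DecisionTreeCategoricalEncoder(
    encoding_method='ordered',   # number the categories by target mean first
    regression=False,            # binary target, so fit a classification tree
    scoring='roc_auc',           # any scikit-learn classification metric
    param_grid={'max_depth': [1, 2, 3]},
    cv=5,
    random_state=0,
    variables=['cabin', 'pclass', 'embarked'])

encoder_clf.fit(X_train, y_train)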

encoder_

Encoder pipeline containing the ordinal encoder and decision tree discretiser.

Type

sklearn Pipeline
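
After fit(), the attribute can be inspected like any other scikit-learn Pipeline. A minimal sketch, assuming the encoder fitted in the example above (the exact step names inside the pipeline may vary between versions):

# assuming 'encoder' has already been fitted as in the example above
print(encoder.encoder_)        # the full pipeline
print(encoder.encoder_.steps)  # its individual steps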

fit(X, y=None)[source]

Learns the numbers that should be used to replace the categories in each variable.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the categorical variables.

  • y (pandas series) – The target variable. Required to train the decision tree and for ordered ordinal encoding.

transform(X)[source]

Returns the predictions of the decision tree based on the variable’s original value.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.

Returns

X_transformed – Dataframe with variables encoded with decision tree predictions.

Return type

pandas dataframe of shape = [n_samples, n_features]
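
Since the transformer follows the scikit-learn API, it can also be used as a step of a scikit-learn Pipeline. The end-to-end sketch below is hedged: it reuses the train/test split from the example above, restricts the data to the three categorical columns, and the choice of classifier is purely illustrative.

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

cols = ['cabin', 'pclass', 'embarked']

pipe = Pipeline([
    ('encoder', ce.DecisionTreeCategoricalEncoder(variables=cols, random_state=0)),
    ('clf', LogisticRegression()),
])

pipe.fit(X_train[cols], y_train)
print(pipe.score(X_test[cols], y_test))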