DecisionTreeDiscretiser

API Reference

class feature_engine.discretisation.DecisionTreeDiscretiser(variables=None, cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=None)

The DecisionTreeDiscretiser() replaces continuous numerical variables with discrete, finite values estimated by a decision tree.

The method is inspired by the following article from the winners of the KDD 2009 competition: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf

The DecisionTreeDiscretiser() works only with numerical variables. A list of variables can be passed as an argument. Alternatively, the discretiser will automatically select all numerical variables.

The DecisionTreeDiscretiser() first trains a decision tree for each variable, using the variable alone to predict the target.

It then transforms the variables, that is, it replaces the original values with the predictions of the trained trees.
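Conceptually, the per-variable procedure resembles the following sketch (a minimal illustration with made-up data; this is not the library's internal code):

import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# One numerical variable and a continuous target (hypothetical data).
X = pd.DataFrame({'var': [1.0, 2.0, 3.0, 40.0, 50.0, 60.0]})
y = pd.Series([10.0, 12.0, 11.0, 90.0, 95.0, 92.0])

# Cross-validated grid search over the tree depth, one tree per variable.
tree = GridSearchCV(
    DecisionTreeRegressor(),
    param_grid={'max_depth': [1, 2, 3, 4]},
    cv=3,
    scoring='neg_mean_squared_error',
)
tree.fit(X[['var']], y)

# The tree's predictions take one value per terminal leaf,
# so the transformed variable contains a finite set of values.
X['var'] = tree.predict(X[['var']])
print(X['var'].unique())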

Parameters
variables: list, default=None

The list of numerical variables to transform. If None, the discretiser will automatically select all numerical variables.

cv: int, default=3

Number of cross-validation folds used to fit the decision tree.

scoring: str, default='neg_mean_squared_error'

Metric to optimise during the grid search for the best decision tree. Comes from sklearn.metrics. See the DecisionTreeRegressor or DecisionTreeClassifier model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html

param_grid: dictionary, default=None

The dictionary of parameters over which the decision tree should be optimised during the grid search. The param_grid can contain any of the permitted parameters for Scikit-learn's DecisionTreeRegressor() or DecisionTreeClassifier().

If None, then param_grid = {'max_depth': [1, 2, 3, 4]}.

regression: boolean, default=True

Indicates whether the discretiser should train a regression or a classification decision tree.

random_state: int, default=None

The random_state to initialise the training of the decision tree. It is one of the parameters of Scikit-learn's DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility, it is recommended to set random_state to an integer.
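For illustration, a non-default configuration combining these parameters might look as follows (the variable names are hypothetical):

from feature_engine.discretisation import DecisionTreeDiscretiser

# Classification trees, a custom parameter grid and a fixed seed.
disc = DecisionTreeDiscretiser(
    variables=['age', 'fare'],
    cv=5,
    scoring='roc_auc',
    param_grid={'max_depth': [2, 3], 'min_samples_leaf': [10, 50]},
    regression=False,
    random_state=42,
)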

Attributes

binner_dict_:

Dictionary containing the fitted tree per variable.

scores_dict_:

Dictionary with the score of the best decision tree per variable, over the train set.

variables_:

The variables to discretise.

n_features_in_:

The number of features in the train set used in fit.
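After calling fit(), these attributes can be inspected directly. For instance, with the fitted transformer disc from the example further below:

# Cross-validation score of the best tree found for each variable.
print(disc.scores_dict_)

# Variables that were discretised, and number of features seen during fit.
print(disc.variables_)
print(disc.n_features_in_)

# Best hyperparameters found for one variable.
print(disc.binner_dict_['GrLivArea'].best_params_)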

See also

sklearn.tree.DecisionTreeClassifier
sklearn.tree.DecisionTreeRegressor

References

[1] Niculescu-Mizil, A., et al. "Winning the KDD Cup Orange Challenge with Ensemble Selection". JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009. http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf

Methods

fit:

Fit a decision tree per variable.

transform:

Replace continuous values by the predictions of the decision tree.

fit_transform:

Fit to the data, then transform it.

fit(X, y)

Fit the decision trees. One tree per variable to be transformed.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The training dataset. Can be the entire dataframe, not just the variables to be transformed.

y: pandas series.

Target variable. Required to train the decision tree.

Returns
self
Raises
TypeError
  • If the input is not a Pandas DataFrame

  • If any of the user-provided variables is not numerical

ValueError
  • If there are no numerical variables in the dataframe, or the dataframe is empty

  • If the variable(s) contain null values

transform(X)

Replaces the original variable values with the predictions of the tree. The tree output takes a finite number of values, that is, it is discrete.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input samples.

Returns
X_transformed: pandas dataframe of shape = [n_samples, n_features]

The dataframe with transformed variables.

Raises
TypeError
  • If the input is not a Pandas DataFrame

ValueError
  • If the variable(s) contain null values

  • If the dataframe does not contain the same number of features as the dataframe used in fit()

Example

In the original article, each feature of the challenge dataset was recoded by training a decision tree of limited depth (2, 3 or 4) using that feature alone, and letting the tree predict the target. The probabilistic predictions of this decision tree were used as an additional feature, which was then linearly (or at least monotonically) correlated with the target.

According to the authors, the addition of these new features had a significant impact on the performance of linear models.
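The recipe from the article can be sketched with scikit-learn directly: train a shallow decision tree on a single feature and use its probabilistic predictions as a new feature (a rough sketch on synthetic data, not the authors' exact code):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Synthetic feature with a non-linear relationship to a binary target.
x = rng.uniform(0, 10, size=1000)
y = ((x > 3) & (x < 7)).astype(int)

# Shallow tree (depth 2) trained on that feature alone, as in the article.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(x.reshape(-1, 1), y)

# The predicted probability becomes an additional feature that is
# monotonically related to the target.
new_feature = tree.predict_proba(x.reshape(-1, 1))[:, 1]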

In the following example, we recode 2 numerical variables using decision trees.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine.discretisation import DecisionTreeDiscretiser

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
            data.drop(['Id', 'SalePrice'], axis=1),
            data['SalePrice'], test_size=0.3, random_state=0)

# set up the discretisation transformer
disc = DecisionTreeDiscretiser(cv=3,
                               scoring='neg_mean_squared_error',
                               variables=['LotArea', 'GrLivArea'],
                               regression=True)

# fit the transformer
disc.fit(X_train, y_train)

# transform the data
train_t = disc.transform(X_train)
test_t = disc.transform(X_test)

disc.binner_dict_
{'LotArea': GridSearchCV(cv=3, error_score='raise-deprecating',
              estimator=DecisionTreeRegressor(criterion='mse', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=None,
                                              splitter='best'),
              iid='warn', n_jobs=None, param_grid={'max_depth': [1, 2, 3, 4]},
              pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
              scoring='neg_mean_squared_error', verbose=0),
 'GrLivArea': GridSearchCV(cv=3, error_score='raise-deprecating',
              estimator=DecisionTreeRegressor(criterion='mse', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=None,
                                              splitter='best'),
              iid='warn', n_jobs=None, param_grid={'max_depth': [1, 2, 3, 4]},
              pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
              scoring='neg_mean_squared_error', verbose=0)}
# with tree discretisation, each bin does not necessarily contain
# the same number of observations.
train_t.groupby('GrLivArea')['GrLivArea'].count().plot.bar()
plt.ylabel('Number of houses')
[Figure: bar plot of the number of houses per discretised GrLivArea value]
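The same transformer works for classification targets. A hypothetical setup for a binary target (the variable names are illustrative, and X_train / y_train are assumed to hold the predictors and the binary target) might be:

# Train classification trees instead of regression trees.
disc = DecisionTreeDiscretiser(cv=3,
                               scoring='roc_auc',
                               variables=['age', 'fare'],
                               regression=False,
                               random_state=0)

train_t = disc.fit_transform(X_train, y_train)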