DecisionTreeDiscretiser

class feature_engine.discretisation.DecisionTreeDiscretiser(variables=None, cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=None)

The DecisionTreeDiscretiser() replaces numerical variables with discrete variables, that is, variables with a finite number of values. Those values are the predictions of a decision tree.

The method is inspired by the following article from the winners of the KDD 2009 competition: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf

The DecisionTreeDiscretiser() trains one decision tree per variable. It then replaces the values of each variable with the predictions of its tree.

The DecisionTreeDiscretiser() works only with numerical variables. A list of variables to transform can be indicated. Alternatively, the discretiser will automatically select all numerical variables.

More details in the User Guide.

Parameters
variables: list, default=None

The list of numerical variables to transform. If None, the discretiser will automatically select all numerical variables.

cv: int, default=3

Number of cross-validation folds used to fit the decision tree.

scoring: str, default='neg_mean_squared_error'

The metric used to evaluate and select the decision tree. It must be one of the scoring names defined by scikit-learn (see sklearn.metrics). For the options available for regression and classification trees, see the scikit-learn model evaluation documentation: https://scikit-learn.org/stable/modules/model_evaluation.html
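To make the meaning of a scoring string more concrete, the sketch below resolves 'neg_mean_squared_error' with scikit-learn's `get_scorer` and applies it to a tree fitted on made-up data. It uses scikit-learn directly and only illustrates what the name stands for; it is not a description of the discretiser's internals.

```python
import numpy as np
from sklearn.metrics import get_scorer, mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, made up for this illustration.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * x[:, 0] + rng.normal(0, 1, 100)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x, y)

# A scoring name maps to a callable with signature scorer(estimator, X, y).
scorer = get_scorer("neg_mean_squared_error")
score = scorer(tree, x, y)

# "neg_" scorers return the negated error, so that higher is always better.
mse = mean_squared_error(y, tree.predict(x))
print(score, -mse)
```

Because scikit-learn scorers follow a "greater is better" convention, error metrics are negated, which is why the default name starts with `neg_`.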

param_grid: dictionary, default=None

The hyperparameters for the decision tree to test with a grid search. The param_grid can contain any of the permitted hyperparameters for Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier().

If None, then param_grid = {'max_depth': [1, 2, 3, 4]}.
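As a small sketch of how a `param_grid` expands into candidate trees, the example below uses scikit-learn's `ParameterGrid` on the default grid and on a custom one. The custom grid and its values are hypothetical choices for illustration.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeRegressor

# The default grid stated above, and a hypothetical custom grid.
default_grid = {"max_depth": [1, 2, 3, 4]}
custom_grid = {"max_depth": [2, 3], "min_samples_leaf": [5, 10, 20]}

# Every key must be a hyperparameter of the underlying tree.
valid_names = set(DecisionTreeRegressor().get_params())
unknown = set(custom_grid) - valid_names

# A grid search tries every combination of the listed values.
n_default = len(ParameterGrid(default_grid))
n_custom = len(ParameterGrid(custom_grid))
print(n_default, n_custom, unknown)
```

The number of candidates is the product of the list lengths, and each candidate is fitted once per cross-validation fold, so large grids multiply the training cost for every variable.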

regression: boolean, default=True

Indicates whether the discretiser should train a regression or a classification decision tree.
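For a binary target, a classification tree can play the same role. The sketch below uses scikit-learn directly and assumes that the predicted probability of the positive class is what replaces the variable; that choice, the synthetic data, and the 'roc_auc' scoring are assumptions for illustration and are not stated in this reference.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary target, made up for this illustration.
rng = np.random.default_rng(0)
X = pd.DataFrame({"age": rng.uniform(18, 80, 300)})
y = pd.Series((X["age"] + rng.normal(0, 5, 300) > 50).astype(int))

# Grid search over tree depth, scored with a classification metric.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3]},
    cv=3,
    scoring="roc_auc",
)
search.fit(X[["age"]], y)

# Assumed replacement value: probability of class 1 in each leaf.
proba = search.predict_proba(X[["age"]])[:, 1]
print(np.unique(proba))
```

Each leaf yields one probability, so the result is still a finite set of values, bounded here by the 8 leaves that a depth-3 tree can have.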

random_state: int, default=None

The random_state used to initialise the training of the decision tree. It is one of the parameters of Scikit-learn’s DecisionTreeRegressor() and DecisionTreeClassifier(). For reproducibility, it is recommended to set random_state to an integer.

Attributes
binner_dict_:

Dictionary containing the fitted tree per variable.

scores_dict_:

Dictionary with the score of the best decision tree per variable.

variables_:

The variables that will be discretised.

n_features_in_:

The number of features in the train set used in fit.

See also

sklearn.tree.DecisionTreeClassifier
sklearn.tree.DecisionTreeRegressor

References

1

Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf

Methods

fit:

Fit a decision tree per variable.

transform:

Replace continuous variable values by the predictions of the decision tree.

fit_transform:

Fit to the data, then transform it.

fit(X, y)

Fit one decision tree per variable to discretise, using cross-validation and a grid search over the hyperparameters.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The training dataset. Can be the entire dataframe, not just the variables to be transformed.

y: pandas series.

Target variable. Required to train the decision tree.
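The fitting step described above can be approximated with scikit-learn alone. The sketch below runs one grid search per variable and stores the results in two plain dictionaries, `trees` and `scores`, which are hypothetical names loosely mirroring binner_dict_ and scores_dict_. It is a minimal sketch on synthetic data, not the library's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on "age" only; "income" is noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.uniform(18, 80, 300),
    "income": rng.uniform(10, 100, 300),
})
y = pd.Series(np.where(X["age"] > 50, 10.0, 0.0) + rng.normal(0, 0.5, 300))

trees = {}   # fitted grid search per variable
scores = {}  # best cross-validated score per variable

for var in ["age", "income"]:
    search = GridSearchCV(
        DecisionTreeRegressor(random_state=0),
        param_grid={"max_depth": [1, 2, 3, 4]},
        cv=3,
        scoring="neg_mean_squared_error",
    )
    # Each tree sees a single column, so it learns cut points on that variable.
    search.fit(X[[var]], y)
    trees[var] = search
    scores[var] = search.best_score_

print(scores)
```

In this sketch the score for `age` should be much closer to zero than the score for `income`, since only `age` carries information about the target.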

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
X: array-like of shape (n_samples, n_features)

Input samples.

y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params: dict

Additional fit parameters.

Returns
X_new: ndarray of shape (n_samples, n_features_new)

Transformed array.

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: dict

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params: dict

Estimator parameters.

Returns
self: estimator instance

Estimator instance.
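The `<component>__<parameter>` convention can be illustrated with a plain scikit-learn Pipeline. The step names `scaler` and `tree` below are hypothetical, and the pipeline is only a stand-in for any nested estimator.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("tree", DecisionTreeRegressor(max_depth=2)),
])

# Update a parameter of the nested "tree" step; set_params returns the estimator.
returned = pipe.set_params(tree__max_depth=4)

# deep=True also lists the parameters of the contained estimators.
params = pipe.get_params(deep=True)
# deep=False lists only the pipeline's own parameters.
shallow = pipe.get_params(deep=False)

print(params["tree__max_depth"])
```

Because `set_params` returns the estimator itself, calls can be chained, which is the mechanism grid searches rely on to configure each candidate.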

transform(X)

Replaces the original variable values with the predictions of the tree. A decision tree returns one value per leaf, so its predictions take a finite number of values, that is, they are discrete.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input samples.

Returns
X_new: pandas dataframe of shape = [n_samples, n_features]

The dataframe with transformed variables.
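The sketch below shows, with scikit-learn alone, why replacing a variable with tree predictions produces a discrete variable: the number of distinct predictions cannot exceed the number of leaves. The data, the fixed `max_depth=2`, and the variable names are assumptions for illustration; this is not the library's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data, made up for this illustration.
rng = np.random.default_rng(0)
train = pd.DataFrame({"age": rng.uniform(18, 80, 200)})
y = 0.5 * train["age"] + rng.normal(0, 1, 200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(train[["age"]], y)

# Replace the continuous values with the tree predictions.
train_t = train.copy()
train_t["age"] = tree.predict(train[["age"]])

# Unseen data falls into the same leaves, so it maps to the same values.
new = pd.DataFrame({"age": [20.0, 35.5, 61.0, 79.9]})
new_t = new.copy()
new_t["age"] = tree.predict(new[["age"]])

train_values = set(train_t["age"])
print(len(train_values), tree.get_n_leaves())
```

Each distinct value corresponds to one interval of the original variable, so the tree's split points act as the bin boundaries.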
