DecisionTreeDiscretiser#

class feature_engine.discretisation.DecisionTreeDiscretiser(variables=None, cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=None)[source]#

The DecisionTreeDiscretiser() replaces numerical variables with discrete (i.e., finite) variables whose values are the predictions of a decision tree.

The method is inspired by the following article from the winners of the KDD 2009 competition: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf

The DecisionTreeDiscretiser() trains one decision tree per variable. It then replaces the values of each variable with the predictions of its tree.
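As a rough illustration of the idea (not the library's implementation), the sketch below fits a single Scikit-learn DecisionTreeRegressor on one variable and replaces the values with the tree's predictions. The data, the variable name x, and the fixed max_depth=2 are assumptions made for the example; the discretiser itself selects the depth with a grid search.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy data (assumed for illustration): the target jumps when x exceeds 50.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x": rng.uniform(1, 100, size=200)})
y = pd.Series(np.where(X["x"] > 50, 1.0, 0.0) + rng.normal(0, 0.1, size=200))

# Fit a shallow tree on the single variable "x".
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X[["x"]], y)

# Replace each value with the prediction of the leaf it falls into.
X_disc = X.copy()
X_disc["x"] = tree.predict(X[["x"]])

# A tree of depth 2 has at most 4 leaves, so at most 4 distinct values remain.
n_bins = X_disc["x"].nunique()
print(n_bins)
```

Because every observation in a leaf receives the same prediction, the number of distinct output values is bounded by the number of leaves.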

The DecisionTreeDiscretiser() works only with numerical variables. A list of variables to transform can be indicated. Alternatively, the discretiser will automatically select all numerical variables.

More details in the User Guide.

Parameters

variables: list, default=None

The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.

cv: int, cross-validation generator or an iterable, default=3

Determines the cross-validation splitting strategy. Possible inputs for cv are:

None, to use cross_validate’s default 5-fold cross validation

int, to specify the number of folds in a (Stratified)KFold,

CV splitter (https://scikit-learn.org/stable/glossary.html#term-CV-splitter),

An iterable yielding (train, test) splits as arrays of indices.

For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls. For more details check Scikit-learn’s cross_validate’s documentation.

scoring: str, default=’neg_mean_squared_error’

Metric used to select the best decision tree during the grid search. It must be a scoring name supported by sklearn.metrics. See Scikit-learn’s model evaluation documentation for the available options: https://scikit-learn.org/stable/modules/model_evaluation.html

param_grid: dictionary, default=None

The hyperparameters of the decision tree to test with a grid search. The param_grid can contain any of the permitted hyperparameters for Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). If None, the grid search optimises ‘max_depth’ over [1, 2, 3, 4].

regression: boolean, default=True

Indicates whether the discretiser should train a regression or a classification decision tree.

random_state: int, default=None

The random_state to initialise the training of the decision tree. It is one of the parameters of the Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility it is recommended to set the random_state to an integer.

Attributes

binner_dict_: Dictionary containing the fitted tree per variable.
scores_dict_: Dictionary with the score of the best decision tree per variable.
variables_: The group of variables that will be transformed.
feature_names_in_: List with the names of features seen during fit.
n_features_in_: The number of features in the train set used in fit.

See also

sklearn.tree.DecisionTreeClassifier
sklearn.tree.DecisionTreeRegressor

References

1: Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from feature_engine.discretisation import DecisionTreeDiscretiser
>>> np.random.seed(42)
>>> X = pd.DataFrame(dict(x= np.random.randint(1,100, 100)))
>>> y_reg = pd.Series(np.random.randn(100))
>>> dtd = DecisionTreeDiscretiser(random_state=42)
>>> dtd.fit(X, y_reg)
>>> dtd.transform(X)["x"].value_counts()
-0.090091    90
0.479454    10
Name: x, dtype: int64

You can also use the discretiser in classification problems by setting regression=False and adjusting the scoring metric.

>>> y_clf = pd.Series(np.random.randint(0,2,100))
>>> dtd = DecisionTreeDiscretiser(regression=False, scoring="f1", random_state=42)
>>> dtd.fit(X, y_clf)
>>> dtd.transform(X)["x"].value_counts()
0.480769    52
0.687500    48
Name: x, dtype: int64

Methods

fit:	Fit a decision tree per variable.
fit_transform:	Fit to data, then transform it.
get_feature_names_out:	Get output feature names for transformation.
get_params:	Get parameters for this estimator.
set_params:	Set the parameters of this estimator.
transform:	Replace continuous variable values by the predictions of the decision tree.

fit(X, y)[source]#

Fit one decision tree per variable to discretise, using cross-validation and a grid search over the hyperparameters.

Parameters

X: pandas dataframe of shape = [n_samples, n_features]: The training dataset. Can be the entire dataframe, not just the variables to be transformed.
y: pandas series: Target variable. Required to train the decision tree.

fit_transform(X, y=None, **fit_params)[source]#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X: array-like of shape (n_samples, n_features): Input samples.
y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None: Target values (None for unsupervised transformations).
**fit_params: dict: Additional fit parameters.

Returns

X_new: ndarray of shape (n_samples, n_features_new): Transformed array.

get_feature_names_out(input_features=None)[source]#

Get output feature names for transformation. In other words, returns the variable names of the transformed dataframe.

Parameters

input_features: array or list, default=None

This parameter exists only for compatibility with the Scikit-learn pipeline.

If None, then feature_names_in_ is used as feature names in.
If an array or list, then input_features must match feature_names_in_.

Returns

feature_names_out: list: Transformed feature names.

rtype: List[Union[str, int]]

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns

routing: MetadataRequest: A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params: dict: Estimator parameters.

Returns

self: estimator instance: Estimator instance.

transform(X)[source]#

Replaces the original variable values with the predictions of the tree. The decision tree predictions take a finite number of values, that is, they are discrete.

Parameters

X: pandas dataframe of shape = [n_samples, n_features]: The input samples.

Returns

X_new: pandas dataframe of shape = [n_samples, n_features]: The dataframe with transformed variables.

rtype: DataFrame
