SmartCorrelatedSelection

In big datasets, more than 2 features can be correlated with each other. We could have 3, 4 or more features that are correlated. Which one should we keep, and which ones should we drop?

SmartCorrelatedSelection tries to answer this question.

From a group of correlated variables, SmartCorrelatedSelection will retain one feature according to one of the following criteria:

  • the highest variance

  • the highest cardinality

  • the least missing data

  • the highest importance (based on embedded selection methods)

It will then drop the rest of the group.

Features with a higher diversity of values (higher variance or cardinality) tend to be more predictive, whereas features with the least missing data tend to be more useful.

Procedure

SmartCorrelatedSelection will first find correlated feature groups using any correlation method supported by pandas.DataFrame.corr(), or a user-defined function that returns a value between -1 and 1.
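The grouping step can be illustrated with plain pandas. The following is a minimal sketch, not the library's internal implementation: the helper `find_correlated_groups` and the toy dataframe `X_demo` are hypothetical names introduced here, and the greedy grouping strategy is an assumption made for illustration.

```python
import numpy as np
import pandas as pd


def find_correlated_groups(X: pd.DataFrame, method="pearson", threshold=0.8):
    """Group features whose absolute pairwise correlation exceeds the threshold.

    Hypothetical helper for illustration; `method` is passed to DataFrame.corr().
    """
    corr = X.corr(method=method).abs()
    groups = []
    assigned = set()
    for feature in corr.columns:
        if feature in assigned:
            continue
        # features correlated with the current one that are not grouped yet
        partners = [
            other
            for other in corr.columns
            if other != feature
            and other not in assigned
            and corr.loc[feature, other] > threshold
        ]
        if partners:
            group = {feature, *partners}
            groups.append(group)
            assigned.update(group)
    return groups


# toy data: three features derived from the same signal, plus one independent feature
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X_demo = pd.DataFrame({
    "a": a,
    "a_noisy": a + 0.05 * rng.normal(size=500),
    "a_scaled": -2 * a + 0.05 * rng.normal(size=500),
    "b": b,
})

groups = find_correlated_groups(X_demo, threshold=0.8)
print(groups)  # a single group containing a, a_noisy and a_scaled
```

Note that `a_scaled` is negatively correlated with `a`; it still lands in the group because the comparison uses the absolute value of the coefficient.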

Then, from each group of correlated features, it will try to identify the best candidate based on the above criteria.

If the criterion is based on feature importance, SmartCorrelatedSelection will train a machine learning model using the correlated feature group, derive the feature importance from this model, and then keep the feature with the highest importance.

SmartCorrelatedSelection works with machine learning models that derive coefficients or feature importance values.
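To make the importance-based criterion concrete, here is a minimal sketch that uses scikit-learn directly rather than the transformer itself. It assumes a random forest as the estimator and a hypothetical correlated group made of the columns `clean` and `noisy`; the data and all variable names are invented for illustration, and the library's own implementation may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# toy data: two correlated features built from the same signal (illustrative only)
rng = np.random.default_rng(0)
signal = rng.normal(size=500)
X_toy = pd.DataFrame({
    "clean": signal + 0.1 * rng.normal(size=500),
    "noisy": signal + 0.5 * rng.normal(size=500),
})
y_toy = (signal > 0).astype(int)

# hypothetical correlated group, as it could be found in the previous step
group = ["clean", "noisy"]

# train a model on the correlated group only
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_toy[group], y_toy)

# rank the features in the group by the importance derived from the model
importances = pd.Series(model.feature_importances_, index=group)
keep = importances.idxmax()
drop = [feature for feature in group if feature != keep]

print(importances)
print("keep:", keep, "drop:", drop)
```

For a linear model, a comparable ranking could presumably be built from the absolute value of `coef_` instead of `feature_importances_`.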

If the criterion is based on variance or cardinality, SmartCorrelatedSelection will determine these attributes for each feature in the group and retain the one with the highest value.

If the criterion is based on missing data, SmartCorrelatedSelection will determine the number of NA in each feature from the correlated group and keep the one with the fewest NA.
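The three statistics-based criteria can be sketched with pandas as follows. This is an illustration under stated assumptions, not the library's code: `pick_feature` and the dataframe `toy` are hypothetical, and ties are resolved by taking the first feature in alphabetical order, which is a choice made only for this example.

```python
import numpy as np
import pandas as pd


def pick_feature(X: pd.DataFrame, group, selection_method="variance"):
    """Return the feature to keep from a correlated group (hypothetical helper)."""
    cols = sorted(group)
    if selection_method == "variance":
        scores = X[cols].var()
    elif selection_method == "cardinality":
        # number of distinct non-missing values per feature
        scores = X[cols].nunique()
    elif selection_method == "missing_values":
        # negate the NA count so that fewer missing values gives a higher score
        scores = -X[cols].isna().sum()
    else:
        raise ValueError(f"unknown selection_method: {selection_method}")
    return scores.idxmax()


toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [10.0, 20.0, 30.0, 40.0, 50.0, np.nan],
    "x3": [1.0, 1.0, 2.0, 2.0, np.nan, 3.0],
})
group = {"x1", "x2", "x3"}

print(pick_feature(toy, group, "variance"))        # x2
print(pick_feature(toy, group, "cardinality"))     # x1
print(pick_feature(toy, group, "missing_values"))  # x1
```

Here `x2` has the largest spread, while `x1` has the most distinct values and no missing data, so the retained feature depends on the criterion chosen.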

Example

Let’s see how to use SmartCorrelatedSelection in a toy example. Let’s create a toy dataframe with 4 correlated features:

import pandas as pd
from sklearn.datasets import make_classification
from feature_engine.selection import SmartCorrelatedSelection

# make dataframe with some correlated variables
def make_data():
    X, y = make_classification(n_samples=1000,
                               n_features=12,
                               n_redundant=4,
                               n_clusters_per_class=1,
                               weights=[0.50],
                               class_sep=2,
                               random_state=1)

    # transform arrays into pandas df and series
    colnames = ['var_'+str(i) for i in range(12)]
    X = pd.DataFrame(X, columns=colnames)
    return X

X = make_data()

Now, we set up SmartCorrelatedSelection to find feature groups whose (absolute) correlation coefficient is greater than 0.8. From these groups, we want to retain the feature with the highest variance:

# set up the selector
tr = SmartCorrelatedSelection(
    variables=None,
    method="pearson",
    threshold=0.8,
    missing_values="raise",
    selection_method="variance",
    estimator=None,
)

With fit(), the transformer finds the correlated variables and selects the one to keep from each group. With transform(), it drops the remaining correlated features from the dataset:

Xt = tr.fit_transform(X)

The correlated feature groups are stored in the transformer’s attributes:

tr.correlated_feature_sets_

Note that in the second group, 4 features are correlated among themselves.

[{'var_0', 'var_8'}, {'var_4', 'var_6', 'var_7', 'var_9'}]

In the following attribute we find the features that will be removed from the dataset:

tr.features_to_drop_
['var_0', 'var_4', 'var_6', 'var_9']

If we now go ahead and print the transformed data, we see that the correlated features have been removed.

print(Xt.head())
      var_1     var_2     var_3     var_5    var_10    var_11     var_8  \
0 -2.376400 -0.247208  1.210290  0.091527  2.070526 -1.989335  2.070483
1  1.969326 -0.126894  0.034598 -0.186802  1.184820 -1.309524  2.421477
2  1.499174  0.334123 -2.233844 -0.313881 -0.066448 -0.852703  2.263546
3  0.075341  1.627132  0.943132 -0.468041  0.713558  0.484649  2.792500
4  0.372213  0.338141  0.951526  0.729005  0.398790 -0.186530  2.186741

      var_7
0 -2.230170
1 -1.447490
2 -2.240741
3 -3.534861
4 -2.053965

More details

In this notebook, we show how to use SmartCorrelatedSelection with a different relation metric: