MeanMedianImputer

The MeanMedianImputer() replaces missing data with the mean or median of the variable. It works only with numerical variables. You can pass a list of variables to impute; otherwise, the imputer automatically selects all numerical variables in the train set. For more details, see the API Reference below.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import feature_engine.missing_data_imputers as mdi

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

# set up the imputer
median_imputer = mdi.MeanMedianImputer(imputation_method='median',
                                       variables=['LotFrontage', 'MasVnrArea'])
# fit the imputer
median_imputer.fit(X_train)

# transform the data
train_t = median_imputer.transform(X_train)
test_t = median_imputer.transform(X_test)

# compare the distribution of LotFrontage before and after imputation
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)  # original data, missing values ignored
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')  # after median imputation
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
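[Figure: kernel density estimates of LotFrontage in the train set before imputation and, in red, after median imputation (medianimputation.png)]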

API Reference

class feature_engine.missing_data_imputers.MeanMedianImputer(imputation_method='median', variables=None)

The MeanMedianImputer() transforms features by replacing missing data with the mean or median value of the variable.

The MeanMedianImputer() works only with numerical variables.

Users can pass a list of variables to be imputed as an argument. Alternatively, the MeanMedianImputer() will automatically find and select all numerical variables.

The imputer first calculates the mean / median values of the variables (fit).

The imputer then replaces the missing data with the estimated mean / median (transform).
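The two steps above can be sketched with plain pandas. This is only an illustration of the idea, not the library's implementation; the toy values are made up, and `fillna` stands in for whatever the imputer does internally.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration); NaN marks a missing value
train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 65.0]})

# "fit": learn the statistic from the training data only.
# pandas skips NaN by default when computing the median.
learned_median = train["LotFrontage"].median()

# "transform": fill missing values in any dataset with the learned value
train_filled = train.fillna({"LotFrontage": learned_median})
test_filled = test.fillna({"LotFrontage": learned_median})

print(learned_median)
print(test_filled)
```

Note that the test set is filled with the median learned from the train set, so no information from the test set leaks into the imputation.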

Parameters
  • imputation_method (str, default='median') – Desired method of imputation. Can take 'mean' or 'median'.

  • variables (list, default=None) – The list of variables to be imputed. If None, the imputer will select all numerical variables.
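The behaviour of the variables parameter could be approximated as follows. This is a sketch under stated assumptions: `find_numeric_variables` is a hypothetical helper written for this example, not part of the library, and it relies on pandas' `select_dtypes` to decide what counts as numerical.

```python
import numpy as np
import pandas as pd


def find_numeric_variables(df, variables=None):
    """Hypothetical helper: return the variables to impute.

    If variables is None, pick every numerical column;
    otherwise check that each requested column is numerical.
    """
    if variables is None:
        return list(df.select_dtypes(include="number").columns)
    non_numeric = [v for v in variables
                   if not pd.api.types.is_numeric_dtype(df[v])]
    if non_numeric:
        raise TypeError(f"Non-numerical variables: {non_numeric}")
    return list(variables)


# Toy dataframe (made up for illustration)
df = pd.DataFrame({
    "LotFrontage": [60.0, np.nan],
    "Street": ["Pave", "Grvl"],
    "MasVnrArea": [0.0, 196.0],
})

print(find_numeric_variables(df))                  # automatic selection
print(find_numeric_variables(df, ["MasVnrArea"]))  # user-supplied list
```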

fit(X, y=None)

Learns the mean or median values.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. You can pass the entire dataframe, not just the variables that need imputation.

  • y (pandas series or None, default=None) – y is not needed in this imputation. You can pass None or y.

imputer_dict_

The dictionary containing the mean / median values per variable. These values will be used by the imputer to replace missing data. The imputer_dict_ is created when fitting the imputer.

Type

dictionary
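To give a feel for what such a dictionary might contain, the sketch below builds one with plain pandas for each imputation method. The data are made up, and the names `median_dict` and `mean_dict` are illustrative only; the exact contents of imputer_dict_ in the library are not shown here.

```python
import numpy as np
import pandas as pd

variables = ["LotFrontage", "MasVnrArea"]

# Toy, right-skewed data (made up for illustration)
X_train = pd.DataFrame({
    "LotFrontage": [50.0, 60.0, 70.0, np.nan, 220.0],
    "MasVnrArea": [0.0, 0.0, np.nan, 100.0, 300.0],
})

# One value per variable, keyed by variable name
median_dict = X_train[variables].median().to_dict()
mean_dict = X_train[variables].mean().to_dict()

print(median_dict)
print(mean_dict)
```

With skewed variables like these, the mean is pulled towards the large values while the median is not, which is one reason to choose between the two methods deliberately.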

transform(X)

Replaces missing data with the learned parameters.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.

Returns

X_transformed – The dataframe without missing values in the selected variables.

Return type

pandas dataframe of shape = [n_samples, n_features]
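The shape of the result can be illustrated with a pandas stand-in for transform. This is a sketch, not the library's code: the `learned` dictionary is a hypothetical set of already-learned values, and `fillna` with a dictionary is assumed here as the replacement step.

```python
import numpy as np
import pandas as pd

# Toy dataframe (made up); GarageYrBlt is NOT among the selected variables
X = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "MasVnrArea": [np.nan, 100.0, 300.0],
    "GarageYrBlt": [2003.0, np.nan, 1976.0],
})

# Hypothetical learned values for the selected variables only
learned = {"LotFrontage": 70.0, "MasVnrArea": 200.0}

# fillna with a dict fills only the listed columns and returns a new dataframe
X_transformed = X.fillna(value=learned)

print(X_transformed.shape)
print(X_transformed.isnull().sum())
```

In this sketch the output keeps the same number of rows and columns as the input, the selected variables no longer contain missing values, and the variable that was not selected keeps its missing value.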