MeanMedianImputer
API Reference
class feature_engine.imputation.MeanMedianImputer(imputation_method='median', variables=None)
The MeanMedianImputer() replaces missing data by the mean or median value of the variable. It works only with numerical variables.
We can pass a list of variables to be imputed. Alternatively, the MeanMedianImputer() will automatically select all numerical variables in the training set.
The imputer:
- first calculates the mean / median values of the variables (fit);
- then replaces the missing data with the estimated mean / median (transform).
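The two steps above can be sketched with plain pandas. This is a minimal illustration of the idea, not feature_engine's implementation; the dataframes, column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical and one categorical column.
train = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "city": ["a", "b", "c", None],
})
test = pd.DataFrame({"age": [np.nan, 50.0], "city": ["a", "b"]})

# "fit": learn the median of each selected variable from the training set only.
medians = train[["age"]].median().to_dict()

# "transform": fill missing values with the learned medians.
# Columns that are not keys of the dictionary are left untouched.
train_t = train.fillna(value=medians)
test_t = test.fillna(value=medians)

print(medians)
print(test_t)
```

Note that the value used to fill the test set is the one learned from the training set, which is the reason for keeping fit and transform as separate steps.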
- Parameters
- imputation_method : str, default='median'
Desired method of imputation. Can take ‘mean’ or ‘median’.
- variables : list, default=None
The list of variables to be imputed. If None, the imputer will select all numerical variables.
Attributes
imputer_dict_ :
Dictionary with the mean or median values per variable.
Methods
fit:
Learn the mean or median values.
transform:
Impute missing data.
fit_transform:
Fit to the data, then transform it.
fit(X, y=None)
Learn the mean or median values.
- Parameters
- X : pandas dataframe of shape = [n_samples, n_features]
The training dataset.
- y : pandas series or None, default=None
y is not needed in this imputation. You can pass None or y.
- Returns
- self
- Raises
- TypeError
If the input is not a Pandas DataFrame
If any of the user provided variables are not numerical
- ValueError
If there are no numerical variables in the dataframe or the dataframe is empty
transform(X)
Replace missing data with the learned parameters.
- Parameters
- X : pandas dataframe of shape = [n_samples, n_features]
The data to be transformed.
- Returns
- X : pandas dataframe of shape = [n_samples, n_features]
The dataframe without missing values in the selected variables.
- Raises
- TypeError
If the input is not a Pandas DataFrame
- ValueError
If the dataframe does not have the same number of columns as the dataframe used in fit()
Example
The MeanMedianImputer() replaces missing data with the mean or median of the variable. It works only with numerical variables. A list of variables to impute can be indicated, or the imputer will automatically select all numerical variables in the training set. For more details, check the API Reference above.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import MeanMedianImputer
# Load dataset
data = pd.read_csv('houseprice.csv')
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0,
)
# set up the imputer
median_imputer = MeanMedianImputer(imputation_method='median', variables=['LotFrontage', 'MasVnrArea'])
# fit the imputer
median_imputer.fit(X_train)
# transform the data
train_t = median_imputer.transform(X_train)
test_t = median_imputer.transform(X_test)
# compare the distribution of LotFrontage before (default colour) and after (red) imputation
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
