ArbitraryNumberImputer

The ArbitraryNumberImputer() replaces missing data with an arbitrary number chosen by the user. It works only with numerical variables. You can pass a list of variables to impute; if you do not, the imputer automatically selects all numerical variables in the train set.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import feature_engine.missing_data_imputers as mdi

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'],
    test_size=0.3, random_state=0)

# set up the imputer
arbitrary_imputer = mdi.ArbitraryNumberImputer(
    arbitrary_number=-999, variables=['LotFrontage', 'MasVnrArea'])

# fit the imputer
arbitrary_imputer.fit(X_train)

# transform the data
train_t = arbitrary_imputer.transform(X_train)
test_t = arbitrary_imputer.transform(X_test)

# compare the distribution of LotFrontage before and after imputation
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax, label='original')
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red', label='imputed')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
[Figure: density of LotFrontage before and after arbitrary number imputation (arbitraryvalueimputation.png)]
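The change in the density comes from the imputed observations piling up at the arbitrary value. A minimal sketch of that effect, using plain pandas `fillna` on a made-up series rather than the house price data or the imputer itself (the values below are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a numerical variable with missing data
lot_frontage = pd.Series(
    [65.0, 80.0, np.nan, 60.0, np.nan, 75.0], name="LotFrontage")

# Replace every missing value with the arbitrary number
imputed = lot_frontage.fillna(-999)

n_missing = int(lot_frontage.isna().sum())  # observations that were missing
n_at_value = int((imputed == -999).sum())   # observations now equal to -999

print(f"missing before: {n_missing}, equal to -999 after: {n_at_value}")
# pandas skips NaN when computing the mean, so the two means differ
print(f"mean before: {lot_frontage.mean():.1f}, mean after: {imputed.mean():.1f}")
```

Because the arbitrary number usually lies far from the observed values, summary statistics such as the mean move noticeably after imputation.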

API Reference

class feature_engine.missing_data_imputers.ArbitraryNumberImputer(arbitrary_number=999, variables=None)

The ArbitraryNumberImputer() replaces missing data in each variable with an arbitrary value determined by the user.

Parameters
  • arbitrary_number (int or float, default=999) – The number used to replace missing data.

  • variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all numerical variables.
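When variables is None, the imputer selects the numerical variables itself. One way to picture that selection in plain pandas is `select_dtypes`. This is a sketch under the assumption that "numerical" means a numeric dtype, not the library's actual code, and the toy dataframe is made up:

```python
import numpy as np
import pandas as pd

# Made-up dataframe with two numerical columns and one text column
X = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 60.0],
    "MasVnrArea": [196.0, 0.0, np.nan],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Keep only the columns with a numeric dtype
numerical_vars = X.select_dtypes(include="number").columns.tolist()
print(numerical_vars)
```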

fit(X, y=None)

Checks that the variables are numerical.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. You can pass the entire dataframe, not just the variables to impute.

  • y (None) – Not needed for this imputation. You can pass y or None.
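The check performed by fit() can be pictured with pandas' dtype helpers. The function `check_numerical` below is a hypothetical helper written for illustration and is not part of feature_engine; raising a TypeError for a non-numerical variable is likewise an assumption:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_numerical(X, variables):
    """Hypothetical helper: return `variables` if all are numeric columns of X."""
    non_numeric = [var for var in variables if not is_numeric_dtype(X[var])]
    if non_numeric:
        raise TypeError(f"Non-numerical variables: {non_numeric}")
    return variables


# Made-up dataframe: one float column (None becomes NaN) and one text column
X = pd.DataFrame({
    "LotFrontage": [65.0, None, 60.0],
    "Street": ["Pave", "Pave", "Grvl"],
})

check_numerical(X, ["LotFrontage"])  # numeric column, no error

try:
    check_numerical(X, ["Street"])   # text column, raises TypeError
except TypeError as err:
    print(err)
```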

transform(X)

Replaces missing data in the selected variables with the arbitrary number.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.

Returns

X_transformed – The dataframe without missing values in the selected variables.

Return type

pandas dataframe of shape = [n_samples, n_features]
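A rough equivalent of transform() in plain pandas: copy the dataframe, fill the selected variables, and leave the other columns untouched. `impute_arbitrary` is a hypothetical function used only for illustration, the dataframe is made up, and whether the library copies its input internally is an assumption here:

```python
import numpy as np
import pandas as pd


def impute_arbitrary(X, variables, arbitrary_number=999):
    """Hypothetical helper: fill NaN in `variables` with `arbitrary_number`."""
    X = X.copy()  # avoid modifying the caller's dataframe
    X[variables] = X[variables].fillna(arbitrary_number)
    return X


# Made-up dataframe with missing values in three numerical columns
X_new = pd.DataFrame({
    "LotFrontage": [np.nan, 70.0],
    "MasVnrArea": [np.nan, 120.0],
    "GarageYrBlt": [np.nan, 2003.0],
})

X_t = impute_arbitrary(
    X_new, ["LotFrontage", "MasVnrArea"], arbitrary_number=-999)

# Missing values remain only in the column that was not selected
print(X_t.isna().sum())
```

The returned dataframe has the same shape as the input; only the selected variables are guaranteed to be free of missing values.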