ArbitraryNumberImputer
API Reference
class feature_engine.imputation.ArbitraryNumberImputer(arbitrary_number=999, variables=None, imputer_dict=None)
The ArbitraryNumberImputer() replaces missing data in each variable with an arbitrary value chosen by the user. It works only with numerical variables.
We can impute all variables with the same number, in which case we define the variables to impute in variables and the imputation number in arbitrary_number. Alternatively, we can pass a dictionary of variables and the numbers to use for their imputation.
For example, we can impute varA and varB with 99 like this:
transformer = ArbitraryNumberImputer(
    variables=['varA', 'varB'],
    arbitrary_number=99,
)
Xt = transformer.fit_transform(X)
Alternatively, we can impute varA with 1 and varB with 99 like this:
transformer = ArbitraryNumberImputer(
    imputer_dict={'varA': 1, 'varB': 99},
)
Xt = transformer.fit_transform(X)
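To make the effect of the two configurations concrete, here is a minimal pandas-only sketch on a toy dataframe. It is not the library's implementation; it only assumes that the imputation amounts to a per-column fill, and the data values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; varA and varB mirror the variable names used above.
X = pd.DataFrame({
    "varA": [1.0, np.nan, 3.0],
    "varB": [np.nan, 5.0, 6.0],
})

# Same number for every listed variable (variables + arbitrary_number).
same_number = X.fillna({"varA": 99, "varB": 99})

# A different number per variable (imputer_dict).
per_variable = X.fillna({"varA": 1, "varB": 99})

print(same_number)
print(per_variable)
```

In both cases the original X is left untouched, because fillna returns a new dataframe.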
- Parameters
- arbitrary_number: int or float, default=999
The number to be used to replace missing data.
- variables: list, default=None
The list of variables to be imputed. If None, the imputer will find and select all numerical variables. This parameter is used only if imputer_dict is None.
- imputer_dict: dict, default=None
The dictionary of variables and the arbitrary numbers for their imputation.
Attributes
imputer_dict_ :
Dictionary with the values to replace NAs in each variable.
Methods
fit:
This transformer does not learn parameters.
transform:
Impute missing data.
fit_transform:
Fit to the data, then transform it.
fit(X, y=None)
This method does not learn any parameters. It checks the dataframe and finds the numerical variables, or checks that the variables entered by the user are numerical.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The training dataset.
- y: None
y is not needed in this imputation. You can pass None or y.
- Returns
- self
- Raises
- TypeError
If the input is not a Pandas DataFrame
If any of the user provided variables are not numerical
- ValueError
If there are no numerical variables in the df or the df is empty
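The checks described for fit() can be sketched roughly as follows. The helper name check_numerical_variables and the error messages are hypothetical, chosen for illustration; the library's own validation code may differ.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_numerical_variables(X, variables=None):
    """Hypothetical helper mirroring the checks described for fit()."""
    if not isinstance(X, pd.DataFrame):
        raise TypeError("X must be a pandas DataFrame")
    if X.empty:
        raise ValueError("The dataframe is empty")
    if variables is None:
        # No variables given: select every numerical column.
        variables = list(X.select_dtypes(include="number").columns)
        if not variables:
            raise ValueError("No numerical variables found in the dataframe")
    else:
        # Variables given: all of them must be numerical.
        non_numeric = [v for v in variables if not is_numeric_dtype(X[v])]
        if non_numeric:
            raise TypeError(f"Variables are not numerical: {non_numeric}")
    return variables


df = pd.DataFrame({"age": [20.0, None], "city": ["Rome", None]})
selected = check_numerical_variables(df)
print(selected)
```

Passing variables=["city"] to this sketch raises a TypeError, matching the behaviour listed under Raises.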
transform(X)
Replace missing data with the arbitrary values stored in imputer_dict_.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The data to be transformed.
- Returns
- X: pandas dataframe of shape = [n_samples, n_features]
The dataframe without missing values in the selected variables.
- Return type
DataFrame
- Raises
- TypeError
If the input is not a Pandas DataFrame
- ValueError
If the dataframe is not the same size as the one used in fit()
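As a rough illustration of the transform step, the sketch below fills only the columns named in a dictionary and leaves the rest alone. The helper impute_with_dict and its n_features_in argument are hypothetical, and reading "size" as the number of columns is an assumption, not something the reference states.

```python
import numpy as np
import pandas as pd


def impute_with_dict(X, imputer_dict, n_features_in):
    """Hypothetical helper mirroring the described transform() behaviour."""
    if not isinstance(X, pd.DataFrame):
        raise TypeError("X must be a pandas DataFrame")
    # Assumption: "same size" means the same number of columns as in fit().
    if X.shape[1] != n_features_in:
        raise ValueError("X has a different number of columns than the data used in fit()")
    # fillna with a dict only touches the listed columns and returns a copy.
    return X.fillna(value=imputer_dict)


toy = pd.DataFrame({
    "a": [1.0, np.nan],
    "b": [np.nan, 2.0],
    "c": [np.nan, 3.0],
})
out = impute_with_dict(toy, {"a": -1, "b": 0}, n_features_in=3)
print(out)
```

Column c is not in the dictionary, so its missing value is kept, which is the sense in which only "the selected variables" end up without missing values.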
Example
The ArbitraryNumberImputer() replaces missing data with an arbitrary value chosen by the user. It works only with numerical variables. We can pass a list of variables to impute; otherwise the imputer automatically selects all numerical variables in the train set. To use a different arbitrary value for each variable, we can pass a dictionary that maps variables to their values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import ArbitraryNumberImputer
# Load dataset
data = pd.read_csv('houseprice.csv')
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0,
)
# set up the imputer
arbitrary_imputer = ArbitraryNumberImputer(arbitrary_number=-999, variables=['LotFrontage', 'MasVnrArea'])
# fit the imputer
arbitrary_imputer.fit(X_train)
# transform the data
train_t = arbitrary_imputer.transform(X_train)
test_t = arbitrary_imputer.transform(X_test)
# compare the distribution of LotFrontage before and after imputation
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
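A quick numeric check can complement the density plot. The sketch below uses a small made-up series standing in for a column such as LotFrontage (the values are not taken from houseprice.csv) and shows how filling with -999 shifts the mean, which is the kind of distortion the plot makes visible.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a column such as LotFrontage; values are made up.
original = pd.Series([60.0, np.nan, 80.0, np.nan, 70.0], name="LotFrontage")
imputed = original.fillna(-999)

n_missing_before = int(original.isna().sum())
n_flagged = int((imputed == -999).sum())

# mean() skips NaN, so the "before" mean uses observed values only.
mean_before = original.mean()
mean_after = imputed.mean()

print(n_missing_before, n_flagged)
print(mean_before, mean_after)
```

The count of -999 values equals the number of originally missing entries only if -999 never occurs as a genuine value in the variable, which is one reason to pick an arbitrary number well outside the observed range.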
