The ArbitraryNumberImputer() replaces missing data with an arbitrary value chosen by the user. It works only with numerical variables. You can pass a list of variables to impute; otherwise the imputer automatically selects all numerical variables in the train set. To use a different arbitrary value for each variable, pass a dictionary that maps variables to their values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import feature_engine.missing_data_imputers as mdi

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

# set up the imputer
arbitrary_imputer = mdi.ArbitraryNumberImputer(
    arbitrary_number=-999,
    variables=['LotFrontage', 'MasVnrArea'])

# fit the imputer
arbitrary_imputer.fit(X_train)

# transform the data
train_t = arbitrary_imputer.transform(X_train)
test_t = arbitrary_imputer.transform(X_test)

# compare the distribution of LotFrontage before (default colour) and after (red) imputation
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
ArbitraryNumberImputer(arbitrary_number=999, variables=None, imputer_dict=None)
The ArbitraryNumberImputer() replaces missing data in each variable with an arbitrary value determined by the user.
arbitrary_number (int or float, default=999) – The number used to replace missing data.
variables (list, default=None) – The list of variables to be imputed. If None, the imputer finds and selects all numerical variables. This parameter is used only if imputer_dict is None.
imputer_dict (dict, default=None) – The dictionary of variables and their arbitrary numbers. If it is not None, it must be a dictionary whose values are all of integer or float type. If None, the variables parameter is used for imputation.
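The precedence between these parameters can be sketched in plain pandas. This is a minimal illustration, not the library's implementation: resolve_imputation_values and the toy dataframe are hypothetical names introduced here, and the sketch assumes only what the parameter descriptions state (a dictionary, when given, overrides the variable list, and a missing list means all numerical columns).

```python
import numpy as np
import pandas as pd


def resolve_imputation_values(df, arbitrary_number=999, variables=None, imputer_dict=None):
    """Hypothetical helper: map each variable to its replacement value."""
    if imputer_dict is not None:
        # A dictionary takes precedence; `variables` is ignored in this case.
        return dict(imputer_dict)
    if variables is None:
        # No list given: select every numerical column of the dataframe.
        variables = df.select_dtypes(include="number").columns.tolist()
    # Every selected variable shares the same arbitrary number.
    return {var: arbitrary_number for var in variables}


# Toy data (made up for illustration): two numerical columns, one categorical.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "MasVnrArea": [np.nan, 196.0, 0.0],
    "Street": ["Pave", None, "Grvl"],
})

all_numeric = resolve_imputation_values(toy)
listed = resolve_imputation_values(toy, arbitrary_number=-999, variables=["LotFrontage"])
per_variable = resolve_imputation_values(
    toy, imputer_dict={"LotFrontage": -1, "MasVnrArea": 0.5})
```

With no arguments, the categorical Street column is left out of the mapping; with a dictionary, each variable keeps its own value.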
fit(X, y=None)
Checks that the variables are numerical.
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. User can pass the entire dataframe, not just the variables to impute.
y (None) – y is not needed in this imputation. You can pass None or y.
After fitting, the imputer stores the dictionary containing the values that will be used to replace missing data in each variable.
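The numerical check performed at fit time could look roughly like the following. This is a sketch under assumptions: check_numerical is a hypothetical helper, and raising TypeError is a choice made here for illustration; the exception the library actually raises is not stated in this page.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_numerical(df, variables):
    """Hypothetical check: reject any selected variable that is not numerical."""
    non_numeric = [var for var in variables if not is_numeric_dtype(df[var])]
    if non_numeric:
        # Assumption: TypeError is used here only for illustration.
        raise TypeError(f"Non-numerical variables selected: {non_numeric}")
    return list(variables)


# Toy data (made up for illustration).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "MasVnrArea": [np.nan, 196.0, 0.0],
    "Street": ["Pave", None, "Grvl"],
})

# Numerical variables pass the check unchanged.
checked = check_numerical(toy, ["LotFrontage", "MasVnrArea"])

# Including the categorical column triggers the error.
try:
    check_numerical(toy, ["LotFrontage", "Street"])
    rejected = False
except TypeError:
    rejected = True
```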
transform(X)
Replaces missing data with the learned parameters.
X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.
Returns: X_transformed – The dataframe without missing values in the selected variables.
Return type: pandas dataframe of shape = [n_samples, n_features]
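The effect of the transformation can be reproduced on toy data with pandas alone. The sketch below uses DataFrame.fillna with a dictionary as a stand-in for the fitted imputer; the dataframe and the values mapping are made up for illustration and do not come from the house price dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "MasVnrArea": [np.nan, 196.0, 0.0],
    "Street": ["Pave", None, "Grvl"],
})

# Stand-in for the dictionary stored by the fitted imputer.
values = {"LotFrontage": -999, "MasVnrArea": -999}

# fillna with a dictionary fills only the listed columns and returns a new dataframe.
toy_t = toy.fillna(value=values)

# The selected variables no longer contain missing values...
missing_after = int(toy_t[list(values)].isnull().sum().sum())
# ...while columns outside the mapping are left as they were.
street_missing = int(toy_t["Street"].isnull().sum())
```

The output keeps the shape [n_samples, n_features] of the input, and the original dataframe is not modified.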