- class feature_engine.imputation.AddMissingIndicator(missing_only=True, variables=None)
The AddMissingIndicator() adds binary variables that indicate whether data is missing. It adds one missing indicator for each variable selected by the user.
Binary variables are named with the original variable name plus ‘_na’.
The AddMissingIndicator() works for both numerical and categorical variables. You can pass a list with the variables for which the missing indicators should be added. Alternatively, the imputer will select and add missing indicators to all variables in the training set.
If missing_only=True, the imputer will add missing indicators only to those variables that show missing data during fit. These may be a subset of the variables you indicated.
- missing_only: bool, default=True
Indicates if missing indicators should be added to variables with missing data or to all variables.
True: indicators will be created only for those variables that showed missing data during fit.
False: indicators will be created for all variables.
- variables: list, default=None
The list of variables for which missing indicators will be added. If None, the imputer will find and select all variables.
- variables_: List of variables for which the missing indicators will be created.
- n_features_in_: The number of features in the train set used in fit.
- fit: Learn the variables for which the missing indicators will be created.
- transform: Add the missing indicators.
- fit_transform: Fit to the data, then transform it.
- fit(X, y=None)
Learn the variables for which the missing indicators will be created.
- X: pandas dataframe of shape = [n_samples, n_features]
The training dataset.
- y: pandas Series, default=None
y is not needed in this imputation. You can pass None or y.
Fitting learns the list of variables for which missing indicators will be added.
An error is raised if the input is not a pandas DataFrame.
- transform(X)
Add the binary missing indicators.
- X: pandas dataframe of shape = [n_samples, n_features]
The dataframe to be transformed.
- X_transformed: pandas dataframe of shape = [n_samples, n_features]
The dataframe containing the additional binary variables. Binary variables are named with the original variable name plus ‘_na’.
The AddMissingIndicator() adds a binary variable indicating if observations are missing (missing indicator). It adds a missing indicator for both categorical and numerical variables. A list of variables for which to add a missing indicator can be passed, or the imputer will automatically select all variables.
The missing_only parameter controls whether binary variables are added for all selected variables (missing_only=False) or only for those that show missing data in the train set (missing_only=True, the default).
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split

    from feature_engine.imputation import AddMissingIndicator

    # Load dataset
    data = pd.read_csv('houseprice.csv')

    # Separate into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(['Id', 'SalePrice'], axis=1),
        data['SalePrice'],
        test_size=0.3,
        random_state=0)

    # set up the imputer
    addBinary_imputer = AddMissingIndicator(
        variables=['Alley', 'MasVnrType', 'LotFrontage', 'MasVnrArea'])

    # fit the imputer
    addBinary_imputer.fit(X_train)

    # transform the data
    train_t = addBinary_imputer.transform(X_train)
    test_t = addBinary_imputer.transform(X_test)

    train_t[['Alley_na', 'MasVnrType_na', 'LotFrontage_na', 'MasVnrArea_na']].head()