DropMissingData

API Reference

class feature_engine.imputation.DropMissingData(missing_only=True, variables=None)

DropMissingData() deletes rows that contain missing values. It provides functionality similar to pandas.DataFrame.dropna().

It works for both numerical and categorical variables. You can enter the list of variables for which missing values should be removed from the dataframe. Alternatively, the imputer will automatically select all variables in the dataframe.
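The difference between passing a list of variables and letting the transformer select all of them can be illustrated with plain pandas. This is only a sketch of the described behavior, not the library's implementation; the toy dataframe and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with numerical and categorical columns
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Lima", "Oslo", None, "Kyoto"],
    "income": [50.0, 60.0, 70.0, np.nan],
})

# Only rows with NA in the chosen variables are dropped
subset_rows = df.dropna(subset=["age", "city"])

# Rows with NA in any column are dropped
all_rows = df.dropna()

print(list(subset_rows.index))
print(list(all_rows.index))
```

The row with a missing income survives the first call because income is not among the checked variables, and is removed by the second.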

Note The transformer first selects all variables, or all variables entered by the user. If missing_only=True, it then re-selects from that group only the variables that show missing data during fit, that is, in the train set.
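The two-step selection in the note can be sketched with pandas alone. The code below is an assumption about how the described logic plays out, not the transformer's source; the train and test frames are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical train and test sets
train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, 2.0, 3.0],
    "c": ["x", None, "z"],
})
test = pd.DataFrame({
    "a": [np.nan, 2.0, 3.0],
    "b": [1.0, np.nan, 3.0],
    "c": ["x", "y", "z"],
})

# Step 1: start from all variables (what variables=None would select)
candidates = list(train.columns)

# Step 2 (missing_only=True): keep only variables with NA in the train set
learned = [col for col in candidates if train[col].isna().any()]
test_missing_only = test.dropna(subset=learned)

# missing_only=False: every candidate variable is checked
test_all = test.dropna(subset=candidates)

print(learned)
print(list(test_missing_only.index))
print(list(test_all.index))
```

Under missing_only=True, the test row with NA in "b" is kept, because "b" had no missing data in the train set.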

Parameters
missing_only: bool, default=True

If True, observations with NA will be dropped only for the variables that have missing data in the train set, during fit. If False, observations with NA will be dropped for all variables indicated by the user.

variables: list, default=None

The list of variables to check for missing values. If None, the imputer will find and select all variables in the dataframe.

Attributes

variables_:

List of variables for which the rows with NA will be deleted.

n_features_in_:

The number of features in the train set used in fit.

Methods

fit:

Learn the variables for which the rows with NA will be deleted.

transform:

Remove observations with NA.

fit_transform:

Fit to the data, then transform it.

return_na_data:

Returns the dataframe with the rows that contain NA.

fit(X, y=None)

Learn the variables for which the rows with NA will be deleted.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The training dataset.

y: pandas Series, default=None

y is not needed in this imputation. You can pass None or y.

Returns
self
Raises
TypeError

If the input is not a pandas DataFrame.

return_na_data(X)

Returns the subset of the dataframe which contains the rows with missing values. This method could be useful in production, in case we want to store the observations that will not be fed into the model.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The dataframe to be transformed.

Returns
X: pandas dataframe of shape = [obs_with_na, features]

The dataframe containing only the rows with missing values.

rtype

DataFrame

Raises
TypeError

If the input is not a pandas DataFrame.
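What return_na_data() returns is the complement of what transform() keeps. A pandas-only sketch of that split follows; the dataframe and the variables list are hypothetical, and the code does not call the library itself.

```python
import numpy as np
import pandas as pd

# Hypothetical data and selected variables
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, "w"],
})
variables = ["a", "b"]

# True for rows with NA in at least one selected variable
has_na = df[variables].isna().any(axis=1)

rows_with_na = df[has_na]    # what return_na_data() is described as returning
complete_rows = df[~has_na]  # what transform() is described as returning

print(list(rows_with_na.index))
print(list(complete_rows.index))
```

Every input row lands in exactly one of the two frames, which is what makes it possible to store the discarded observations separately in production.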

transform(X)

Remove rows with missing values.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The dataframe to be transformed.

Returns
X_transformed: pandas dataframe

The complete case dataframe for the selected variables, of shape [n_samples - rows_with_na, n_features].

rtype

DataFrame

Example

DropMissingData() deletes rows with missing values. It works with numerical and categorical variables. You can pass a list of variables to check for missing values, or the transformer will select and check all variables. The transformer has the option to learn the variables with missing data in the train set, and then remove observations with NA only in those variables.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from feature_engine.imputation import DropMissingData

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0,
)

# set up the imputer
missingdata_imputer = DropMissingData(variables=['LotFrontage', 'MasVnrArea'])

# fit the imputer
missingdata_imputer.fit(X_train)

# transform the data
train_t = missingdata_imputer.transform(X_train)
test_t = missingdata_imputer.transform(X_test)

# Number of NA before the transformation
print(X_train['LotFrontage'].isna().sum())
# 189

# Number of NA after the transformation
print(train_t['LotFrontage'].isna().sum())
# 0

# Number of rows before and after the transformation
print(X_train.shape)
print(train_t.shape)
# (1022, 79)
# (829, 79)
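In the example above, train_t ends up with fewer rows than y_train, so the target has to be realigned before fitting a model. The sketch below shows one way to do this with pandas, assuming the transformed dataframe keeps the original row index, as pandas' dropna does. The toy values and index labels are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical features and target sharing the same index
X = pd.DataFrame(
    {"LotFrontage": [65.0, np.nan, 80.0], "MasVnrArea": [196.0, 0.0, np.nan]},
    index=[10, 11, 12],
)
y = pd.Series([208500, 181500, 223500], index=X.index, name="SalePrice")

# Stand-in for the transformer: drop rows with NA in the selected variables
X_t = X.dropna(subset=["LotFrontage", "MasVnrArea"])

# Keep only the target values whose rows survived
y_t = y.loc[X_t.index]

print(X_t.shape, y_t.shape)
```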