DropMissingData

API Reference

class feature_engine.imputation.DropMissingData(missing_only=True, variables=None)

DropMissingData deletes rows that contain missing values. It works for both numerical and categorical variables. DropMissingData can automatically select all the variables or, alternatively, only the variables that show missing data in the train set. Observations with NA in any of the selected variables are then dropped.

Users can also indicate the variables for which observations with NA should be removed.

Parameters
missing_only : bool, default=True

If True, observations with NA are dropped only for the variables that showed NA in the train set during fit. If False, observations with NA in any of the selected variables are dropped.

variables : list, default=None

The list of variables to check for NA. If None, the transformer selects all variables in the dataframe, which `missing_only=True` then narrows to those with missing data (see the note below).

**Note**
The transformer first selects all variables, or all user-entered
variables. If `missing_only=True`, it then re-selects from that group
only the variables that show missing data during fit, that is, in the train set.

Attributes

variables_:

List of variables for which the rows with NA will be deleted.

Methods

fit:

Learn the variables for which the rows with NA will be deleted

transform:

Remove observations with NA

fit_transform:

Fit to the data, then transform it.

return_na_data:

Return the rows of the dataframe that contain NA.

fit(X, y=None)

Learn the variables for which the rows with NA will be deleted.

Parameters
X : pandas dataframe of shape = [n_samples, n_features]

The training dataset.

y : pandas Series, default=None

y is not needed by this transformer. You can pass None or y.

Returns
self
Raises
TypeError

If the input is not a Pandas DataFrame

return_na_data(X)

Returns the subset of the dataframe which contains the rows with missing values. This method could be useful in production, in case we want to store the observations that will not be fed into the model.

Parameters
X : pandas dataframe of shape = [n_samples, n_features]

The dataset from which the rows containing NA should be returned.

Returns
X : pandas dataframe of shape = [obs_with_na, features]

The portion of the dataframe that contains only the rows with missing values.

rtype

DataFrame

Raises
TypeError

If the input is not a Pandas DataFrame

transform(X)

Remove rows with missing values.

Parameters
X : pandas dataframe of shape = [n_samples, n_features]

The dataframe to be transformed.

Returns
X_transformed : pandas dataframe

The complete case dataframe for the selected variables, of shape [n_samples - rows_with_na, n_features]

rtype

DataFrame

Example

DropMissingData() deletes rows with NA values. It works with numerical and categorical variables. The user can pass a list of variables for which to delete rows with NA. Alternatively, DropMissingData() will default to all variables. The transformer has the option to learn the variables with NA in the train set, and then remove observations with NA in only those variables.

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    from feature_engine.imputation import DropMissingData

    # Load dataset
    data = pd.read_csv('houseprice.csv')

    # Separate into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(['Id', 'SalePrice'], axis=1),
        data['SalePrice'],
        test_size=0.3,
        random_state=0,
    )

    # Set up the imputer
    missingdata_imputer = DropMissingData(variables=['LotFrontage', 'MasVnrArea'])

    # Fit the imputer
    missingdata_imputer.fit(X_train)

    # Transform the data
    train_t = missingdata_imputer.transform(X_train)
    test_t = missingdata_imputer.transform(X_test)

    # Number of NA before the transformation:
    X_train['LotFrontage'].isna().sum()

    189

    # Number of NA after the transformation:
    train_t['LotFrontage'].isna().sum()

    0

    # Number of rows before and after transformation
    print(X_train.shape)
    print(train_t.shape)

    (1022, 79)
    (829, 79)